On 02/03/2012 01:46 PM, Tom Rosmond wrote:
> Recently the organization I work for bought a modest sized Linux cluster
> for running large atmospheric data assimilation systems. In my
> experience a glaring problem with systems of this kind is poor IO
> performance. Typically they have 2 types of network: 1) A high speed,
(We are biased, given that we design, build, and sell these types of high
performance systems.) Couldn't agree more: high performance in storage
systems is often mostly or completely overlooked, and when it isn't, it is
poorly designed and implemented. It's rare that we see good implementations.
> low latency, e.g. Infiniband, network dedicated to MPI communications,
> and, 2) A lower speed network, e.g. 1Gb or 10Gb ethernet, for IO. On
> clusters this second network is usually the basis for a global parallel
> file system (GPFS), through which nearly all IO traffic must pass. So
> the IO performance of applications such as ours is completely dependent
> on the speed of the GPFS, and therefore on the network hardware it uses.
Ouch. 10GbE should be pretty reasonable for high performance traffic, as
long as your servers have multiple ports that can be accessed in parallel,
and are capable of driving data at those rates. When people do focus on
the issues mentioned above, they usually solve the first, and their design
for handling the second is broken.
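For a sense of scale, here is a back-of-envelope sketch of what it takes on the server side to actually drive a 10GbE port. All figures (wire efficiency, per-disk streaming rate) are illustrative assumptions, not measurements of any particular hardware:

```python
# Back-of-envelope: can a storage server actually saturate its network links?
# All figures below are illustrative assumptions, not measurements.

LINK_GBPS = 10.0          # one 10GbE port, line rate in Gbit/s
WIRE_EFFICIENCY = 0.9     # assumed payload fraction after protocol overhead

def usable_mb_per_s(ports, link_gbps=LINK_GBPS, eff=WIRE_EFFICIENCY):
    """Usable payload bandwidth in MB/s across `ports` parallel links."""
    return ports * link_gbps * eff * 1000.0 / 8.0

DISK_MB_S = 120.0         # assumed sustained streaming rate of one disk

def disks_needed(ports):
    """Streaming disks needed to keep `ports` links busy (ceiling division)."""
    return int(-(-usable_mb_per_s(ports) // DISK_MB_S))
```

Under these assumptions one 10GbE port carries about 1125 MB/s of payload, so a server needs on the order of ten streaming disks behind it per port just to keep the link busy — which is why many storage servers never get close.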
> We have seen that a cluster with a GPFS based on a 1Gb network is
> painfully slow for our applications, and of course with a 10Gb network
> is much better. Therefore we are making the case to the IT staff that
> all our systems should have GPFS running on 10Gb networks. Some of them
> have a hard time accepting this, since they don't really understand the
> requirements of our applications.
Well, yes. I don't expect someone used to dealing with desktops, mail
servers, etc. to really grasp why you need a very high bandwidth, very
low latency network for messages or storage. If you saw their stacked
switch architectures, you might shudder to think that these get
replicated in cluster environments (they do, when IT has a hand in the
design).
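If it helps when making the case to the IT staff, the arithmetic is straightforward. A rough sketch — the dataset size and wire efficiency here are made-up assumptions, so plug in your own numbers:

```python
# Time to move one analysis cycle's I/O over each network.
# Dataset size and wire efficiency are illustrative assumptions.

def transfer_seconds(gbytes, link_gbps, efficiency=0.9):
    """Seconds to move `gbytes` GB over a `link_gbps` Gbit/s link."""
    return gbytes * 8.0 / (link_gbps * efficiency)

DATASET_GB = 100.0   # hypothetical per-cycle I/O volume

t_1g = transfer_seconds(DATASET_GB, 1.0)     # roughly 15 minutes
t_10g = transfer_seconds(DATASET_GB, 10.0)   # under two minutes
```

When the I/O time is a large fraction of the cycle's wall-clock budget, the network choice stops being an IT preference and becomes a throughput requirement.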
> With all of this, here is my MPI related question. I recently added an
> option to use MPI-IO to do the heavy IO lifting in our applications. I
> would like to know the relative importance of the dedicated MPI network
> vis-a-vis the GPFS network for typical MPI-IO collective reads and
> writes. I assume there must be some hand-off of data between the
> networks during the process, but how is it done, and are there any rules
> to help understand it? Any insights would be welcome.
We'd recommend that all storage be served over the fast networks (10GbE
and Infiniband); it's fairly painless to configure. That said, one of my
concerns would be the design of the storage servers themselves: I am
guessing they may not be able to even come close to saturating a 10GbE
connection, never mind an Infiniband connection.
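On the hand-off question itself: most MPI-IO implementations (ROMIO being the common one) perform collective reads and writes in two phases, usually called collective buffering. Data is first exchanged among ranks over the MPI network so that a small set of aggregator ranks end up holding large contiguous file regions; only those aggregators then talk to the file system over the storage network. A pure-Python sketch of the write path — no real MPI here, the rank exchange is simulated in-process and the domain partitioning is simplified:

```python
# Simplified two-phase collective write, in the style of ROMIO's
# collective buffering.  Phase 1 would run over the MPI fabric;
# phase 2 over the storage network.  Ranks are simulated as list
# entries; real code would call MPI_File_write_all.

def two_phase_write(rank_buffers, n_aggregators):
    """rank_buffers: per-rank byte strings destined for consecutive
    file regions.  Returns (file_bytes, aggregator_chunks)."""
    total = sum(len(b) for b in rank_buffers)
    chunk = -(-total // n_aggregators)      # each aggregator's file domain

    # Phase 1 (MPI fabric): each rank splits its buffer along aggregator
    # domain boundaries and "sends" the pieces; here the exchange is
    # simulated by appending into per-aggregator lists.
    pieces = [[] for _ in range(n_aggregators)]
    offset = 0
    for buf in rank_buffers:
        start, end = offset, offset + len(buf)
        agg = start // chunk
        while start < end:
            domain_end = min(end, (agg + 1) * chunk)
            pieces[agg].append(buf[start - offset:domain_end - offset])
            start = domain_end
            agg += 1
        offset = end
    aggregator_chunks = [b"".join(p) for p in pieces]

    # Phase 2 (storage network): only aggregators issue large,
    # contiguous writes to the file system.
    return b"".join(aggregator_chunks), aggregator_chunks
```

The practical consequence: the redistribution in phase 1 rides the fast MPI network, while only a handful of aggregator nodes (tunable in ROMIO via hints such as cb_nodes) actually load the GPFS network — so both networks matter, but in different phases.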
All this said, you might find an audience better able to help with the
operational aspects of this on the Beowulf list than here on the Open MPI
list. The Open MPI list is certainly a great place to discuss the MPI-IO
side of things, but the performance side of the system design (outside of
MPI) might be better taken to a different list.
Just a suggestion ...
> T. Rosmond
> P.S. I am running with Open-mpi 1.4.2.
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
web : http://scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615