Open MPI Development Mailing List Archives

From: Carsten Kutzner (ckutzne_at_[hidden])
Date: 2005-12-02 06:05:05


I initiated this thread on the LAM mailing list, but as Brian Barrett
suggested, I am now reporting my results here. Hope this is of interest!

My motivation is to get good performance from the MPI_Alltoall routine on
our Gigabit Ethernet clusters. Most users here run the GROMACS molecular
dynamics code, and it turned out that the MPI_Alltoall routine is one
of the main scaling bottlenecks (on Ethernet), at least when flow control
is not enabled. In that case the code basically does not scale beyond two
computers.

I wrote a test program that performs MPI_Alltoall communication for
varying message sizes (see attachment) to find out where congestion
occurs. I originally ran it with LAM 7.1.1. The plots of the results show
the transmission rate per MPI process as a function of the message size.
Left: 1-CPU nodes, right: 2-CPU nodes; top: without flow control, middle:
with flow control. (The horizontal broken line indicates the Ethernet
maximum throughput.)

Because even with flow control you get congestion for 16+ nodes, I tried
out ordered communication schemes, which should in principle avoid
congestion entirely. The result for a simple Sendrecv-based scheme for
1-CPU nodes is shown in the lower left plot, the result for a more complex
Isend/Irecv scheme for multi-CPU nodes in the lower right plot. Hardware
flow control was disabled for the ordered communication schemes. The nice
thing is that there is no congestion any more, but due to the barriers in
the code, the transmission rates you can reach with the ordered
all-to-alls are lower than with MPI_Alltoall when there is no congestion.
When MPI_Alltoall shows congestion, the ordered all-to-all clearly wins.
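
A minimal sketch of what an ordered schedule for 1-CPU nodes can look like.
The attached test program is not reproduced here, so the pairing rule below
(in phase p, rank r sends to (r + p) mod n and receives from (r - p) mod n)
is an assumption: it is one common way to order an all-to-all, not
necessarily the one used for the plots. The function names are hypothetical,
and the exchange is simulated in memory instead of calling MPI.

```python
# Sketch of a phase-ordered all-to-all schedule (assumed pairing rule,
# not the attached test program). In every phase each rank has exactly
# one outgoing and one incoming message, so no link is oversubscribed.

def ordered_schedule(n):
    """Return a list of phases; each phase maps rank -> (send_to, recv_from)."""
    phases = []
    for p in range(1, n):
        phase = {r: ((r + p) % n, (r - p) % n) for r in range(n)}
        phases.append(phase)
    return phases


def simulate_alltoall(send_buffers):
    """Run the schedule on in-memory buffers.

    send_buffers[r][d] is the block rank r wants to deliver to rank d.
    Returns recv with recv[d][r] == send_buffers[r][d].
    """
    n = len(send_buffers)
    recv = [[None] * n for _ in range(n)]
    for r in range(n):
        recv[r][r] = send_buffers[r][r]  # local block, no network transfer
    for phase in ordered_schedule(n):
        # In an MPI code each rank would issue one MPI_Sendrecv here,
        # followed by a barrier before the next phase starts.
        for r, (dest, _src) in phase.items():
            recv[dest][r] = send_buffers[r][dest]
    return recv


if __name__ == "__main__":
    n = 4
    for p, phase in enumerate(ordered_schedule(n), start=1):
        print("phase", p, {r: pair[0] for r, pair in phase.items()})
```

With n ranks this takes n - 1 phases, and the barrier after each phase is
where the extra cost for small messages comes from.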

As Brian suggested, I repeated my tests with OpenMPI 1.0. The OpenMPI
MPI_Alltoall shows less congestion than the LAM MPI_Alltoall, both with
and without flow control. Unfortunately, the ordered routines perform
even worse than in the LAM case.

Here are some numbers for the transmission rate T per CPU:

Table 1. Message size = 2 MB, 4 CPUs ("limit for large messages"):

CPUs/node   LAM                           OpenMPI
            MPI_Alltoall  own all-to-all  MPI_Alltoall  own all-to-all
            (flow contr)  (no flow c.)    (flow contr)  (no flow c.)
    1       68.2 MB/s     69.6 MB/s       64.0 MB/s     64.0 MB/s
    2       48.9 MB/s     45.3 MB/s       48.7 MB/s     37.4 MB/s

Table 2. Message size = 1024 bytes, 32 CPUs. (Note: here the MPI_Alltoall
does not show congestion when flow control is enabled):

CPUs/node   LAM                           OpenMPI
            MPI_Alltoall  own all-to-all  MPI_Alltoall  own all-to-all
            (flow contr)  (no flow c.)    (flow contr)  (no flow c.)
    1       20.9 MB/s     2.8 MB/s        18.2 MB/s     1.3 MB/s
    2       14.3 MB/s     5.0 MB/s        14.3 MB/s     2.7 MB/s

While for large messages the transmission rates of the MPI and the
ordered all-to-alls become comparable, for smaller message sizes the
ordered routines perform worse, due to the barriers between the
communication phases. However, I do not understand why the transmission
rate of the ordered all-to-alls depends so heavily on the number of
CPUs in the OpenMPI case, but not in the LAM case. Could the barrier
synchronization times in OpenMPI be longer than in LAM?

Any ideas on how to raise the throughput without risking congestion are
welcome.


Dr. Carsten Kutzner
Max Planck Institute for Biophysical Chemistry
Theoretical and Computational Biophysics Department
Am Fassberg 11
37077 Goettingen, Germany
Tel. +49-551-2012313, Fax: +49-551-2012302
eMail ckutzne_at_[hidden]

---------- Forwarded message ----------
Date: Wed, 2 Nov 2005 08:23:15 -0500
From: Brian Barrett <brbarret_at_[hidden]>
Reply-To: General LAM/MPI mailing list <lam_at_[hidden]>
To: General LAM/MPI mailing list <lam_at_[hidden]>
Subject: Re: LAM: MPI_Alltoall performance and congestion

On Nov 2, 2005, at 7:44 AM, Carsten Kutzner wrote:

> Will a new all-to-all routine be implemented in a future version
> of LAM / OpenMPI? I am willing to contribute my code as well
> if there is interest.

We will probably not be doing any more work on LAM/MPI's collective
routines. While clearly not optimal for all cases (as you have
experienced), they do appear to be correct. At this point, we're
hesitant to do anything to regress from a correctness standpoint.

However, we are actively working on improving collective performance
in Open MPI. Our collective setup in Open MPI is a bit different
than the one in LAM, and some of the algorithms are already quite
different. The FT-MPI team from University of Tennessee is also
working on some new routines that should give better performance in a
wider range of scenarios. We would be happy to have contributions
that help improve performance in certain situations - if nothing
else, it gives us a good reference point for our work. If you are
interested, I would highly recommend trying out one of the Open MPI
release candidates, subscribing to the Open MPI developer's mailing
list, and letting us know. The Open MPI web page is, of course,



   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day:

  • APPLICATION/PostScript attachment: