Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2005-12-19 16:57:55


Carsten,

In the Open MPI source code directory there is a collective component
called tuned (ompi/mca/coll/tuned). This component is not enabled by
default right now, but it usually gives better performance than the
basic one. You should give it a try (go inside, remove the .ompi_ignore
file, and redo the autogen and configure steps).

I don't see how you deduce that adding barriers increases the
congestion. It increases the latency of the all-to-all, but to me
that makes sense. For each pair of messages that you send (and these
pairs are sent in parallel) you add a global synchronization on top
of it (depending on the algorithm used for the barrier, it can
hardly be pipelined with other communications). If you have a
hardware barrier it can help, but over TCP or any other point-to-point
network it won't.

Anyway, the MPI_Sendrecv calls in the algorithm you describe act as an
implicit barrier, since they all wait for each other at some point. What
happens if you make sure that each MPI_Sendrecv involves only 2 nodes at
any given moment (make [source:destination] a unique tuple)?
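
As a minimal sketch of one such uniquely-paired scheme (assuming the
number of processes is a power of two and the same buffer layout as the
fragment quoted below), an XOR-based pairwise exchange could look like
this:

#include <mpi.h>

/* Sketch: an all-to-all in which, at every step, each rank exchanges
 * with exactly one partner chosen by XOR, so every [source:destination]
 * pair is unique and only 2 nodes talk to each other at a time.
 * Assumes ncpu is a power of two; the buffer arithmetic mirrors the
 * quoted fragment (sendcount/recvcount floats per rank). */
static void alltoall_pairwise_xor(float *sendbuf, int sendcount,
                                  float *recvbuf, int recvcount,
                                  MPI_Comm comm)
{
    int cpuid, ncpu, i;
    MPI_Status status;

    MPI_Comm_rank(comm, &cpuid);
    MPI_Comm_size(comm, &ncpu);

    for (i = 0; i < ncpu; i++)
    {
        /* Symmetric pairing: if my partner is p = cpuid ^ i, then p's
         * partner is p ^ i == cpuid, so each exchange stays between
         * exactly two nodes (i == 0 is the local self-copy). */
        int partner = cpuid ^ i;

        MPI_Sendrecv(sendbuf + partner * sendcount, sendcount, MPI_FLOAT,
                     partner, 0,
                     recvbuf + partner * recvcount, recvcount, MPI_FLOAT,
                     partner, 0,
                     comm, &status);
    }
}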

   Thanks,
     george.

On Dec 19, 2005, at 7:26 AM, Carsten Kutzner wrote:

> Hello,
>
> I am desperately trying to get better all-to-all performance on Gbit
> Ethernet (flow control is enabled). I have been playing around with
> several all-to-all schemes and have been able to reduce congestion by
> communicating in an ordered fashion.
>
> E.g. the simplest scheme looks like
>
> for (i = 0; i < ncpu; i++)
> {
>     /* send to dest */
>     dest = (cpuid + i) % ncpu;
>     /* receive from source */
>     source = (ncpu + cpuid - i) % ncpu;
>
>     MPI_Sendrecv(sendbuf + dest*sendcount,   sendcount, sendtype,
>                  dest,   0,
>                  recvbuf + source*recvcount, recvcount, recvtype,
>                  source, 0,
>                  comm, &status);
> }
>
> For sendcount=32768 and sendtype=float (yields 131072 bytes) the time
> such an all-to-all takes is (average over 100 runs, std deviation in ()):
>
> SENDRECV ALLTOALL on 16 PROCS
> 32768 floats took 0.036783 (0.008798) seconds. Min: 0.034175  max: 0.123684
> SENDRECV ALLTOALL on 32 PROCS
> 32768 floats took 0.082687 (0.035920) seconds. Min: 0.071915  max: 0.285299
>
> For comparison:
> MPI_Alltoall on 16 PROCS
> 32768 floats took 0.057936 (0.073605) seconds. Min: 0.027218  max: 0.275988
> MPI_Alltoall on 32 PROCS
> 32768 floats took 0.137835 (0.100580) seconds. Min: 0.055607  max: 0.412144
>
> The sendrecv all-to-all performs better for these message sizes, but on
> 32 CPUs (on 32 nodes) there is still congestion. When I try to separate
> the communication phases by putting an MPI_Barrier(MPI_COMM_WORLD) after
> the sendrecv, this makes the problem of congestion even worse:
>
> SENDRECV ALLTOALL on 32 PROCS, with Barrier:
> 32768 floats took 0.179162 (0.136885) seconds. Min: 0.091028  max: 0.729049
>
> How can a barrier lead to more congestion???
>
> Thanks in advance for helpful comments,
> Carsten
>
>
> ---------------------------------------------------
> Dr. Carsten Kutzner
> Max Planck Institute for Biophysical Chemistry
> Theoretical and Computational Biophysics Department
> Am Fassberg 11
> 37077 Goettingen, Germany
> Tel. +49-551-2012313, Fax: +49-551-2012302
> eMail ckutzne_at_[hidden]
> http://www.gwdg.de/~ckutzne
>

"Half of what I say is meaningless; but I say it so that the other
half may reach you"
                                   Kahlil Gibran