Open MPI User's Mailing List Archives

From: George Bosilca (bosilca_at_[hidden])
Date: 2005-12-20 17:46:38


On Dec 20, 2005, at 3:19 AM, Carsten Kutzner wrote:

>> I don't see how you deduce that adding barriers increases the
>> congestion? It increases the latency for the all-to-all, but for me
>
> When I do an all-to-all a lot of times, I see that the time for a single
> all-to-all varies a lot. My time measurement:
>
> do 100 times
> {
> MPI_Barrier
> MPI_Wtime
> ALLTOALL
> MPI_Barrier
> MPI_Wtime
> }

This way of timing collective operations is not considered the best
approach. Even for point-to-point communications, timing them like
that will show a huge standard deviation: way too many things are
involved in any communication, and they usually have a big effect on
the measured duration. For collectives the effect of this approach on
the standard deviation is even more drastic. A better way is to split
the loop into two loops:

   do 10 times
   {
     MPI_Barrier
     start <- MPI_Wtime
     do 10 times
     {
       ALLTOALL
     }
     end <- MPI_Wtime
     avg_time <- (end - start) / 10     (average time per all-to-all)
     MPI_Barrier
   }
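
In C, this could look roughly like the sketch below; the message size,
loop counts and buffer handling are placeholder values for illustration,
not taken from your benchmark:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define OUTER 10
#define INNER 10

int main(int argc, char **argv)
{
    int rank, nprocs, i, j;
    double start, end, times[OUTER];
    float *sendbuf, *recvbuf;
    const int count = 1024;     /* floats sent to each process; example value only */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    sendbuf = (float *)calloc((size_t)count * nprocs, sizeof(float));
    recvbuf = (float *)calloc((size_t)count * nprocs, sizeof(float));

    for (i = 0; i < OUTER; i++) {
        MPI_Barrier(MPI_COMM_WORLD);          /* rough sync before the timed block */
        start = MPI_Wtime();
        for (j = 0; j < INNER; j++)
            MPI_Alltoall(sendbuf, count, MPI_FLOAT,
                         recvbuf, count, MPI_FLOAT, MPI_COMM_WORLD);
        end = MPI_Wtime();
        times[i] = (end - start) / INNER;     /* average per all-to-all */
        MPI_Barrier(MPI_COMM_WORLD);
    }

    if (rank == 0)
        for (i = 0; i < OUTER; i++)
            printf("run %d: %f s per all-to-all\n", i, times[i]);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}

Each entry of times[] is then one reasonably clean sample of the average
all-to-all duration.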

You will get results that make more sense. There is another problem
with your code. If we look at how the MPI standard defines
MPI_Barrier, we see that the only requirement is that all nodes
belonging to the same communicator reach the barrier. It does not
mean they leave the barrier at the same time! That depends on how the
barrier is implemented. If it uses a linear approach (node 0 gets a
message from everybody else and then sends a message to everybody
else), it is clear that node 0 is the most likely to get out of the
barrier last. Therefore, when it reaches the next ALLTOALL, the
messages will already be there, as all the other nodes are already in
the all-to-all. Now, since node 0 reaches the all-to-all later,
imagine the effect this has on the communications between the other
nodes. If it is late enough, there will be congestion, as all the
others will be waiting for a sendrecv with node 0.
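
To make this concrete, a linear barrier could look roughly like the
sketch below. This is only an illustration of the shape of such an
algorithm, not the actual Open MPI implementation:

#include <mpi.h>

/* Illustrative linear barrier: rank 0 collects one message from every
 * other rank, then sends one back to release them. */
void linear_barrier(MPI_Comm comm)
{
    int rank, size, i, dummy = 0;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        for (i = 1; i < size; i++)     /* wait until everybody has arrived */
            MPI_Recv(&dummy, 1, MPI_INT, i, 0, comm, MPI_STATUS_IGNORE);
        for (i = 1; i < size; i++)     /* release the other ranks */
            MPI_Send(&dummy, 1, MPI_INT, i, 0, comm);
    } else {
        MPI_Send(&dummy, 1, MPI_INT, 0, 0, comm);
        MPI_Recv(&dummy, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE);
    }
}

By the time rank 0 has finished the second loop and left the barrier,
every other rank may already be inside the next all-to-all.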

There are other approaches to performance measurement, but they are
more complex; the one described above gives correct results with a
fairly simple algorithm. What people usually do when measuring
performance is, after filling the array with the individual results
and before computing the mean time, remove the best and the worst
results (the two extrema), as these can be considered anomalies. If
there are several "worst" results they will still show up in the
standard deviation, since you remove only one.
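
A small sketch of that trimming step (the array name and length are
placeholders; it assumes more than two samples):

#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Mean over the samples with the single best and single worst removed. */
double trimmed_mean(double *times, int n)
{
    double sum = 0.0;
    int i;

    qsort(times, n, sizeof(double), cmp_double);
    for (i = 1; i < n - 1; i++)    /* skip times[0] and times[n-1] */
        sum += times[i];
    return sum / (n - 2);
}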

     george.

>
> For the ring-sendrecv all-to-all I get something like
> ...
> sending 131072 bytes to 32 processes took ... 0.06433 seconds
> sending 131072 bytes to 32 processes took ... 0.06866 seconds
> sending 131072 bytes to 32 processes took ... 0.06233 seconds
> sending 131072 bytes to 32 processes took ... 0.26683 seconds (*)
> sending 131072 bytes to 32 processes took ... 0.06353 seconds
> sending 131072 bytes to 32 processes took ... 0.06470 seconds
> sending 131072 bytes to 32 processes took ... 0.06483 seconds
> Summary (100-run average, timer resolution 0.000001):
> 32768 floats took 0.068903 (0.028432) seconds. Min: 0.061708 max: 0.266832
>
> The typical time my all-to-all takes is around 0.065 seconds, while
> sometimes (*) it takes 0.2+ seconds more. This I interpret as
> congestion.

It can be congestion ...

>
> When I add a barrier after the MPI_Sendrecv inside the alltoall, I get
> many more of these congestion events:
> ...
> sending 131072 bytes to 32 processes took ... 0.11023 seconds
> sending 131072 bytes to 32 processes took ... 0.48874 seconds
> sending 131072 bytes to 32 processes took ... 0.27856 seconds
> sending 131072 bytes to 32 processes took ... 0.27711 seconds
> sending 131072 bytes to 32 processes took ... 0.31615 seconds
> sending 131072 bytes to 32 processes took ... 0.07439 seconds
> sending 131072 bytes to 32 processes took ... 0.07440 seconds
> sending 131072 bytes to 32 processes took ... 0.07490 seconds
> sending 131072 bytes to 32 processes took ... 0.27524 seconds
> sending 131072 bytes to 32 processes took ... 0.07464 seconds
> Summary (100-run average, timer resolution 0.000001):
> 32768 floats took 0.250027 (0.158686) seconds. Min: 0.072322 max: 0.970822
>
> Indeed, the all-to-all time has increased from 0.065 to 0.075 seconds
> because of the barrier, but the most severe problem is the congestion,
> which now happens nearly every step.
>
>> Anyway, the algorithm you describe with the MPI_Sendrecv acts as an
>> implicit barrier, as they all wait for each other at some point. What
>> happens if you make sure that each MPI_Sendrecv acts only between 2
>> nodes at each moment (make [source:destination] a unique tuple)?
>
> I have actually already tried this, but I get worse timings compared to
> the ring pattern, which I don't understand. I now choose
>
> /* send to dest */
> dest = m[cpuid][i];
> /* receive from source */
> source = dest;
>
> with a matrix m chosen such that each processor pair communicates in
> exactly one phase. I get
>
> Without barrier:
> sending 131072 bytes to 32 processes took ... 0.07872 seconds
> sending 131072 bytes to 32 processes took ... 0.07667 seconds
> sending 131072 bytes to 32 processes took ... 0.07637 seconds
> sending 131072 bytes to 32 processes took ... 0.28047 seconds
> sending 131072 bytes to 32 processes took ... 0.28580 seconds
> sending 131072 bytes to 32 processes took ... 0.28156 seconds
> sending 131072 bytes to 32 processes took ... 0.28533 seconds
> sending 131072 bytes to 32 processes took ... 0.07763 seconds
> sending 131072 bytes to 32 processes took ... 0.27871 seconds
> sending 131072 bytes to 32 processes took ... 0.07749 seconds
> Summary (100-run average, timer resolution 0.000001):
> 32768 floats took 0.186031 (0.140984) seconds. Min: 0.075035 max: 0.576157
>
> With barrier:
> sending 131072 bytes to 32 processes took ... 0.08342 seconds
> sending 131072 bytes to 32 processes took ... 0.08432 seconds
> sending 131072 bytes to 32 processes took ... 0.08378 seconds
> sending 131072 bytes to 32 processes took ... 0.08412 seconds
> sending 131072 bytes to 32 processes took ... 0.08312 seconds
> sending 131072 bytes to 32 processes took ... 0.08365 seconds
> sending 131072 bytes to 32 processes took ... 0.08332 seconds
> sending 131072 bytes to 32 processes took ... 0.08376 seconds
> sending 131072 bytes to 32 processes took ... 0.08367 seconds
> sending 131072 bytes to 32 processes took ... 0.32773 seconds
> Summary (100-run average, timer resolution 0.000001):
> 32768 floats took 0.107121 (0.066466) seconds. Min: 0.082758 max: 0.357322
>
> In the case of paired communication the barrier improves things. Let me
> stress that both paired and ring communication show no congestion for up
> to 16 nodes. The problem arises in the 32 CPU case. It should not be due
> to the switch, since it has 48 ports and a 96 Gbit/s backplane.
>
> Does all this mean the congestion problem cannot be solved for
> Gbit Ethernet?
>
> Carsten
>
>
> ---------------------------------------------------
> Dr. Carsten Kutzner
> Max Planck Institute for Biophysical Chemistry
> Theoretical and Computational Biophysics Department
> Am Fassberg 11
> 37077 Goettingen, Germany
> Tel. +49-551-2012313, Fax: +49-551-2012302
> eMail ckutzne_at_[hidden]
> http://www.gwdg.de/~ckutzne

"Half of what I say is meaningless; but I say it so that the other
half may reach you"
                                   Kahlil Gibran