
Open MPI User's Mailing List Archives


From: Carsten Kutzner (ckutzne_at_[hidden])
Date: 2005-12-23 09:19:28


On Tue, 20 Dec 2005, George Bosilca wrote:

> On Dec 20, 2005, at 3:19 AM, Carsten Kutzner wrote:
>
> >> I don't see how you deduce that adding barriers increases the
> >> congestion? It increases the latency for the all-to-all but for me
> >
> > When I do an all-to-all a lot of times, I see that the time for a
> > single
> > all-to-all varies a lot. My time measurement:
> >
> > do 100 times
> > {
> > MPI_Barrier
> > MPI_Wtime
> > ALLTOALL
> > MPI_Barrier
> > MPI_Wtime
> > }
>
> This way of computing the time for collective operations is not
> considered the best approach. Even for p2p communications, if you
> time them like that, you will find a huge standard deviation. Way too
> many things are involved in any communication, and they usually have
> a big effect on the duration. For collectives the effect of this
> approach on the standard deviation is even more drastic. A better way
> is to split the loop into two loops:
>
> do 10 times
> {
> MPI_Barrier
> start <- MPI_Wtime
> do 10 times
> {
> ALLTOALL
> }
> end <- MPI_Wtime
> total_time = (end - start) / 10
> MPI_Barrier
> }
>
> You will get results that make more sense. There is another problem

Hi George,

Thanks for pointing out better ways to measure MPI performance. I get
slightly faster timings this way; clearly, 10 alltoalls in a row can
execute faster than 10 barrier-separated alltoalls (even without
counting the barriers).

On the other hand, this smoothing actually hides the real problem: I have
a code that, besides doing calculations, executes two all-to-alls every
time step. These alltoalls normally execute in around 0.065 seconds, but
sometimes they need around 0.25 seconds (always the same data volume).
Since the whole time step is only around 0.5 seconds long (or much less
on a large number of CPUs), I no longer gain anything from running in
parallel if the alltoall calls are delayed for some reason.
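
To make the smoothing effect concrete, here is a minimal Python sketch of
the arithmetic. The sample values are made up for illustration (nine calls
at 0.065 s and one at 0.25 s, the two durations mentioned above); they are
not measurements, and batched_means is a hypothetical helper name.

```python
from statistics import mean


def batched_means(samples, batch_size):
    """Average consecutive groups of `batch_size` per-call timings.

    This mimics computing (end - start) / batch_size around an inner
    loop of `batch_size` alltoall calls.
    """
    return [
        mean(samples[i:i + batch_size])
        for i in range(0, len(samples) - batch_size + 1, batch_size)
    ]


# Illustrative, made-up per-call durations in seconds:
# nine fast calls and a single delayed one.
samples = [0.065] * 9 + [0.25]

if __name__ == "__main__":
    print("slowest single call:", max(samples))
    print("batched means:", batched_means(samples, 10))
```

With these numbers the batched mean comes out near 0.0835 s, so the single
0.25 s call is no longer visible in the averaged result.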

Because of this, I initially decided to measure the duration of the
alltoall by putting it between barriers and leaving out the rest of my code.
What I then get is a bimodal distribution: one part of the time values clusters
around e.g. 0.065 seconds, while the rest cluster around 0.25 seconds.
If the typical alltoall executes in 0.065 seconds, why don't (nearly) all of them?
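
One way to separate the two clusters in such a list of per-call timings is
to cut the sorted values at the largest gap. The following Python sketch
assumes exactly two well-separated modes; split_at_largest_gap is a
hypothetical helper, and the durations are made-up values near the two
timings quoted above, not measured data.

```python
def split_at_largest_gap(durations):
    """Split timings into (fast, slow) at the largest gap between
    neighbouring values in sorted order.

    Assumes the data really has two well-separated clusters.
    """
    ordered = sorted(durations)
    if len(ordered) < 2:
        raise ValueError("need at least two timings to split")
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1
    return ordered[:cut], ordered[cut:]


if __name__ == "__main__":
    # Illustrative, made-up durations in seconds.
    timings = [0.064, 0.25, 0.066, 0.065, 0.26, 0.063]
    fast, slow = split_at_largest_gap(timings)
    print("fast calls:", len(fast), "mean:", sum(fast) / len(fast))
    print("slow calls:", len(slow), "mean:", sum(slow) / len(slow))
```

Counting how many calls end up in each group, and at which iterations the
slow ones occur, is more informative here than a single mean and standard
deviation.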

If I look at MPE logfiles of my ring-Sendrecv alltoall (see attachment,
x=time, y=processor, yellow=MPI_Barrier, green=MPI_Sendrecv,
arrows=messages), then most of the Sendrecvs are fast, while only a few
individual ones are delayed by 0.2 seconds (congestion?). There are
more delayed Sendrecvs when there is a barrier between them.
There must be a way to eliminate these delays.

> with your code. If we look at how the MPI standard defines
> MPI_Barrier, we can see that the only requirement is that all nodes
> belonging to the same communicator reach the barrier. It does not
> mean they leave the barrier at the same time! It depends on how the
> barrier is implemented. If it uses a linear approach (node 0 gets a
> message from everybody else and then sends a message to everybody
> else), it is clear that node 0 is more likely to leave the
> barrier last. Therefore, when it reaches the next ALLTOALL, the
> messages will already be there, as all the other nodes are in the
> alltoall. Now, as it reaches the alltoall later, imagine the effect
> that this will have on the communications between the other nodes. If
> it is late enough, there will be congestion, as all the others will be
> waiting for a sendrecv with node 0.

Yes, but in the ring-Sendrecv case with barriers, in each phase just one node
waits for a send *from* node 0 and one other node sends *to* node 0. Only after
node 0 has itself entered the barrier may all the nodes proceed to the next
communication phase. To my understanding this should never cause problems.
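
The partner pattern can be checked with a small Python sketch. It assumes
the usual shift schedule for a ring alltoall (in phase k, rank r sends to
(r + k) mod n and receives from (r - k) mod n); the actual code is not shown
in this mail, so the schedule and the function names are assumptions.

```python
def ring_schedule(n_ranks):
    """Return one list per phase k = 1 .. n_ranks - 1.

    Entry r of a phase is (dest, src): rank r sends to dest and
    receives from src within a single Sendrecv.
    """
    return [
        [((r + k) % n_ranks, (r - k) % n_ranks) for r in range(n_ranks)]
        for k in range(1, n_ranks)
    ]


def partners_of_rank0(n_ranks):
    """For each phase, list the ranks sending to rank 0 and the ranks
    receiving from rank 0."""
    result = []
    for phase in ring_schedule(n_ranks):
        senders_to_0 = [r for r, (dest, _src) in enumerate(phase) if dest == 0]
        receivers_from_0 = [r for r, (_dest, src) in enumerate(phase) if src == 0]
        result.append((senders_to_0, receivers_from_0))
    return result


if __name__ == "__main__":
    for k, (senders, receivers) in enumerate(partners_of_rank0(4), start=1):
        print(f"phase {k}: sends to 0: {senders}, receives from 0: {receivers}")
```

Under this schedule every phase has exactly one rank sending to node 0 and
exactly one rank receiving from it, so a late node 0 directly stalls at most
two other ranks per phase.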

I have also tried the tuned alltoalls and they are really great!! Only for
a very few message sizes, with 4 CPUs on a node, did one of my own alltoalls
perform better. Are these tuned collectives ready to be used for
production runs?

  Carsten



mpe.jpg