
Open MPI User's Mailing List Archives


From: Graham E Fagg (fagg_at_[hidden])
Date: 2006-01-03 09:52:30


Hello Carsten,
  happy new year to you too.

On Tue, 3 Jan 2006, Carsten Kutzner wrote:

> Hi Graham,
>
> sorry for the long delay, I was on Christmas holidays. I wish a Happy New
> Year!
>

> (Uh, I think the previous email did not arrive in my postbox (?)) But yes,

I am resending it after this reply.

> also the OMPI tuned all-to-all shows this strange performance behaviour
> (i.e. sometimes it's fast, sometimes it's delayed for 0.2 or more
> seconds). For message sizes where the delays occur, I am sometimes able to
> do better with an alternative all-to-all routine. It sets up the same
> communication pattern as the pairbased sendrecv all-to-all but not on the
> basis of the CPUs but on the basis of the nodes. The core looks like

So it's equivalent to a batch-style operation: each CPU does procs_pn*2
operations per step and there are just nnodes steps. (It's the same
communication pattern as the CPU-by-CPU pairwise version before, except that
the final sync is the waitall on the 'set' of posted receives)?

>
> /* loop over nodes */
> for (i = 0; i < nnodes; i++)
> {
>     destnode   = (nodeid + i) % nnodes;          /* send to destination node */
>     sourcenode = (nnodes + nodeid - i) % nnodes; /* receive from source node */
>
>     /* loop over CPUs on each node (1 or more processors per node) */
>     for (j = 0; j < procs_pn; j++)
>     {
>         sourcecpu = sourcenode*procs_pn + j; /* source of data      */
>         destcpu   = destnode  *procs_pn + j; /* destination of data */
>         MPI_Irecv(recvbuf + sourcecpu*recvcount, recvcount, recvtype, sourcecpu, 0, comm, &recvrequests[j]);
>         MPI_Isend(sendbuf + destcpu  *sendcount, sendcount, sendtype, destcpu,   0, comm, &sendrequests[j]);
>     }
>     MPI_Waitall(procs_pn, sendrequests, sendstatuses);
>     MPI_Waitall(procs_pn, recvrequests, recvstatuses);
> }

Is it possible to put the send and recv request handles in the same array
and then do a waitall on them in a single op? It shouldn't make too much
difference, as the recvs are (I hope) all posted before the waitall takes
effect, but it would be interesting to see whether there is an internal
effect from combining them.
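
Something along these lines is what I have in mind (an untested sketch
only, not a drop-in replacement): the buffer layout and loop structure
mirror your snippet, and deriving nodeid and nnodes from the rank via
procs_pn is an assumption about how your ranks are laid out across nodes.

#include <mpi.h>
#include <stdlib.h>

/* Sketch: same node-wise pattern as above, but the send and recv requests
 * share one array so each step needs a single MPI_Waitall. */
int alltoall_bynode_combined(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                             void *recvbuf, int recvcount, MPI_Datatype recvtype,
                             int procs_pn, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int nnodes = size / procs_pn;   /* assumes size is a multiple of procs_pn */
    int nodeid = rank / procs_pn;   /* assumes ranks are packed node by node  */

    MPI_Aint lb, sext, rext;        /* use type extents for buffer offsets    */
    MPI_Type_get_extent(sendtype, &lb, &sext);
    MPI_Type_get_extent(recvtype, &lb, &rext);

    MPI_Request *reqs = malloc(2 * procs_pn * sizeof(MPI_Request));

    for (int i = 0; i < nnodes; i++) {
        int destnode   = (nodeid + i) % nnodes;          /* send to this node      */
        int sourcenode = (nnodes + nodeid - i) % nnodes; /* receive from this node */

        for (int j = 0; j < procs_pn; j++) {
            int sourcecpu = sourcenode * procs_pn + j;   /* source of data       */
            int destcpu   = destnode   * procs_pn + j;   /* destination of data  */

            MPI_Irecv((char *)recvbuf + (MPI_Aint)sourcecpu * recvcount * rext,
                      recvcount, recvtype, sourcecpu, 0, comm, &reqs[2*j]);
            MPI_Isend((char *)sendbuf + (MPI_Aint)destcpu * sendcount * sext,
                      sendcount, sendtype, destcpu, 0, comm, &reqs[2*j + 1]);
        }
        /* one completion call per step instead of two */
        MPI_Waitall(2 * procs_pn, reqs, MPI_STATUSES_IGNORE);
    }
    free(reqs);
    return MPI_SUCCESS;
}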

> I tested for message sizes of 4, 8, 16, 32, ... 131072 byte that are to be
> sent from each CPU to every other, and for 4, 8, 16, 24 and 32 nodes (each
> node has 1, 2 or 4 CPUs). While in general the OMPI all-to-all performs
> better, the alternative one performs better for the following message
> sizes:
>
> 4 CPU nodes:
> 128 CPUs on 32 nodes: 512, 1024 byte
> 96 CPUs on 24 nodes: 512, 1024, 2048, 4096, 16384 byte
> 64 CPUs on 16 nodes: 4096 byte
>
> 2 CPU nodes:
> 64 CPUs on 32 nodes: 1024, 2048, 4096, 8192 byte
> 48 CPUs on 24 nodes: 2048, 4096, 8192, 131072 byte
>
> 1 CPU nodes:
> 32 CPUs on 32 nodes: 4096, 8192, 16384 byte
> 24 CPUs on 24 nodes: 8192, 16384, 32768, 65536, 131072 byte

Except for the 128K message on 48/24 nodes there appears to be a
well-defined pattern here. It looks more like a buffering side effect than
contention... if it were pure contention then, at larger message sizes,
the 128/32-node example would put more stress on the switch (more pairs
communicating and more data per pair means the chance of contention is
higher).

Do you have any tools such as Vampir (or its Intel equivalent) available
to get a timeline graph? (Even a jumpshot of one of the bad cases, such as
the 128/32 run for 256 floats below, would help.)

(GEORGE, can you run a GigE test for 32 nodes using slog etc. and send me
the data?)

> Here is an example measurement for 128 CPUs on 32 nodes, averages taken
> over 25 runs, not counting the 1st one. Performance problems marked by a
> (!):
>
> OMPI tuned all-to-all:
> ======================
> #CPUs  msg size (floats)   average (s)   std.dev. (s)   min. (s)   max. (s)
> 128 1 0.001288 0.000102 0.001077 0.001512
> 128 2 0.008391 0.000400 0.007861 0.009958
> 128 4 0.008403 0.000237 0.008095 0.009018
> 128 8 0.008228 0.000942 0.003801 0.008810
> 128 16 0.008503 0.000191 0.008233 0.008839
> 128 32 0.008656 0.000271 0.008084 0.009177
> 128 64 0.009085 0.000209 0.008757 0.009603
> 128 128 0.251414 0.073069 0.011547 0.506703 !
> 128 256 0.385515 0.127661 0.251431 0.578955 !
> 128 512 0.035111 0.000872 0.033358 0.036262
> 128 1024 0.046028 0.002116 0.043381 0.052602
> 128 2048 0.073392 0.007745 0.066432 0.104531
> 128 4096 0.165052 0.072889 0.124589 0.404213
> 128 8192 0.341377 0.041815 0.309457 0.530409
> 128 16384 0.507200 0.050872 0.492307 0.750956
> 128 32768 1.050291 0.132867 0.954496 1.344978
> 128 65536 2.213977 0.154987 1.962907 2.492560
> 128 131072 4.026107 0.147103 3.800191 4.336205
>
> alternative all-to-all:
> ======================
> 128 1 0.012584 0.000724 0.011073 0.015331
> 128 2 0.012506 0.000444 0.011707 0.013461
> 128 4 0.012412 0.000511 0.011157 0.013413
> 128 8 0.012488 0.000455 0.011767 0.013746
> 128 16 0.012664 0.000416 0.011745 0.013362
> 128 32 0.012878 0.000410 0.012157 0.013609
> 128 64 0.013138 0.000417 0.012452 0.013826
> 128 128 0.014016 0.000505 0.013195 0.014942 +
> 128 256 0.015843 0.000521 0.015107 0.016725 +
> 128 512 0.052240 0.079323 0.027019 0.320653 !
> 128 1024 0.123884 0.121560 0.038062 0.308929 !
> 128 2048 0.176877 0.125229 0.074457 0.387276 !
> 128 4096 0.305030 0.121716 0.176640 0.496375 !
> 128 8192 0.546405 0.108007 0.415272 0.899858 !
> 128 16384 0.604844 0.056576 0.558657 0.843943 !
> 128 32768 1.235298 0.097969 1.094720 1.451241 !
> 128 65536 2.926902 0.312733 2.458742 3.895563 !
> 128 131072 6.208087 0.472115 5.354304 7.317153 !
>
> The alternative all-to-all has the same performance problems, but they set
> in later ... and last longer ;( The results for the other cases look
> similar.
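
As an aside, the kind of harness I assume is behind numbers like the above
(averages over 25 runs, first run discarded) would look roughly like this.
This is an illustrative sketch only, not your actual benchmark: it barriers
before each run and keeps the slowest rank's time per run.

#include <mpi.h>
#include <math.h>
#include <stdio.h>

#define NRUNS 25   /* runs kept; one extra warm-up run is discarded */

/* sendbuf and recvbuf must each hold count * comm_size floats */
void time_alltoall(void *sendbuf, void *recvbuf, int count, MPI_Comm comm)
{
    double t[NRUNS];
    int rank;
    MPI_Comm_rank(comm, &rank);

    for (int run = 0; run <= NRUNS; run++) {
        MPI_Barrier(comm);                     /* start all ranks together    */
        double t0 = MPI_Wtime();
        MPI_Alltoall(sendbuf, count, MPI_FLOAT,
                     recvbuf, count, MPI_FLOAT, comm);
        double dt = MPI_Wtime() - t0;
        /* per-run time = slowest rank's time */
        MPI_Allreduce(MPI_IN_PLACE, &dt, 1, MPI_DOUBLE, MPI_MAX, comm);
        if (run > 0)                           /* run 0 is warm-up, discarded */
            t[run - 1] = dt;
    }

    if (rank == 0) {
        double sum = 0.0, min = t[0], max = t[0], var = 0.0;
        for (int i = 0; i < NRUNS; i++) {
            sum += t[i];
            if (t[i] < min) min = t[i];
            if (t[i] > max) max = t[i];
        }
        double avg = sum / NRUNS;
        for (int i = 0; i < NRUNS; i++)
            var += (t[i] - avg) * (t[i] - avg);
        printf("%10d floats  avg %.6f  std %.6f  min %.6f  max %.6f\n",
               count, avg, sqrt(var / (NRUNS - 1)), min, max);
    }
}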

There are two things we can do right now: add a new alltoall like yours
(adding yours to the code would require legal paperwork, 3rd-party stuff)
and correct the decision function. But really we just need to find out what
is causing this, as the current tuned collective alltoall looks faster when
this effect is not occurring anyway. It could be anything from a
hardware/configuration issue to a problem in the BTL/PTLs.
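
By "correct the decision function" I mean the small piece of logic that
picks which alltoall implementation to run for a given message size and
communicator size. Purely for illustration, it has the shape below; the
names and thresholds are made up for this sketch (roughly the range where
your variant won above) and are not the actual tuned-collective logic.

#include <mpi.h>

/* hypothetical implementations, assumed to exist elsewhere */
extern int alltoall_pairwise(void *, int, MPI_Datatype, void *, int, MPI_Datatype, MPI_Comm);
extern int alltoall_bynode  (void *, int, MPI_Datatype, void *, int, MPI_Datatype, MPI_Comm);

typedef int (*alltoall_fn)(void *, int, MPI_Datatype,
                           void *, int, MPI_Datatype, MPI_Comm);

/* Illustrative decision function: choose an implementation from the
 * per-pair message size and the communicator size. */
alltoall_fn choose_alltoall(int msgsize_bytes, int comm_size)
{
    if (comm_size >= 64 && msgsize_bytes >= 512 && msgsize_bytes <= 8192)
        return alltoall_bynode;   /* range where the node-wise variant won above */
    return alltoall_pairwise;     /* otherwise keep the pairwise exchange        */
}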

I am currently visiting HLRS/Stuttgart, so I will try to call you in an
hour or so; if you're leaving soon, I can call you tomorrow morning?

Thanks,
         Graham.
----------------------------------------------------------------------
Dr Graham E. Fagg | Distributed, Parallel and Meta-Computing
Innovative Computing Lab. | PVM3.4, HARNESS, FT-MPI, SNIPE & Open MPI
Computer Science Dept | Suite 203, 1122 Volunteer Blvd,
University of Tennessee | Knoxville, Tennessee, USA. TN 37996-3450
Email: fagg_at_[hidden] | Phone:+1(865)974-5790 | Fax:+1(865)974-8296
Broken complex systems are always derived from working simple systems
----------------------------------------------------------------------