While running IMB 3.1 with Open MPI 1.2.7 over MX 1.2.7, I see hangs
in most collective operations when testing with 16K buffers on 128
nodes, one MPI rank per node, with the default settings:
works: PingPong, PingPing, Sendrecv, Exchange, Allreduce, Reduce,
Reduce_scatter, Allgather, Bcast, Barrier
hangs after 8K: Allgatherv, Gather, Gatherv, Scatter, Scatterv,
("hangs after 8K" means that the results for 8K are printed, but those
for 16K are not - I've allowed the jobs several hours after the 8K
results were printed before killing them; the processes continue to
use CPU time, but no progress seems to be made). I've only recently
been able to run IMB on such a large number of nodes; some lower-level
issues prevented me from running it before.
IMB finishes successfully under the same conditions when run on 64
nodes. With Allgatherv, I've found that the breaking point is somewhere
around 90 nodes: it works with 88 nodes and hangs with 90.
The above tests were performed with the default settings. When I
specify '--mca mtl mx --mca pml cm', IMB finishes successfully on 128
nodes; with MPICH-MX, IMB also finishes successfully on 128 nodes.
However, I consider it a serious problem if the default Open MPI
settings lead to hangs.
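For reference, the working invocation looks roughly like this (the
hostfile name and benchmark binary path are placeholders, not taken
from my actual setup):

```shell
# Force the MX MTL with the CM PML instead of the default BTL path.
# "hosts" and the IMB-MPI1 path are placeholders for illustration.
mpirun --mca mtl mx --mca pml cm \
       -np 128 --hostfile hosts \
       ./IMB-MPI1
```

With the default settings (no --mca flags), the same command hangs as
described above.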
Is this a known (but undocumented) behaviour? Do other sites with a
similar setup observe these hangs? Can someone suggest how to avoid
them, or at least a way to debug this?
Thanks in advance!
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850