I was having problems with 1.2.5 hanging during collective operations
(MPI_Gather and MPI_Barrier):
2008/3/27 Matt Hughes <matt.c.hughes_at_[hidden]>:
> A similar problem was reported in this message, and a 1.3 nightly was
> reported to work:
> I tested the code in that message, and it hangs (actually, runs very
> slowly after a few iterations) with 1.2.5, but works fine with 1.3.
I was able to eliminate the hang I was seeing with 1.2.5 during the
gather operation by using these btl parameters (found at
Only the btl_openib_rd_low=75 and btl_openib_rd_num=128 parameters are
necessary to avoid the hang.
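
For reference, a minimal sketch of one way to apply just these two settings, assuming Open MPI's usual convention of reading MCA parameters from environment variables prefixed with OMPI_MCA_. The process count and the program name ./my_app in the commented mpirun line are placeholders, not taken from my actual runs.

```shell
# Set the two MCA parameters through the environment. Open MPI picks up
# any variable named OMPI_MCA_<parameter_name> when a job starts.
export OMPI_MCA_btl_openib_rd_low=75
export OMPI_MCA_btl_openib_rd_num=128

# Equivalent per-run form on the command line
# (-np 4 and ./my_app are placeholders):
#   mpirun --mca btl_openib_rd_low 75 --mca btl_openib_rd_num 128 -np 4 ./my_app
```

The current values and the short descriptions mentioned below can be listed with "ompi_info --param btl openib".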
The information given for the parameters in ompi_info is not very
helpful. Can anyone explain (or point me to a reference) what these
parameters do and how they affect collective operations?
> My own code starts worker processes with MPI::Comm::Spawn, and does a
> series of Bcasts and Gathers from the parent process. Large
> messages are passed between the spawned processes using Isend / Irecv
> / Wait, and that works fine. The crash or hang is always observed in
> the parent process during the Gather operation.
> I suspect this may have something to do with eager rdma, so I ran some
> tests with different values of btl_openib_use_eager_rdma. On 1.2.5,
> no difference was observed. It always hung after about 20 Gathers.
> On 1.3:
> * Not set: parent process crashes with a null pointer dereference on
> the 10th Gather operation.
> * Set to 0: parent process crashes with a null pointer dereference on
> the 33rd Gather operation.
> * Set to 1: parent process hangs on the 7th Gather operation.
> I built 1.3 in debug mode and attempted to narrow down where the crash
> (a segfault due to a null pointer dereference) occurs.
> Before the crash, the stack trace looks like this:
> #0 PMPI_Gather (sendbuf=0x7fbfffe494, sendcount=1, sendtype=0x2a958aab80,
> recvbuf=0xda1a40, recvcount=1, recvtype=0x2a958aab80, root=0,
> comm=0xd5bbd0) at pgather.c:138
> #1 0x0000000000608ff4 in MPI::Comm::Gather (this=0xcdd890,
> sendbuf=0x7fbfffe494, sendcount=1, sendtype=@0xa33950, recvbuf=0xda1a40,
> recvcount=1, recvtype=@0xa33950, root=0)
> at /home/matt/opt/openmpi/1.3/include/openmpi/ompi/mpi/cxx/comm_inln.h:325
> Stepping into comm->c_coll.coll_gather at pgather.c:138 results in an
> immediate crash, but comm->c_coll.coll_gather itself is not null (it
> is the same as for successful Gathers).
> Can anyone suggest where I can go from here?
> Matt Hughes