I have a cluster using Mellanox (dual port) ConnectX hardware, and
I'm having some problems with MPI_Gather operations. The vendor id is
0x2c9, and the part ID is 26418. I had to add the vendor part ID to
mca-btl-openib-hca-params.ini, but the problems are the same for both
Open MPI 1.2.5 and 1.3, whether the part ID is in the ini file or not.
The details of my hardware, the Open MPI 1.3 configuration, and the
runtime environment are included in the attached tar.gz file.
A similar problem was reported in this message, and a 1.3 nightly was
reported to work:
I tested the code in that message, and it hangs (actually, runs very
slowly after a few iterations) with 1.2.5, but works fine with 1.3.
My own code starts worker processes with MPI::Comm::Spawn, and does a
series of Bcasts and Gathers from the parent process. Large
messages are passed between the spawned processes using Isend / Irecv
/ Wait, and that works fine. The crash or hang is always observed in
the parent process during the Gather operation.
I suspect this may have something to do with eager RDMA, so I ran some
tests with different values of btl_openib_use_eager_rdma. On 1.2.5,
no difference was observed: it always hung after about 20 Gathers.
On 1.3, the behavior depended on the setting:
* Not set: parent process crashes with a null pointer dereference on
the 10th Gather operation.
* Set to 0: parent process crashes with a null pointer dereference on
the 33rd Gather operation.
* Set to 1: parent process hangs on the 7th Gather operation.
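For reference, here is a minimal sketch of how the parameter can be set for runs like these. It uses Open MPI's general MCA parameter mechanisms (the OMPI_MCA_ environment prefix and the --mca option of mpirun). The executable name ./parent and the process count are placeholders, not my actual command line.

```shell
# Option 1: set the MCA parameter through the environment. Open MPI reads
# variables named OMPI_MCA_<param> when the job starts.
export OMPI_MCA_btl_openib_use_eager_rdma=0

# Option 2: pass it on the mpirun command line instead. "./parent" is a
# placeholder for the program that spawns the workers:
#   mpirun --mca btl_openib_use_eager_rdma 0 -np 1 ./parent

# To get back to the "not set" case, remove the variable again:
#   unset OMPI_MCA_btl_openib_use_eager_rdma
```

Use 1 instead of 0 for the "set to 1" case.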
I built 1.3 in debug mode and attempted to narrow down where the crash
(a segfault due to a null pointer dereference) occurs.
Before the crash, the stack trace looks like this:
#0 PMPI_Gather (sendbuf=0x7fbfffe494, sendcount=1, sendtype=0x2a958aab80,
recvbuf=0xda1a40, recvcount=1, recvtype=0x2a958aab80, root=0,
comm=0xd5bbd0) at pgather.c:138
#1 0x0000000000608ff4 in MPI::Comm::Gather (this=0xcdd890,
sendbuf=0x7fbfffe494, sendcount=1, sendtype=@0xa33950, recvbuf=0xda1a40,
recvcount=1, recvtype=@0xa33950, root=0)
Stepping into comm->c_coll.coll_gather at pgather.c:138 results in an
immediate crash, but comm->c_coll.coll_gather itself is not null (it
is the same as for successful Gathers).
Can anyone suggest where I can go from here?