
Open MPI User's Mailing List Archives


Subject: [OMPI users] btl_openib_rd_{num, low} parameters? (Was Re: ConnectX hang with 1.2.5, crash with 1.3, during gather)
From: Matt Hughes (matt.c.hughes+ompi_at_[hidden])
Date: 2008-04-04 17:47:49


I was having problems with 1.2.5 hanging during collective operations
(MPI_Gather and MPI_Barrier):

2008/3/27 Matt Hughes <matt.c.hughes_at_[hidden]>:
> A similar problem was reported in this message, and a 1.3 nightly was
> reported to work:
> http://www.open-mpi.org/community/lists/users/2008/01/4891.php
>
> I tested the code in that message, and it hangs (actually, runs very
> slowly after a few iterations) with 1.2.5, but works fine with 1.3.

I was able to eliminate the hang I was seeing with 1.2.5 during the
gather operation by using these btl parameters (found at
http://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mca/btl/openib/btl-openib-benchmark):

 btl_openib_max_btls=20
 btl_openib_rd_num=128
 btl_openib_rd_low=75
 btl_openib_rd_win=50
 btl_openib_max_eager_rdma=32
 mpool_base_use_mem_hooks=1
 mpi_leave_pinned=1

Only the btl_openib_rd_low=75 and btl_openib_rd_num=128 parameters are
necessary to avoid the hang.
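For anyone who wants to try these values, MCA parameters like the ones
above can be set either per-run on the mpirun command line or
persistently in an MCA parameter file. A sketch (the application name
is a placeholder; the parameter values are the ones from the list
above):

```
# Per-run, on the mpirun command line:
mpirun --mca btl_openib_rd_num 128 --mca btl_openib_rd_low 75 \
       -np 4 ./my_app

# Or persistently, by adding lines to $HOME/.openmpi/mca-params.conf:
#   btl_openib_rd_num = 128
#   btl_openib_rd_low = 75
```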

The information given for the parameters in ompi_info is not very
helpful. Can anyone explain (or point me to a reference) what these
parameters do and how they affect collective operations?

Thanks,
mch

>
> My own code starts worker processes with MPI::Comm::Spawn, and does a
> series of Bcast's and Gather's from the parent process. Large
> messages are passed between the spawned processes using ISend / IRecv
> / Wait, and that works fine. The crash or hang is always observed in
> the parent process during the Gather operation.
>
> I suspect this may have something to do with eager rdma, so I ran some
> tests with different values of btl_openib_use_eager_rdma. On 1.2.5,
> no difference was observed. It always hung after about 20 Gathers.
> On 1.3:
>
> * Not set: parent process crashes with a null pointer dereference on
> the 10th Gather operation.
> * Set to 0: parent process crashes with a null pointer dereference on
> the 33rd Gather operation.
> * Set to 1: parent process hangs on the 7th Gather operation.
>
> I built 1.3 in debug mode and attempted to narrow down where the crash
> occurs (a segfault due to a null pointer dereference).
>
> Before the crash, the stack trace looks like this:
>
> #0 PMPI_Gather (sendbuf=0x7fbfffe494, sendcount=1, sendtype=0x2a958aab80,
> recvbuf=0xda1a40, recvcount=1, recvtype=0x2a958aab80, root=0,
> comm=0xd5bbd0) at pgather.c:138
> #1 0x0000000000608ff4 in MPI::Comm::Gather (this=0xcdd890,
> sendbuf=0x7fbfffe494, sendcount=1, sendtype=@0xa33950, recvbuf=0xda1a40,
> recvcount=1, recvtype=@0xa33950, root=0)
> at /home/matt/opt/openmpi/1.3/include/openmpi/ompi/mpi/cxx/comm_inln.h:325
>
> Stepping into comm->c_coll.coll_gather at pgather.c:138 results in an
> immediate crash, but comm->c_coll.coll_gather itself is not null (it
> is the same as for successful Gathers).
>
> Can anyone suggest where I can go from here?
>
> Thanks,
> Matt Hughes
>