Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Waitall never returns
From: Ross Boylan (ross_at_[hidden])
Date: 2014-04-10 16:06:16

On 4/10/2014 11:48 AM, Ross Boylan wrote:
> On 4/9/2014 5:26 PM, Ross Boylan wrote:
>> On Fri, 2014-04-04 at 22:40 -0400, George Bosilca wrote:
>>> Ross,
>>> I’m not familiar with the R implementation you are using, but bear
>>> with me and I will explain how you can ask Open MPI about the list
>>> of all pending requests on a process. Disclosure: This is Open MPI
>>> deep voodoo, an extreme way to debug applications that might save
>>> you quite some time.
>>> The only thing you need is the communicator you posted your requests
>>> into, or at least a pointer to it. Then you attach to your process
>>> (or processes) with your preferred debugger and call
>>> mca_pml_ob1_dump(struct ompi_communicator_t* comm, int verbose)
>>> With gdb this should look like “call mca_pml_ob1_dump(my_comm, 1)”.
>>> This will dump human readable information about all the requests
>>> pending on a communicator (both sends and receives).
>> Thank you so much for the tip. After inserting a barrier failed to help
I managed to reproduce the problem with all ranks on one node. I see
BTL SM 0x7fe9970ae660 endpoint 0x1f13470 [smp_rank 5] [peer_rank 0]
BTL SM 0x7fe9970ae660 endpoint 0x20eebb0 [smp_rank 5] [peer_rank 12]
which, if my previous theory about the mca_pml_ob1_dump output is correct,
means there are no outstanding requests, since there are no items listed
under the BTL lines.

This again has me wondering if requests can be closed without some kind
of Wait or Test command.

Sometimes the system runs to completion. The trigger for the hang seems to
be having some ranks that finish rapidly because there are more such
processes than there is work for them to do.