Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Waitall never returns [solved]
From: Ross Boylan (ross_at_[hidden])
Date: 2014-04-10 16:40:52


Waitall was not returning for the mundane reason that not all messages
sent were received. I'm not sure why the dump command seemed to say
there was nothing waiting. Ironically, the bug would never appear in
production, only in testing.

I fixed up my shutdown logic and all seems well now.

Ross
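
[Editor's sketch, not code from this thread. The message above says only
that the shutdown logic was fixed, not how. One common way to avoid this
kind of hang is to have each sender announce, in its shutdown message, how
many messages it sent, and to have the receiver keep receiving until its
own count matches. The bookkeeping below is plain C with no MPI calls so
it compiles on its own; drain_ledger, MAX_PEERS and the ledger_* functions
are hypothetical names, and the real receives and shutdown messages would
come from the application's MPI code.]

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_PEERS 64 /* hypothetical upper bound on the number of ranks */

/* Bookkeeping kept by a receiving rank, one slot per sending peer. */
typedef struct {
    long received[MAX_PEERS];  /* messages actually received from each peer */
    long announced[MAX_PEERS]; /* total the peer says it sent; -1 = not announced yet */
    int npeers;
} drain_ledger;

static void ledger_init(drain_ledger *l, int npeers)
{
    assert(npeers > 0 && npeers <= MAX_PEERS);
    l->npeers = npeers;
    for (int i = 0; i < npeers; ++i) {
        l->received[i] = 0;
        l->announced[i] = -1;
    }
}

/* Call once for every ordinary message received from `peer`. */
static void ledger_note_recv(drain_ledger *l, int peer)
{
    assert(peer >= 0 && peer < l->npeers);
    l->received[peer]++;
}

/* Call when `peer` sends its shutdown notice carrying its send count. */
static void ledger_note_shutdown(drain_ledger *l, int peer, long total_sent)
{
    assert(peer >= 0 && peer < l->npeers);
    assert(total_sent >= 0);
    l->announced[peer] = total_sent;
}

/* Messages announced by peers but not yet received here. */
static long ledger_missing(const drain_ledger *l)
{
    long missing = 0;
    for (int i = 0; i < l->npeers; ++i) {
        if (l->announced[i] >= 0 && l->announced[i] > l->received[i])
            missing += l->announced[i] - l->received[i];
    }
    return missing;
}

/* True only when every peer has announced and nothing is left to drain. */
static bool ledger_safe_to_exit(const drain_ledger *l)
{
    for (int i = 0; i < l->npeers; ++i) {
        if (l->announced[i] < 0 || l->received[i] < l->announced[i])
            return false;
    }
    return true;
}
```

[In a sketch like this, the receiver would stop posting receives only once
ledger_safe_to_exit() returns true, so that every send a peer is waiting
on has a matching receive.]
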
On 4/10/2014 1:06 PM, Ross Boylan wrote:
> On 4/10/2014 11:48 AM, Ross Boylan wrote:
>> On 4/9/2014 5:26 PM, Ross Boylan wrote:
>>> On Fri, 2014-04-04 at 22:40 -0400, George Bosilca wrote:
>>>> Ross,
>>>>
>>>> I’m not familiar with the R implementation you are using, but bear
>>>> with me and I will explain how you can ask Open MPI about the list
>>>> of all pending requests on a process. Disclosure: This is Open MPI
>>>> deep voodoo, an extreme way to debug applications that might save
>>>> you quite some time.
>>>>
>>>> The only thing you need is the communicator you posted your
>>>> requests into, or at least a pointer to it. Then you attach to your
>>>> process (or processes) with your preferred debugger and call
>>>> mca_pml_ob1_dump(struct ompi_communicator_t* comm, int verbose)
>>>>
>>>> With gdb this should look like “call mca_pml_ob1_dump(my_comm, 1)”.
>>>> This will dump human readable information about all the requests
>>>> pending on a communicator (both sends and receives).
>>>>
>>> Thank you so much for the tip. After inserting a barrier failed to
>>> help
> I managed to reproduce the problem with all ranks on one node. I see
> BTL SM 0x7fe9970ae660 endpoint 0x1f13470 [smp_rank 5] [peer_rank 0]
> ....
> BTL SM 0x7fe9970ae660 endpoint 0x20eebb0 [smp_rank 5] [peer_rank 12]
> which, if my previous theory of mca_pml_ob1_dump is correct, means
> there are no outstanding requests since there are no items listed
> under the BTL lines.
>
> This again has me wondering if requests can be closed without some
> kind of Wait or Test command.
>
> Sometimes the system runs to completion; the trigger seems to be
> having some ranks that finish rapidly because there are more such
> processes than work for them to do.