
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Deadlock on openib when using hindexed types
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2009-09-04 11:01:47


Ok, I was wrong, the fix works.

Actually, I rebuilt with the latest trunk, but openib support was somehow
dropped from the build, so I was in fact running on tcp.

Which brings us to the next issue: tcp is actually not working (I don't
know why I was convinced that it was). The fix solved the problem for
openib, but unless I'm mistaken (again!), tcp still hangs.

Sylvain

On Fri, 4 Sep 2009, Sylvain Jeaugey wrote:

> Hi Rolf,
>
> I was indeed running a trunk more than four weeks old, but after pulling
> the latest version (and checking that the patch was in the code), it seems
> to make no difference.
>
> However, I now know where to look, thanks!
>
> Sylvain
>
> On Fri, 4 Sep 2009, Rolf Vandevaart wrote:
>
>> I think you are running into a bug that we also saw and recently fixed.
>> We would see a hang when sending from a contiguous type to a
>> non-contiguous type using a single port over openib. The problem was that
>> the state of the request on the sending side was not being properly
>> updated in that case. The reason we see it with only one port versus two
>> is that different protocols are used depending on the number of ports.
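>>
>> (As an aside, to force the single-port protocol path on a dual-port
>> machine, one way is to restrict openib to one port with
>> btl_openib_if_include, if your build has that parameter; the device and
>> binary names below are just examples:
>>
>>   mpirun --mca btl openib,self --mca btl_openib_if_include mlx4_0:1 \
>>     -np 8 ./reproducer
>>
>> With both ports active a different protocol is selected, which is why
>> the hang disappears.)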
>>
>> Don Kerr found and fixed the problem in both the trunk and the branch.
>>
>> Trunk: https://svn.open-mpi.org/trac/ompi/changeset/21775
>> 1.3 Branch: https://svn.open-mpi.org/trac/ompi/changeset/21833
>>
>> If you are running the latest bits and still seeing the problem, then I
>> guess it is something else.
>>
>> Rolf
>>
>> On 09/04/09 04:40, Sylvain Jeaugey wrote:
>>> Hi all,
>>>
>>> We're currently working with romio, and we hit a problem when exchanging
>>> data with hindexed types over the openib btl.
>>>
>>> The attached reproducer (adapted from romio) works fine on tcp, blocks
>>> on openib when using one port, but works if we use two ports (!). I
>>> tested it against both the trunk and the 1.3.3 release with the same
>>> conclusions.
>>>
>>> The basic idea is: processes 0..3 send contiguous data to process 0,
>>> which receives these buffers with a hindexed datatype that scatters the
>>> data at different offsets.
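>>>
>>> Schematically, the pattern is the following (a simplified sketch, not
>>> the attached reproducer itself: the real code receives at MPI_BOTTOM
>>> with absolute displacements, rank 0's self-send is omitted here, and
>>> the sizes are just examples):
>>>
>>> #include <mpi.h>
>>> #include <stdlib.h>
>>>
>>> #define NB     4096   /* chunk size in bytes */
>>> #define STRIPE 4      /* chunks per sender: STRIPE*NB = 16k > 12k */
>>>
>>> int main(int argc, char **argv)
>>> {
>>>     int rank, size, i, r;
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>
>>>     if (rank == 0) {
>>>         /* Interleave each sender's chunks into one large buffer. */
>>>         char *buf = malloc((size_t)size * STRIPE * NB);
>>>         for (r = 1; r < size; r++) {
>>>             int blens[STRIPE];
>>>             MPI_Aint displs[STRIPE];
>>>             MPI_Datatype htype;
>>>             for (i = 0; i < STRIPE; i++) {
>>>                 blens[i] = NB;
>>>                 displs[i] = ((MPI_Aint)i * size + r) * NB;
>>>             }
>>>             MPI_Type_create_hindexed(STRIPE, blens, displs, MPI_BYTE,
>>>                                      &htype);
>>>             MPI_Type_commit(&htype);
>>>             /* The receive completes and scatters the data correctly. */
>>>             MPI_Recv(buf, 1, htype, r, 0, MPI_COMM_WORLD,
>>>                      MPI_STATUS_IGNORE);
>>>             MPI_Type_free(&htype);
>>>         }
>>>         free(buf);
>>>     } else {
>>>         /* Plain contiguous send; this Wait() is what never returns. */
>>>         char *buf = calloc(STRIPE * NB, 1);
>>>         MPI_Request req;
>>>         MPI_Isend(buf, STRIPE * NB, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
>>>                   &req);
>>>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>>>         free(buf);
>>>     }
>>>     MPI_Finalize();
>>>     return 0;
>>> }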
>>>
>>> Receiving in a contiguous manner works, but receiving with a hindexed
>>> datatype makes the remote sends block. Yes, the remote sends, not the
>>> receives. The receive completes fine and the data is correctly scattered
>>> into the buffer, but the senders on the other node are stuck in
>>> MPI_Wait().
>>>
>>> I tried not using MPI_BOTTOM, which changed nothing. It seems that the
>>> problem only occurs when STRIPE*NB (the size of the send) is larger than
>>> 12k (namely, the RDMA threshold), but I didn't manage to remove the
>>> deadlock by increasing the RDMA threshold.
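>>>
>>> For the record, what I tried was along these lines (assuming
>>> btl_openib_eager_limit is indeed the 12k threshold in question; the
>>> value is arbitrary):
>>>
>>>   mpirun --mca btl openib,self --mca btl_openib_eager_limit 65536 \
>>>     -np 8 ./reproducer
>>>
>>> and the senders still hang.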
>>>
>>> I've tried to do some debugging, but I'm a bit lost as to where
>>> non-contiguous types are handled and how they affect btl communication.
>>>
>>> So, if anyone has a clue about where I should look, I'm interested!
>>>
>>> Thanks,
>>> Sylvain
>>>
>>
>>
>> --
>>
>> =========================
>> rolf.vandevaart_at_[hidden]
>> 781-442-3043
>> =========================