Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Deadlock on openib when using hindexed types
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2009-09-04 08:55:04


Hi Rolf,

I was indeed running a trunk more than 4 weeks old, but after pulling the
latest version (and checking that the patch was in the code), it seems to
make no difference.

However, I know where to look now, thanks!

Sylvain

On Fri, 4 Sep 2009, Rolf Vandevaart wrote:

> I think you are running into a bug that we also saw and recently fixed.
> We would see a hang when we were sending from a contiguous type to a
> non-contiguous type using a single port over openib. The problem was that
> the state of the request on the sending side was not being properly updated
> in that case. We see it with one port but not with two because different
> protocols are used depending on the number of ports.
>
> Don Kerr found and fixed the problem in both the trunk and the branch.
>
> Trunk: https://svn.open-mpi.org/trac/ompi/changeset/21775
> 1.3 Branch: https://svn.open-mpi.org/trac/ompi/changeset/21833
>
> If you are running the latest bits and still seeing the problem, then I guess
> it is something else.
>
> Rolf
>
> On 09/04/09 04:40, Sylvain Jeaugey wrote:
>> Hi all,
>>
>> We're currently working with romio and we hit a problem when exchanging
>> data with hindexed types with the openib btl.
>>
>> The attached reproducer (adapted from romio) works fine on tcp, blocks
>> on openib when using 1 port, but works if we use 2 ports (!). I tested it
>> against the trunk and the 1.3.3 release with the same conclusions.
>>
>> The basic idea is: processes 0..3 send contiguous data to process 0.
>> Process 0 receives these buffers with an hindexed datatype, which scatters
>> the data at different offsets.
>>
>> Receiving in a contiguous manner works, but receiving with an hindexed
>> datatype makes the remote sends block. Yes, the remote sends, not the
>> receive. The receive works fine and the data is correctly scattered in the
>> buffer, but the senders on the other node are stuck in the Wait().
>>
>> I tried not using MPI_BOTTOM, which changed nothing. It seems that the
>> problem only occurs when STRIPE*NB (the size of the send) is higher than
>> 12k (namely the RDMA threshold), but I didn't manage to remove the
>> deadlock by increasing the RDMA threshold.
>>
>> I've tried to do some debugging, but I'm a bit lost on where the
>> non-contiguous types are handled and how they affect btl communication.
>>
>> So, if anyone has a clue about where I should look, I'm interested!
>>
>> Thanks,
>> Sylvain
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
>
> =========================
> rolf.vandevaart_at_[hidden]
> 781-442-3043
> =========================
>
>
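
The receive side described in the original report (contiguous sends, received
on process 0 through an hindexed datatype that scatters the data at different
offsets) can be modeled in plain C without MPI. The sketch below is only an
illustration of what such a layout means: a list of block lengths and byte
displacements. It is not the attached reproducer and not how Open MPI's
datatype engine is implemented; the names hindexed_layout, layout_size and
unpack_hindexed are hypothetical. In real MPI code the layout would be built
with MPI_Type_create_hindexed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of an hindexed layout: block i holds blocklens[i]
 * bytes and starts at byte displacement displs[i] in the receive buffer. */
typedef struct {
    size_t count;               /* number of blocks */
    const size_t *blocklens;    /* length of each block, in bytes */
    const ptrdiff_t *displs;    /* byte displacement of each block */
} hindexed_layout;

/* Total number of payload bytes the layout can hold. */
static size_t layout_size(const hindexed_layout *l)
{
    size_t total = 0;
    for (size_t i = 0; i < l->count; i++)
        total += l->blocklens[i];
    return total;
}

/* Scatter a contiguous message (src, src_len) into dst according to the
 * layout. A message shorter than the layout fills only the leading blocks.
 * Returns the number of bytes copied. */
static size_t unpack_hindexed(const hindexed_layout *l,
                              const char *src, size_t src_len,
                              char *dst, size_t dst_len)
{
    size_t consumed = 0;
    for (size_t i = 0; i < l->count && consumed < src_len; i++) {
        size_t n = l->blocklens[i];
        if (n > src_len - consumed)
            n = src_len - consumed;     /* short message: partial block */
        /* Each block must fall inside the destination buffer. */
        assert(l->displs[i] >= 0);
        assert((size_t)l->displs[i] + n <= dst_len);
        memcpy(dst + l->displs[i], src + consumed, n);
        consumed += n;
    }
    return consumed;
}
```

With blocks of 2 and 3 bytes at displacements 8 and 1, a 5-byte contiguous
message lands in two separate places in the receive buffer, and in a different
order than it was sent. This non-contiguous placement is what the receiver has
to perform in the failing case, whereas the sender's buffer stays contiguous.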