
Open MPI Development Mailing List Archives


This web mail archive is frozen.


Subject: Re: [OMPI devel] Deadlock on openib when using hindexed types
From: Sylvain Jeaugey (sylvain.jeaugey_at_[hidden])
Date: 2009-09-04 08:55:04

Hi Rolf,

I was indeed running a trunk checkout more than four weeks old, but after
pulling the latest version (and verifying that the patch is in the code), it
still seems to make no difference.

However, I now know where to look, thanks!


On Fri, 4 Sep 2009, Rolf Vandevaart wrote:

> I think you are running into a bug that we also saw and recently fixed.
> We would see a hang when sending from a contiguous type to a
> non-contiguous type using a single port over openib. The problem was that
> the state of the request on the sending side was not being properly updated
> in that case. The reason we see it with only one port vs. two is that
> different protocols are used depending on the number of ports.
> Don Kerr found and fixed the problem in both the trunk and the branch.
> Trunk:
> 1.3 Branch:
> If you are running the latest bits and still seeing the problem, then I guess
> it is something else.
> Rolf
> On 09/04/09 04:40, Sylvain Jeaugey wrote:
>> Hi all,
>> We're currently working with romio and we hit a problem when exchanging
>> data with hindexed types with the openib btl.
>> The attached reproducer (adapted from romio) works fine over tcp, blocks
>> over openib when using one port, but works if we use two ports (!). I
>> tested it against the trunk and the 1.3.3 release with the same conclusions.
>> The basic idea is: processes 0..3 send contiguous data to process 0, which
>> receives these buffers with a hindexed datatype that scatters the data at
>> different offsets.
>> Receiving in a contiguous manner works, but receiving with a hindexed
>> datatype makes the remote sends block. Yes, the remote sends, not the
>> receive. The receive works fine and the data is correctly scattered into
>> the buffer, but the senders on the other node are stuck in the Wait().
>> I tried not using MPI_BOTTOM, which changed nothing. The problem seems to
>> occur only when STRIPE*NB (the size of the send) is larger than 12 kB
>> (namely the RDMA threshold), but I didn't manage to remove the deadlock
>> by increasing the RDMA threshold.
>> I've tried to do some debugging, but I'm a bit lost as to where the
>> non-contiguous types are handled and how they affect btl communication.
>> So, if anyone has a clue about where I should look, I'm interested!
>> Thanks,
>> Sylvain
>> ------------------------------------------------------------------------
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
> --
> =========================
> rolf.vandevaart_at_[hidden]
> 781-442-3043
> =========================