I was indeed running a trunk more than 4 weeks old, but after pulling the
latest version (and checking the patch was in the code), it seems to make
However, I know where to look now, thanks!
On Fri, 4 Sep 2009, Rolf Vandevaart wrote:
> I think you are running into a bug that we also saw and recently fixed.
> We would see a hang when we were sending from a contiguous type to a
> non-contiguous type using a single port over openib. The problem was that
> the state of the request on the sending side was not being properly updated
> in that case. We see it with one port but not with two because
> different protocols are used depending on the number of ports.
> Don Kerr found and fixed the problem in both the trunk and the branch.
> Trunk: https://svn.open-mpi.org/trac/ompi/changeset/21775
> 1.3 Branch: https://svn.open-mpi.org/trac/ompi/changeset/21833
> If you are running the latest bits and still seeing the problem, then I guess
> it is something else.
> On 09/04/09 04:40, Sylvain Jeaugey wrote:
>> Hi all,
>> We're currently working with romio and we hit a problem when exchanging
>> data with hindexed types with the openib btl.
>> The attached reproducer (adapted from romio) works fine on tcp, blocks
>> on openib when using 1 port, but works if we use 2 ports (!). I tested it
>> against the trunk and the 1.3.3 release with the same conclusions.
>> The basic idea is: processes 0..3 send contiguous data to process 0.
>> Process 0 receives these buffers with an hindexed datatype which scatters
>> data at different offsets.
>> Receiving in a contiguous manner works, but receiving with an hindexed
>> datatype makes the remote sends block. Yes, the remote send, not the
>> receive. The receive is working fine and data is correctly scattered on the
>> buffer, but the senders on the other node are stuck in the Wait().
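
To make the receive-side layout described above concrete, here is a minimal pure-Python emulation of what an hindexed receive does with contiguous incoming bytes. It is not MPI code and not the attached reproducer: the values of STRIPE and NB, the interleaved placement of stripes, and all function names are assumptions chosen for illustration. In real MPI the (blocklengths, displacements) pair would be passed to MPI_Type_create_hindexed.

```python
# Pure-Python emulation of an hindexed receive layout; this is NOT MPI code.
# STRIPE and NB reuse the names from the message; the values are made up.

STRIPE = 4   # bytes per block (hypothetical value)
NB = 3       # blocks sent by each process (hypothetical value)
NPROCS = 4   # senders 0..3, as in the description


def hindexed_layout(rank):
    """Return (blocklengths, byte displacements) for the data sent by `rank`.

    Assumed layout: block b of rank r lands at stripe index b * NPROCS + r,
    i.e. the ranks are interleaved stripe by stripe.
    """
    blocklengths = [STRIPE] * NB
    displacements = [(b * NPROCS + rank) * STRIPE for b in range(NB)]
    return blocklengths, displacements


def scatter_into(buffer, payload, blocklengths, displacements):
    """Copy a contiguous payload into `buffer` block by block, in order."""
    pos = 0
    for length, disp in zip(blocklengths, displacements):
        buffer[disp:disp + length] = payload[pos:pos + length]
        pos += length
    if pos != len(payload):
        raise ValueError("payload size does not match the layout")


def run_exchange():
    """Emulate ranks 0..3 each sending STRIPE * NB contiguous bytes to rank 0."""
    recv = bytearray(STRIPE * NB * NPROCS)
    for rank in range(NPROCS):
        payload = bytes([ord("a") + rank]) * (STRIPE * NB)
        lengths, disps = hindexed_layout(rank)
        scatter_into(recv, payload, lengths, disps)
    return bytes(recv)


if __name__ == "__main__":
    print(run_exchange().decode())
```

Each sender's payload is contiguous, while the receiver places it at several non-adjacent byte offsets; this contiguous-to-non-contiguous pairing is the case discussed in this thread.
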
>> I tried not using MPI_BOTTOM, which changed nothing. It seems that the
>> problem only occurs when STRIPE*NB (the size of the send) is higher than
>> 12k (namely the RDMA threshold), but I didn't manage to remove the deadlock
>> by increasing the RDMA threshold.
>> I've tried to do some debugging, but I'm a bit lost on where the
>> non-contiguous types are handled and how they affect btl communication.
>> So, if anyone has a clue on where I should look, I'm interested!
>> devel mailing list