Ok, I was wrong, the fix works.
Actually, when I rebuilt with the latest trunk, openib support was somehow
dropped, so I was running on tcp.
Which brings us to the next issue: tcp is actually not working (I don't
know why I was convinced that tcp worked). The fix solved the problem for
openib, but if I'm not mistaken (again!) tcp still hangs.
On Fri, 4 Sep 2009, Sylvain Jeaugey wrote:
> Hi Rolf,
> I was indeed running a trunk more than 4 weeks old, but after pulling the
> latest version (and checking the patch was in the code), it seems to make no
> difference.
> However, I know where to look now, thanks!
> On Fri, 4 Sep 2009, Rolf Vandevaart wrote:
>> I think you are running into a bug that we also saw and recently fixed.
>> We would see a hang when we were sending from a contiguous type to a
>> non-contiguous type using a single port over openib. The problem was that
>> the state of the request on the sending side was not being properly updated
>> in that case. The reason we see it with only one port vs two is because
>> different protocols are used depending on the number of ports.
>> Don Kerr found and fixed the problem in both the trunk and the branch.
>> Trunk: https://svn.open-mpi.org/trac/ompi/changeset/21775
>> 1.3 Branch: https://svn.open-mpi.org/trac/ompi/changeset/21833
>> If you are running the latest bits and still seeing the problem, then I
>> guess it is something else.
>> On 09/04/09 04:40, Sylvain Jeaugey wrote:
>>> Hi all,
>>> We're currently working with romio and we hit a problem when exchanging
>>> data with hindexed types with the openib btl.
>>> The attached reproducer (adapted from romio) works fine on tcp,
>>> blocks on openib when using 1 port, but works if we use 2 ports (!). I
>>> tested it against the trunk and the 1.3.3 release with the same result.
>>> The basic idea is: processes 0..3 send contiguous data to process 0.
>>> Process 0 receives these buffers with an hindexed datatype which scatters
>>> data at different offsets.
>>> Receiving in a contiguous manner works, but receiving with an hindexed
>>> datatype makes the remote sends block. Yes, the remote send, not the
>>> receive. The receive is working fine and data is correctly scattered on
>>> the buffer, but the senders on the other node are stuck in the Wait().
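For readers without the attachment, the receive-side layout described above can be sketched in plain C, with no MPI involved. This is not the attached reproducer. The sizes NPROCS, NB and STRIPE, the interleaved layout, and the helper names block_offset and scatter_blocks are all assumptions made for illustration. In a real MPI program the same byte displacements would be handed to MPI_Type_create_hindexed, and the library would do the scattering during the receive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sizes: the real values used by the reproducer are not
 * given in this thread (the hang was only seen for STRIPE*NB > 12k). */
#define NPROCS 4   /* senders 0..3 */
#define NB     4   /* blocks sent by each process (assumed) */
#define STRIPE 8   /* bytes per block (assumed, kept tiny here) */

/* Byte displacement, in rank 0's receive buffer, of block b coming from
 * sender `rank`. An interleaved (striped) layout is assumed. */
static size_t block_offset(int rank, int b)
{
    return ((size_t)b * NPROCS + (size_t)rank) * STRIPE;
}

/* Emulates what a receive with an hindexed datatype does: the sender's
 * contiguous buffer is cut into NB blocks of STRIPE bytes, and each block
 * lands at its own byte displacement in the receive buffer. */
static void scatter_blocks(char *recvbuf, size_t recvlen,
                           const char *sendbuf, int rank)
{
    for (int b = 0; b < NB; b++) {
        size_t off = block_offset(rank, b);
        assert(off + STRIPE <= recvlen);  /* block must fit in the buffer */
        memcpy(recvbuf + off, sendbuf + (size_t)b * STRIPE, STRIPE);
    }
}
```

Calling scatter_blocks once per sender fills a buffer of NPROCS*NB*STRIPE bytes in which blocks from ranks 0..3 alternate, which is the kind of non-contiguous receive the message describes.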
>>> I tried not using MPI_BOTTOM, which changed nothing. It seems that the
>>> problem only occurs when STRIPE*NB (the size of the send) is larger than
>>> 12k (namely the RDMA threshold), but I didn't manage to remove the deadlock
>>> by increasing the RDMA threshold.
>>> I've tried to do some debugging, but I'm a bit lost on where the
>>> non-contiguous types are handled and how they affect btl communication.
>>> So, if anyone has a clue about where I should look, I'm interested!
>>> devel mailing list