Open MPI Development Mailing List Archives

From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2007-05-09 17:22:44


I talked with Steve a bunch on the phone about this.

1. This "connector must RDMA first" issue is an iWARP restriction --
it's not specific to udapl or verbs. For example, if you try to use
udapl with iWARP on Solaris, you'll have the same issue (I have no
idea whether you have iWARP drivers in Solaris or not).

2. Per his prior e-mail (which I didn't fully grok until I talked to
him), using the RDMA CM in the openib BTL will not magically fix this
issue for us.

3. So for any of the BTLs to support iWARP -- regardless of
underlying protocol or OS -- they are going to have to obey this
restriction.

4. Luckily, in iWARP, the restriction can be met by either send/
receive semantics *or* RDMA semantics. You don't have to
specifically use RDMA verbs semantics, for example. This is good
because of the way that OMPI works (the first fragment that will be
transmitted is pretty much guaranteed to be a send/receive fragment,
not an RDMA fragment) -- it makes the logistics slightly simpler.

Galen Shipman and I talked about this a bit and suggest the following:

- During the connection dance (probably for both the udapl and openib
BTLs), whichever peer ends up being the connection initiator can send
its pending fragment immediately. Steve is right that there will
always be a pending fragment, because OMPI doesn't make a connection
until the first send. Don't forget about the race condition where 2
peers may simultaneously decide to initiate: this case is handled
properly in the OMPI code, but make sure you modify the side that
ends up being the actual initiator.

- The other peer (the receiver of the connection) must wait to send
its pending fragment(s) until it receives the first frag from the
connection initiator. This can be accomplished either with another
flag on the OMPI module struct or perhaps by making it part of the
connection protocol (i.e., don't transition the endpoint to be
CONNECTED until the first fragment is received). Either approach can
be used to queue up fragments on the receiver until the first
fragment is received from the initiator. I'd have to look deeper in
the code, but I'm *guessing* that it might be best to use the
already-existing state flag (i.e., checking for CONNECTED) because
then you won't be introducing any more conditionals in the critical
path.
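
To make the second bullet concrete, here is a rough sketch of the
"don't go CONNECTED until the first frag arrives" variant. Everything
in it is made up for illustration (struct endpoint, the ep_* functions,
the fixed-size queue, and the wire[] array standing in for the
network); none of these are the real BTL names, and a real endpoint
would use the existing pending-frag list rather than an int array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; not the actual OMPI BTL structures. */
enum ep_state { EP_CLOSED, EP_CONNECTING, EP_CONNECTED };

#define MAX_PENDING 16
#define WIRE_CAP    64

struct endpoint {
    enum ep_state state;
    bool   initiator;             /* true on the side that won the connect race */
    int    pending[MAX_PENDING];  /* frag ids queued until CONNECTED */
    size_t npending;
    int    wire[WIRE_CAP];        /* stand-in for frags actually posted */
    size_t nwire;
};

void ep_init(struct endpoint *ep)
{
    ep->state = EP_CLOSED;
    ep->initiator = false;
    ep->npending = 0;
    ep->nwire = 0;
}

/* Post one frag to the (simulated) network. */
static void ep_post(struct endpoint *ep, int frag)
{
    assert(ep->nwire < WIRE_CAP);
    ep->wire[ep->nwire++] = frag;
}

/* Drain the pending queue in FIFO order. */
static void ep_flush(struct endpoint *ep)
{
    for (size_t i = 0; i < ep->npending; i++) {
        ep_post(ep, ep->pending[i]);
    }
    ep->npending = 0;
}

/* Send path: the only check is the existing state test. */
int ep_send(struct endpoint *ep, int frag)
{
    if (ep->state == EP_CONNECTED) {
        ep_post(ep, frag);
        return 0;
    }
    if (ep->npending == MAX_PENDING) {
        return -1;                /* queue full */
    }
    ep->pending[ep->npending++] = frag;
    return 0;
}

/* Called once the transport-level connection is up. */
void ep_connection_established(struct endpoint *ep, bool is_initiator)
{
    ep->initiator = is_initiator;
    if (is_initiator) {
        ep->state = EP_CONNECTED; /* initiator may send right away */
        ep_flush(ep);
    } else {
        ep->state = EP_CONNECTING; /* hold frags until the peer sends first */
    }
}

/* Called for every incoming frag. */
void ep_recv(struct endpoint *ep, int frag)
{
    (void)frag;                   /* payload handling omitted */
    if (ep->state == EP_CONNECTING && !ep->initiator) {
        ep->state = EP_CONNECTED; /* first frag from initiator has arrived */
        ep_flush(ep);
    }
}
```

The point of doing it this way is that ep_send() only ever tests the
state it already had to test; the extra logic lives in the connect and
receive paths, which run once per connection rather than once per
fragment.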

On May 9, 2007, at 4:45 PM, Donald Kerr wrote:

> I guess I have not read enough about iwarp yet but if iwarp is sitting
> below ib verbs or udapl in the stack and is trying to impose
> restrictions which ib verbs or udapl do not adhere to then maybe iwarp
> is in the wrong place in the ofed stack.
>
> Having said that I do agree the OMPI community needs to consider where
> iwarp plays in its own stack. If it has not already.
>
> Steve Wise wrote:
>
>> On Wed, 2007-05-09 at 16:27 -0400, Donald Kerr wrote:
>>
>>
>>> So then I agree with Andrew, I think you are trying to impose
>>> restrictions on uDAPL which are not part of the Spec.
>>>
>>>
>>>
>>
>> true, but if you want a single btl for IB and IW, then you'll need to
>> address this issue in some way...
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>

-- 
Jeff Squyres
Cisco Systems