Open MPI Development Mailing List Archives

From: Caitlin Bestler (caitlinb_at_[hidden])
Date: 2007-05-09 18:25:06

general-bounces_at_[hidden] wrote:
> On Wed, 2007-05-09 at 17:55 -0700, Andrew Friedley wrote:
>> Steve Wise wrote:
>>> On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
>>>> Steve Wise wrote:
>>>>> There have been a series of discussions on the ofa general list
>>>>> about this issue, and the conclusion to date is that it cannot be
>>>>> resolved in the rdma-cm or iwarp-cm code of the linux rdma stack.
>>>>> Mainly because sending an RDMA message involves the ULP's work
>>>>> queue and completion queue, so the CM cannot do this under the
>>>>> covers in a manner that doesn't affect the application. Thus, the
>>>>> applications must deal with this.
>>>> Why can't uDAPL deal with this? As a uDAPL user, I really don't
>>>> care what API uDAPL is using under the hood to move data from one
>>>> place to another, nor the quirks of that API. The whole point of
>>>> uDAPL is to form a network-agnostic abstraction layer. AFAIK, the
>>>> uDAPL spec doesn't enforce any such requirement on RDMA
>>>> communication either. In my opinion, exposing such behavior above
>>>> uDAPL is incorrect and is part of why uDAPL has seen limited
>>>> adoption -- every single uDAPL implementation behaves in different
>>>> ways, making it extremely difficult to write an application to work
>>>> on any uDAPL implementation. Sorry if this sounds harsh, but this
>>>> comes from many hours of banging my head on the wall due to working
>>>> around these sorts of problems :)
>>> I understand your frustration. I think the MPA protocol is
>>> deficient in this respect and should have required the necessary
>>> "first FPDU" to be sent under the covers by the RNICs. A RTR packet
>>> if you will. To resolve this issue "properly", in my opinion, would
>>> involve changing the IETF MPA spec and also breaking all the
>>> existing iwarp HW. We can't do that.
>> Understood.
>>> The reason it is hard or impossible to solve this in the DAPL layer
>>> is that any rdma operation on the QP affects the state of that QP
>>> and the associated CQs. In addition, if you use an RDMA send to
>>> enforce this you impact the other side by consuming a RECV buffer.
>>> So it's hard if not impossible to do this under the covers without
>>> affecting the application's resources.
>> Is there no way to do this before passing connection established
>> events to the uDAPL consumer? I need to go read up on the uDAPL API
>> to really understand why this wouldn't work.
> Perhaps the dapl or maybe even an OFA iWARP CM could defer
> passing up the "established" event on the passive side until
> an incoming SEND is detected. I know we've discussed this
> before, but I'm not sure why this was not a workable
> solution. Perhaps Caitlin or some iwarp folks can recall?

That was what the RNIC-PI flag would have enabled. DAPL could
check for that flag in a transport/device independent way, and
delay the established event until it was safe to post (but no
longer than required: for IB, and for iWARP NICs that fence the first
transmit, the Established Event could be generated immediately).

So yes, the transport layer (OFA or DAPL) CAN hide this on
the passive side.
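
As a rough illustration of the passive-side logic described above, here is
a toy Python model. It is not DAPL or OFA code: `PassiveEndpoint`,
`fences_first_transmit`, `on_connection_accepted` and `on_incoming_message`
are hypothetical names, and the boolean merely stands in for the proposed
RNIC-PI capability flag.

```python
class PassiveEndpoint:
    """Toy model of a passive-side endpoint; all names are hypothetical."""

    def __init__(self, fences_first_transmit):
        # Stand-in for the proposed capability flag. True means the device
        # (IB, or an iWARP NIC that fences the first transmit) makes it
        # safe to post as soon as the connection is accepted.
        self.fences_first_transmit = fences_first_transmit
        self.accepted = False
        self.established_reported = False

    def on_connection_accepted(self):
        self.accepted = True
        if self.fences_first_transmit:
            # No need to wait: report "established" immediately.
            self._report_established()

    def on_incoming_message(self):
        # The first message from the active side has arrived, so the
        # passive side may now send; release the deferred event.
        if self.accepted and not self.established_reported:
            self._report_established()

    def _report_established(self):
        self.established_reported = True
```

With the flag set, the event is reported at accept time; without it, the
event is held back until the first incoming message is seen.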

But as you point out, that doesn't solve the problem of needing
the Send from the active side. Since the Consumer posts RECV
buffers *before* indicating whether the QP/EP will be used
on the passive or active end, and there are no standard verbs
to jam a receive buffer to the head of an RQ, there is no way
to hide a send/recv exchange from the application layer.
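
To make the receive-queue argument concrete, here is a small Python sketch
of a strictly FIFO receive queue. The class `ReceiveQueue`, the function
`simulate_hidden_first_send` and the buffer labels are all invented for
illustration; the only property taken from the discussion is that buffers
are consumed in the order they were posted and nothing can be inserted at
the head.

```python
from collections import deque


class ReceiveQueue:
    """Toy FIFO receive queue; hypothetical, not a verbs API."""

    def __init__(self):
        self._buffers = deque()

    def post_recv(self, label):
        # Buffers can only be appended to the tail of the queue.
        self._buffers.append(label)

    def deliver(self, message):
        # An incoming send always consumes the buffer at the head.
        label = self._buffers.popleft()
        return label, message


def simulate_hidden_first_send():
    rq = ReceiveQueue()
    # The consumer posts its buffers before the connection role is known.
    rq.post_recv("app-buffer-0")
    rq.post_recv("app-buffer-1")
    # A send issued "under the covers" by the transport arrives first...
    first = rq.deliver("hidden first message")
    # ...so the application's real payload lands one buffer later.
    second = rq.deliver("application payload")
    return first, second
```

In this model the hidden message ends up in `app-buffer-0`, a buffer the
application posted for its own traffic, which is why the exchange cannot be
kept invisible to the application.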

The fact that it can't be made transparent on the active side
certainly diminishes the value of making it transparent on
the receive side. It's still a good idea, but I don't think
it has percolated to the top of anyone's TODO list yet.
When it does, the RNIC-PI proposed flag is a simple capability
flag that is quite easy for any provider to statically set.