Open MPI Development Mailing List Archives

From: Steve Wise (swise_at_[hidden])
Date: 2007-05-09 18:15:15

On Wed, 2007-05-09 at 17:55 -0700, Andrew Friedley wrote:
> Steve Wise wrote:
> > On Wed, 2007-05-09 at 16:15 -0700, Andrew Friedley wrote:
> >> Steve Wise wrote:
> >>> There have been a series of discussions on the ofa general list about
> >>> this issue, and the conclusion to date is that it cannot be resolved in
> >>> the rdma-cm or iwarp-cm code of the linux rdma stack. Mainly because
> >>> sending an RDMA message involves the ULP's work queue and completion
> >>> queue, so the CM cannot do this under the covers in a manner that
> >>> doesn't affect the application. Thus, the applications must deal with
> >>> this.
> >> Why can't uDAPL deal with this? As a uDAPL user, I really don't care
> >> what API uDAPL is using under the hood to move data from one place to
> >> another, nor the quirks of that API. The whole point of uDAPL is to
> >> form a network-agnostic abstraction layer. AFAIK, the uDAPL spec
> >> doesn't enforce any such requirement on RDMA communication either. In
> >> my opinion, exposing such behavior above uDAPL is incorrect and is part
> >> of why uDAPL has seen limited adoption -- every single uDAPL
> >> implementation behaves in different ways, making it extremely difficult
> >> to write an application to work on any uDAPL implementation. Sorry if
> >> this sounds harsh, but this comes from many hours of banging my head on
> >> the wall due to working around these sorts of problems :)
> >>
> >
> > I understand your frustration. I think the MPA protocol is deficient in
> > this respect and should have required the necessary "first FPDU" to be
> > sent under the covers by the RNICs. A RTR packet if you will. To
> > resolve this issue "properly", in my opinion, would involve changing the
> > IETF MPA spec and also breaking all the existing iwarp HW. We can't do
> > that.
> Understood.
> > The reason it is hard or impossible to solve this in the DAPL layer is
> > that any rdma operation on the QP affects the state of that QP and the
> > associated CQs. In addition, if you use an RDMA send to enforce this you
> > impact the other side by consuming a RECV buffer. So it's hard if not
> > impossible to do this under the covers without affecting the
> > application's resources.
> Is there no way to do this before passing connection established events
> to the uDAPL consumer? I need to go read up on the uDAPL API to really
> understand why this wouldn't work.

Perhaps the DAPL or maybe even an OFA iWARP CM could defer passing up the
"established" event on the passive side until an incoming SEND is
detected. I know we've discussed this before, but I'm not sure why this
was not a workable solution. Perhaps Caitlin or some iwarp folks can

> >
> > Also, the DAPL specification had a goal to not impose any additional
> > protocol on the wire. If you add this under the covers, then you add
> > such a "protocol" and break interoperability between a connection
> > accessed via DAPL on one end and some other API on the other end.
> So I guess there's no 'right' solution, at least at the uDAPL level.
> With RDMACM/OFA verbs, there's at least the argument that you can design
> the API/semantics however you please, while uDAPL is already standardized.

Yes, but it's still difficult to post a SEND under the covers because it
consumes the application resources in the form of QP and CQ space and a
RECV buffer.

So to date, we have...punted and pushed the problem to the ULP.
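
To illustrate what "dealing with it in the ULP" could look like, here is a
minimal sketch, not taken from any real stack. It assumes the accepting
(passive) side simply holds back its own sends until the first message from
the connecting side has arrived. The class name PassiveSideGate, the
post_send callback and the on_receive hook are all hypothetical names made
up for this example; a real ULP would wire them to its own send path and
receive-completion handler.

```python
from collections import deque


class PassiveSideGate:
    """Hypothetical ULP-level helper for the accepting (passive) side.

    Outgoing messages are held back until the first message from the
    connecting (active) side has been received.
    """

    def __init__(self, post_send):
        self._post_send = post_send   # assumed callable that really posts a send
        self._peer_has_sent = False   # becomes True on the first receive
        self._pending = deque()       # sends deferred until then

    def send(self, msg):
        """Post msg now if the peer has already sent, else queue it."""
        if self._peer_has_sent:
            self._post_send(msg)
        else:
            self._pending.append(msg)

    def on_receive(self, msg):
        """Call from the receive-completion handler (assumed hook)."""
        if not self._peer_has_sent:
            self._peer_has_sent = True
            # Flush deferred sends in the order they were requested.
            while self._pending:
                self._post_send(self._pending.popleft())
        return msg


if __name__ == "__main__":
    wire = []                          # stands in for the real send queue
    gate = PassiveSideGate(wire.append)
    gate.send("reply-1")               # deferred: peer has not sent yet
    gate.on_receive("hello")           # first receive releases the queue
    gate.send("reply-2")               # posted immediately from now on
    print(wire)
```

The cost of this approach is the one discussed above: the connecting side
has to send something first, and that message consumes a RECV buffer on the
accepting side, so the application has to budget for it.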

> I hope you guys are documenting this in a way that makes this issue
> extremely clear to both uDAPL and OFA verbs (is this the right naming?)
> users. Maybe it's been done already, but is it possible to emit some
> sort of loud warning/error when the accept()'ing side tries to send
> before a receive?

The connection comes tumbling down. How's that for loud? :)

Seriously though, it isn't documented well enough. But we're bleeding
edge here. And I'm still hoping somebody will come up with an elegant
solution that doesn't break interoperability, applications and/or iwarp
hw (I'm a dreamer :).