Open MPI Development Mailing List Archives

From: Steve Wise (swise_at_[hidden])
Date: 2007-05-14 10:39:41

On Sun, 2007-05-13 at 21:26 -0400, Donald Kerr wrote:
> Caitlin Bestler wrote:
> >Donald Kerr wrote:
> >
> >
> >
> >>>>order of business after connection establishment
> >>>>(mca_btl_udapl_sendrecv()). The RECV buffer post for this exchange,
> >>>>however, should really be done _before_ the
> >>>>dat_ep_connect() on the active side, and _before_ the
> >>>>dat_cr_accept() on the server side.
> >>>>Currently it's done after the ESTABLISHED event is dequeued, thus
> >>>>allowing the race condition.
> >>>>
> >>>>I believe the rules are the ULP must ensure that a RECV is posted
> >>>>before the client can post a SEND for that buffer.
> >>>>And further, the ULP must enforce flow control somehow so that a
> >>>>SEND never arrives without a RECV buffer being available.
> >>>>
> >>>>
> >>>>
> >>>>
> >>maybe this is a rule iwarp imposes on its ULPs but not uDAPL.
> >>
> >>
> >>
> >
> >It is most assuredly a rule for uDAPL. And it is not a matter
> >of iWARP "imposing" on uDAPL. uDAPL was explicitly designed
> >to support IB, iWARP and VI. To do that DAPL documents its
> >model of what RDMA is.
> >
> >
> (sorry I was off the grid for a couple of days)
> Not to beat a dead horse but you would have to show me where in the Spec
> it says I must post a recv before a send. And thinking about it some I
> don't believe there is a race condition because this is not called out
> as such. Now if posting the handshake recv before the connect call
> speeds things up and helps the iwarp scenario I am all for it.
>
> >This issue is in fact one that is truly fundamental to the
> >efficiency of RDMA -- the transport layer DOES NOT provide
> >buffering. That's the application's job. It is precisely
> >because the application layer does a better job that RDMA
> >can achieve better performance at high bandwidth.
> >
> >For reasons that have been discussed in more depth in the
> >RDMA applicability statement and in RDDP/IPS discussions
> >on iSER, the absence of transport layer buffer throttling
> >places the onus for end-to-end pacing on the application.
> >It is a situation somewhat akin to a car with a broken
> >speedometer that had previously only driven during rush
> >hour bumper-to-bumper traffic. The fact that the speedometer
> >was broken was irrelevant. But if that same car hits the
> >open road the driver will need to come up with some method
> >of regulating their speed.
> >
> >The DAPL semantics are very clear that send/recv operations must
> >be matched one to one, that the receive buffer must be large
> >enough for the received message and that there must be a receive
> >buffer for each incoming send/recv message. That means that
> >the sender needs to have some basis for believing that the
> >RECV has been posted. Usually this is an explicit credit
> >that is decremented per message and incremented per response.
> >
> >
> Matching one to one sure, still does not say a recv must be posted
> before a send. Flow control is handled by the BTL.
>
> >What DAPL does not state is if the transport does explicit flow
> >control so that the sending application's work request is simply
> >not processed (and the sending application continues to provide
> >the buffer, as with InfiniBand) or whether the sender simply
> >transmits and leaves error detection to the receiver (iWARP).
> >There are theoretical advantages to both, but more importantly
> >neither of them is going to change. So the Consumer of RDMA
> >applications needs to use ULP/application layer flow control
> >to pace the transmitter. At the application layer that means
> >that the RECV must be posted *before* the Send/accept that
> >grants ULP credits to the far side.
> >
> >All of that should be clear in the IOV ownership rules and
> >discussion of the semantics of send/recv. If you thought you
> >saw something that implied any guarantees to the contrary
> >then could you point them out in a posting to the DAT reflector?
> >(or just send them to me or Arkady Kanevsky).
> >
> >
> I believe it was either you or Steve who claimed a recv must be posted
> before a send thus leading to a race condition. I fail to see this. But
> again, if Steve's patch makes things better I am all for it.

For iWARP, the connection may be TERMINATED if a SEND arrives on a QP
and no corresponding RECV buffer is posted.
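
To make the ordering concrete, here is a toy Python model of the situation. It is only a sketch: it does not use the uDAPL API, and every name in it (Endpoint, establish, repost_and_grant, ConnectionTerminated) is made up for illustration. It assumes the iWARP-like behavior described above, where a SEND that finds no posted RECV terminates the connection, and it models the sender's belief about the peer's posted RECVs as a credit count.

```python
from collections import deque


class ConnectionTerminated(Exception):
    """Raised when a SEND arrives and no RECV buffer is posted."""


class Endpoint:
    """Toy model of one side of a connection (hypothetical, not the DAT API)."""

    def __init__(self, credits):
        self.posted_recvs = deque()  # RECV buffers this side has posted
        self.delivered = []          # messages placed into posted buffers
        self.credits = credits       # RECVs we believe the peer has posted
        self.peer = None

    def post_recv(self):
        self.posted_recvs.append(object())

    def arrive(self, msg):
        # Assumed iWARP-like rule: the transport does not buffer.
        if not self.posted_recvs:
            raise ConnectionTerminated("SEND arrived with no RECV posted")
        self.posted_recvs.popleft()
        self.delivered.append(msg)

    def send(self, msg):
        # ULP flow control: hold the message when no credit is available.
        if self.credits == 0:
            return False
        self.credits -= 1
        self.peer.arrive(msg)
        return True

    def repost_and_grant(self):
        # Post the RECV first, then hand the credit back to the sender.
        self.post_recv()
        self.peer.credits += 1


def establish(active, passive, prepost):
    """Wire two endpoints; each posts `prepost` RECVs before connecting."""
    for ep in (active, passive):
        for _ in range(prepost):
            ep.post_recv()
    active.peer, passive.peer = passive, active


# Racy ordering: the sender assumes one credit at ESTABLISHED time,
# but the peer has not posted its handshake RECV yet.
a, b = Endpoint(credits=1), Endpoint(credits=1)
establish(a, b, prepost=0)
try:
    a.send("handshake")
    racy_outcome = "delivered"
except ConnectionTerminated:
    racy_outcome = "terminated"

# Safe ordering: the handshake RECV is posted before connect/accept.
c, d = Endpoint(credits=1), Endpoint(credits=1)
establish(c, d, prepost=1)
first = c.send("handshake")   # consumes the one preposted RECV
second = c.send("data")       # no credit left, so the ULP holds it
d.repost_and_grant()
third = c.send("data")        # credit returned, send proceeds
```

In the racy case the handshake SEND finds no buffer and the model raises ConnectionTerminated. In the safe case the initial credit is backed by a RECV that really was posted before the connection came up, and every later credit is granted only after the matching RECV has been reposted.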