Open MPI Development Mailing List Archives

From: Donald Kerr (Don.Kerr_at_[hidden])
Date: 2007-05-13 21:26:16

Caitlin Bestler wrote:

>Donald Kerr wrote:
>>>>order of business after connection establishment
>>>>(mba_btl_udapl_sendrecv()). The RECV buffer post for this exchange,
>>>>however, should really be done _before_ the
>>>>dat_ep_connect() on the active side, and _before_ the
>>>>dat_cr_accept() on the server side.
>>>>Currently it's done after the ESTABLISHED event is dequeued, thus
>>>>allowing the race condition.
>>>>I believe the rules are the ULP must ensure that a RECV is posted
>>>>before the client can post a SEND for that buffer.
>>>>And further, the ULP must enforce flow control somehow so that a
>>>>SEND never arrives without a RECV buffer being available.
>>Maybe this is a rule iWARP imposes on its ULPs, but not uDAPL.
>It is most assuredly a rule for uDAPL. And it is not a matter
>of iWARP "imposing" on uDAPL. uDAPL was explicitly designed
>to support IB, iWARP and VI. To do that DAPL documents its
>model of what RDMA is.
(Sorry, I was off the grid for a couple of days.)
Not to beat a dead horse, but you would have to show me where in the spec
it says I must post a recv before a send. And having thought about it some,
I don't believe there is a race condition, because the spec does not call
this out as a requirement. Now, if posting the handshake recv before the
connect call speeds things up and helps the iWARP scenario, I am all for it.
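
For readers following the thread, the two orderings under discussion can be
sketched with a toy model. Nothing below is the uDAPL API: every name
(toy_endpoint, toy_post_recv, and so on) is hypothetical, and the model simply
assumes the worst-case interleaving in which the peer's first SEND arrives
right after the connection is established. Whether the spec obliges a consumer
to guard against that interleaving is exactly what is in dispute here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one endpoint's receive side. Hypothetical names only;
 * this does not call or imitate the real uDAPL (DAT) interfaces. */
typedef struct {
    int  posted_recvs;  /* receive buffers currently posted */
    bool connected;     /* true once the connection is established */
    int  delivered;     /* sends that found a posted buffer */
    int  dropped;       /* sends that arrived with no buffer posted */
} toy_endpoint;

static void toy_init(toy_endpoint *ep)
{
    ep->posted_recvs = 0;
    ep->connected = false;
    ep->delivered = 0;
    ep->dropped = 0;
}

static void toy_post_recv(toy_endpoint *ep) { ep->posted_recvs++; }

static void toy_establish(toy_endpoint *ep) { ep->connected = true; }

/* The peer's SEND reaches this endpoint. Returns true if a buffer
 * was waiting for it, false if none was posted. */
static bool toy_incoming_send(toy_endpoint *ep)
{
    assert(ep->connected);  /* sends only arrive on an established connection */
    if (ep->posted_recvs == 0) {
        ep->dropped++;
        return false;
    }
    ep->posted_recvs--;
    ep->delivered++;
    return true;
}

/* Ordering 1: the recv is posted only after the connection is up,
 * and the peer's send is assumed to win the race. */
static bool toy_post_after_established(void)
{
    toy_endpoint ep;
    toy_init(&ep);
    toy_establish(&ep);
    bool ok = toy_incoming_send(&ep);  /* arrives before the post */
    toy_post_recv(&ep);                /* too late for that send */
    return ok;
}

/* Ordering 2: the recv is posted before connecting, so no
 * interleaving can find the queue empty for the first send. */
static bool toy_post_before_connect(void)
{
    toy_endpoint ep;
    toy_init(&ep);
    toy_post_recv(&ep);
    toy_establish(&ep);
    return toy_incoming_send(&ep);
}
```

In this model toy_post_after_established() reports a send with no buffer,
while toy_post_before_connect() always finds one. What a real transport does
with the unmatched send is outside the sketch.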

>This issue is in fact one that is truly fundamental to the
>efficiency of RDMA -- the transport layer DOES NOT provide
>buffering. That's the application's job. It is precisely
>because the application layer does a better job that RDMA
>can achieve better performance at high bandwidth.
>For reasons that have been discussed in more depth in the
>RDMA applicability statement and in RDDP/IPS discussions
>on iSER, the absence of transport layer buffer throttling
>places the onus for end-to-end pacing on the application.
>It is a situation somewhat akin to a car with a broken
>speedometer that had previously only driven during rush
>hour bumper-to-bumper traffic. The fact that the speedometer
>was broken was irrelevant. But if that same car hits the
>open road the driver will need to come up with some method
>of regulating their speed.
>The DAPL semantics are very clear that send/recv operations must
>be matched one to one, that the receive buffer must be large
>enough for the received message and that there must be a receive
>buffer for each incoming send/recv message. That means that
>the sender needs to have some basis for believing that the
>RECV has been posted. Usually this is an explicit credit
>that is decremented per message and incremented per response.
Matching one to one, sure, but that still does not say a recv must be posted
before a send. Flow control is handled by the BTL.
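
The credit scheme described in the quoted text ("decremented per message and
incremented per response") can be sketched as follows. This is a minimal
illustration under stated assumptions, not the BTL's actual flow control and
not DAT API code: toy_sender, toy_receiver, toy_grant, toy_try_send and
toy_respond are all hypothetical names, and both sides live in one process so
the invariant can be checked directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sender and receiver state for a credit-based scheme. */
typedef struct { int credits; } toy_sender;
typedef struct { int posted_recvs; int received; } toy_receiver;

/* The receiver posts n buffers first and only then advertises them
 * to the sender as credits. */
static void toy_grant(toy_receiver *rx, toy_sender *tx, int n)
{
    rx->posted_recvs += n;  /* post the receive buffers ... */
    tx->credits += n;       /* ... then hand out the matching credits */
}

/* The sender spends one credit per message. With no credit left it
 * keeps the message and returns false instead of transmitting. */
static bool toy_try_send(toy_sender *tx, toy_receiver *rx)
{
    if (tx->credits == 0)
        return false;
    tx->credits--;
    assert(rx->posted_recvs > 0);  /* invariant: a credit implies a buffer */
    rx->posted_recvs--;
    rx->received++;
    return true;
}

/* A response from the receiver reposts one buffer and returns one credit. */
static void toy_respond(toy_receiver *rx, toy_sender *tx)
{
    toy_grant(rx, tx, 1);
}
```

With two credits granted, two sends succeed and a third is held back until
toy_respond() returns a credit. The point of the ordering inside toy_grant()
is that a credit never exists without a buffer already posted behind it.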

>What DAPL does not state is if the transport does explicit flow
>control so that the sending application's work request is simply
>not processed (and the sending application continues to provide
>the buffer, as with InfiniBand) or whether the sender simply
>transmits and leaves error detection to the receiver (iWARP).
>There are theoretical advantages to both, but more importantly
>neither of them is going to change. So the Consumer of RDMA
>applications needs to use ULP/application layer flow control
>to pace the transmitter. At the application layer that means
>that the RECV must be posted *before* the Send/accept that
>grants ULP credits to the far side.
>All of that should be clear in the IOV ownership rules and
>discussion of the semantics of send/recv. If you thought you
>saw something that implied any guarantees to the contrary
>then could you point them out in a posting to the DAT reflector?
>(or just send them to me or Arkady Kanevsky).
I believe it was either you or Steve who claimed that a recv must be posted
before a send, and that this leads to a race condition. I fail to see this.
But again, if Steve's patch makes things better, I am all for it.
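
The two transport behaviors contrasted in the quoted text (the work request
simply not being processed, versus the sender transmitting and the receiver
detecting the error) can be reduced to a small sketch. It only restates that
description; the enum and function names are hypothetical, and real
InfiniBand or iWARP behavior is considerably more involved.

```c
#include <stdbool.h>

/* What happens to a send that finds no posted receive buffer,
 * per the two cases described in the quoted text. Hypothetical names. */
typedef enum {
    TOY_STALL_AT_SENDER,   /* request is held, sender keeps the buffer */
    TOY_ERROR_AT_RECEIVER  /* message is transmitted, receiver reports an error */
} toy_policy;

typedef enum { TOY_DELIVERED, TOY_PENDING, TOY_FAILED } toy_result;

/* Attempt one send against a receiver with *posted_recvs buffers. */
static toy_result toy_send(toy_policy policy, int *posted_recvs)
{
    if (*posted_recvs > 0) {
        (*posted_recvs)--;      /* a buffer was waiting: consume it */
        return TOY_DELIVERED;
    }
    /* No buffer: the outcome depends on the transport's policy. */
    return policy == TOY_STALL_AT_SENDER ? TOY_PENDING : TOY_FAILED;
}
```

Under either policy the send succeeds only when a buffer is already posted,
which is why the quoted argument places the pacing burden on the layer above
the transport.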