Open MPI Development Mailing List Archives

From: Caitlin Bestler (caitlinb_at_[hidden])
Date: 2007-05-11 13:43:46

Donal Kerr wrote:

>>> order of business after connection establishment
>>> (mba_btl_udapl_sendrecv(). The RECV buffer post for this exchange,
>>> however, should really be done _before_ the
>>> dat_ep_connect() on the active side, and _before_ the
>>> dat_cr_accept() on the server side.
>>> Currently it's done after the ESTABLISHED event is dequeued, thus
>>> allowing the race condition.
>>> I believe the rules are the ULP must ensure that a RECV is posted
>>> before the client can post a SEND for that buffer.
>>> And further, the ULP must enforce flow control somehow so that a
>>> SEND never arrives without a RECV buffer being available.
> maybe this is a rule iwarp imposes on its ULPs but not uDAPL.

It is most assuredly a rule for uDAPL. And it is not a matter
of iWARP "imposing" on uDAPL. uDAPL was explicitly designed
to support IB, iWARP and VI. To do that DAPL documents its
model of what RDMA is.

This issue is in fact one that is truly fundamental to the
efficiency of RDMA -- the transport layer DOES NOT provide
buffering. That's the application's job. It is precisely
because the application layer does a better job that RDMA
can achieve better performance at high bandwidth.

For reasons that have been discussed in more depth in the
RDMA applicability statement and in RDDP/IPS discussions
on iSER, the absence of transport layer buffer throttling
places the onus for end-to-end pacing on the application.
It is a situation somewhat akin to a car with a broken
speedometer that had previously only driven during rush
hour bumper-to-bumper traffic. The fact that the speedometer
was broken was irrelevant. But if that same car hits the
open road the driver will need to come up with some method
of regulating their speed.

The DAPL semantics are very clear that send/recv operations must
be matched one to one, that the receive buffer must be large
enough for the received message and that there must be a receive
buffer for each incoming send/recv message. That means that
the sender needs to have some basis for believing that the
RECV has been posted. Usually this is an explicit credit
that is decremented per message and incremented per response.
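
A minimal sketch of such a credit scheme on the sending side, in C.
The type and function names are hypothetical and are not part of the
uDAPL API; the sketch only models the bookkeeping, and a real ULP
would post the SEND work request where the comment indicates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-connection sender state (not a uDAPL type). */
typedef struct {
    size_t credits; /* RECV buffers the peer is believed to have posted */
    size_t sent;    /* messages handed to the transport so far */
} credit_sender;

/* Start with the number of RECVs the peer is known to pre-post. */
static void credit_sender_init(credit_sender *s, size_t initial_credits)
{
    assert(s != NULL);
    s->credits = initial_credits;
    s->sent = 0;
}

/* Try to send one message. Returns false, and sends nothing, when
 * no credit is left; the caller is expected to queue the message. */
static bool credit_sender_try_send(credit_sender *s)
{
    assert(s != NULL);
    if (s->credits == 0)
        return false;
    s->credits--; /* one credit is consumed per message */
    s->sent++;    /* a real ULP would post the SEND here */
    return true;
}

/* A response from the peer reports n reposted RECV buffers. */
static void credit_sender_on_response(credit_sender *s, size_t n)
{
    assert(s != NULL);
    s->credits += n;
}
```

With two initial credits, the third call to credit_sender_try_send()
fails until credit_sender_on_response() returns at least one credit.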

What DAPL does not state is whether the transport does explicit flow
control so that the sending application's work request is simply
not processed (and the sending application continues to provide
the buffer, as with InfiniBand) or whether the sender simply
transmits and leaves error detection to the receiver (iWARP).
There are theoretical advantages to both, but more importantly
neither of them is going to change. So the Consumer of RDMA
applications needs to use ULP/application layer flow control
to pace the transmitter. At the application layer that means
that the RECV must be posted *before* the Send/accept that
grants ULP credits to the far side.
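
The ordering rule can be sketched from the receiving side as well.
This is an illustrative model in C with hypothetical names, not uDAPL
calls: a credit may only be advertised when an already-posted RECV
backs it. At connection setup the connect or accept itself acts as
the first grant, which is why the RECV post belongs before
dat_ep_connect() or dat_cr_accept().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical receiver-side bookkeeping (not a uDAPL type). */
typedef struct {
    size_t posted;  /* RECV buffers currently posted */
    size_t granted; /* credits advertised to the peer, not yet used */
} recv_side;

/* Record that one RECV buffer has been posted. */
static void recv_side_post(recv_side *r)
{
    assert(r != NULL);
    r->posted++;
}

/* Grant one credit only if a posted RECV is not already promised.
 * Returns false when the grant would outrun the posted buffers. */
static bool recv_side_grant(recv_side *r)
{
    assert(r != NULL);
    if (r->granted >= r->posted)
        return false;
    r->granted++;
    return true;
}

/* An incoming message consumes one RECV and the credit behind it.
 * Returns false when no buffer is available, i.e. the error case
 * that flow control is meant to prevent. */
static bool recv_side_on_message(recv_side *r)
{
    assert(r != NULL);
    if (r->posted == 0)
        return false;
    r->posted--;
    if (r->granted > 0)
        r->granted--;
    return true;
}
```

Calling recv_side_grant() before recv_side_post() is refused, which
mirrors the race described earlier in the thread: a grant that
precedes the post lets a SEND arrive with no buffer waiting.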

All of that should be clear in the IOV ownership rules and
discussion of the semantics of send/recv. If you thought you
saw something that implied any guarantees to the contrary
then could you point them out in a posting to the DAT reflector?
(or just send them to me or Arkady Kanevsky).