On Thu, 2007-05-10 at 23:10 -0400, Donald Kerr wrote:
> Caitlin Bestler wrote:
> >devel-bounces_at_[hidden] wrote:
> >>>There are two new issues so far:
> >>>1) this has uncovered a connection migration issue in the Chelsio
> >>>driver/firmware. We are developing and testing a fix for this now.
> >>>Should be ready tomorrow hopefully.
> >>I have a fix for the above issue and I can continue with OMPI testing.
> >>To work around the client-must-send issue, I put a nice fat
> >>sleep in the udapl btl right after it calls dat_cr_accept(),
> >>in mca_btl_udapl_accept_connect(). This, however, exposes
> >>another issue with the udapl btl:
> sleeping after accept? What are you trying to do here force a race
More like trying to work around the race condition that exists: The
server side sends an rdma message first thus violating the iwarp
protocol. For those who want the gory details: when the server sends
first -and- that rdma message arrives at the client _before_ the client
transitions into rdma mode, then that rdma message gets passed up to the
host driver as streaming mode data. So when I originally ran OMPI/udapl
on chelsio's rnic, the client actually received the mpa start response
-and- the first fpdu from the server while still in streaming mode.
This results in a connection abort because it violates the mpa startup
By causing the server to sleep just after accepting the connection, I
give the client time to transition into rdma mode...
It was just a hack to continue testing OMPI/udapl/chelsio. And it
revealed the problem being discussed in this thread: OMPI udapl btl
doesn't pre-post the recvs for the sendrecv exchange.
> >>Neither the client nor the server side of the udapl btl
> >>connection setup pre-post RECV buffers before connecting.
> >>This can allow a SEND to arrive before a RECV buffer is
> >>available. I _think_ IB will handle this issue by
> >>retransmitting the SEND. Chelsio's iWARP device, however,
> >>TERMINATEs the connection. My sleep() makes this condition
> >>happen every time.
> >A compliant DAPL program also ensures that there are adequate
> >receive buffers in place before the remote peer Sends. It is
> >explicitly noted that failure to follow this real will invoke
> >a transport/device dependent penalty. It may be that the sendq
> >will be fenced, or it may be that the connection will be terminated.
> >So any RDMA BTL should pre-post recv buffers before initiating or
> >accepting a connection.
> I know of no udapl restiction saying a recv must be posted before a send.
> And yes we do pre post recv buffers but since the BTL creates 2
> connections per peer, one for eager size messages and one for max size
> messages the BTL needs to know which connection the current endpoint is
> to service so that it can post the proper size recv buffer.
> Also, I agree in theory the btl could potentially post the recv which
> currently occurs in mca_btl_udapl_sendrecv before the connect or accept
> but I think in practise we had issue doing this and we had to wait until
> a DAT_CONNECTION_EVENT_ESTABLISHED was received.
Well I'm trying it now. It should work. If it doesn't, then dapl or
the underlying providers are broken.
> >>>From what I can tell, the udapl btl exchanges memory info as a first
> >>order of business after connection establishment
> >>(mba_btl_udapl_sendrecv(). The RECV buffer post for this
> >>exchange, however, should really be done _before_ the
> >>dat_ep_connect() on the active side, and _before_ the
> >>dat_cr_accept() on the server side.
> >>Currently its done after the ESTABLISHED event is dequeued,
> >>thus allowing the race condition.
> >>I believe the rules are the ULP must ensure that a RECV is
> >>posted before the client can post a SEND for that buffer.
> >>And further, the ULP must enforce flow control somehow so
> >>that a SEND never arrives without a RECV buffer being available.
> maybe this is a rule iwarp imposes on its ULPs but not uDAPL.
> >>Perhaps this is just a bug and I opened it up with my sleep()
> >>Or is the uDAPL btl assuming the transport will deal with
> >>lack of RECV buffer at the time a SEND arrives?
> There may be a race condition here but you really have to try hard to
> see it.
I agree its a small race condition that I made very large by my
sleep! :-) But I can kill 2 birds with one stone here: I'm testing now
a change to the sendrecv exchange to:
1) prepost the recvs just after dat_ep_create
2) force the side that issues the dat_ep_connect to send its addr-data
3) force the side that issues the dat_cr_accept to wait for the send
from the peer, then post its addr-data send
This will plug both race conditions. I'll post the patch once I debug
it and we can discuss if you thinks a good solution and/or if it work
for solaris udapl.
> From Steve previously.
> "Also: Given there is a message exchange _always_ after connection
> setup, then we can change that exchange to support the
> client-must-send-first issue..."
> I agree I am sure we can do something but if it includes an additional
> message we should consider a mca parameter to govern this because the
> connection wireup is already costly enough.
It won't add an additional message. Stay tuned for a patch!