Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

From: Galen Shipman (gshipman_at_[hidden])
Date: 2007-05-11 14:53:39


>
> More like trying to work around the race condition that exists: The
> server side sends an rdma message first thus violating the iwarp
> protocol. For those who want the gory details: when the server sends
> first -and- that rdma message arrives at the client _before_ the
> client
> transitions into rdma mode, then that rdma message gets passed up
> to the
> host driver as streaming mode data. So when I originally ran OMPI/
> udapl
> on chelsio's rnic, the client actually received the mpa start response
> -and- the first fpdu from the server while still in streaming mode.
> This results in a connection abort because it violates the mpa startup
> protocol...
>

I would recommend making a state transition diagram of the lazy
connection establishment. I did this during implementation of the
Open IB BTL. Of course, I threw it out as soon as the code was
committed :-).
This shouldn't take very long and then you can simply modify the
state diagram to prevent these race conditions, perhaps identifying
an existing state where you can post any buffers that you need. Don't
forget to throw away the state diagram as soon as the code is
committed ;-).

- Galen

> By causing the server to sleep just after accepting the connection, I
> give the client time to transition into rdma mode...
>
> It was just a hack to continue testing OMPI/udapl/chelsio. And it
> revealed the problem being discussed in this thread: OMPI udapl btl
> doesn't pre-post the recvs for the sendrecv exchange.
>
>
>>>> Neither the client nor the server side of the udapl btl
>>>> connection setup pre-post RECV buffers before connecting.
>>>> This can allow a SEND to arrive before a RECV buffer is
>>>> available. I _think_ IB will handle this issue by
>>>> retransmitting the SEND. Chelsio's iWARP device, however,
>>>> TERMINATEs the connection. My sleep() makes this condition
>>>> happen every time.
>>>>
>>>>
>>>>
>>>
>>> A compliant DAPL program also ensures that there are adequate
>>> receive buffers in place before the remote peer Sends. It is
>>> explicitly noted that failure to follow this real will invoke
>>> a transport/device dependent penalty. It may be that the sendq
>>> will be fenced, or it may be that the connection will be terminated.
>>>
>>> So any RDMA BTL should pre-post recv buffers before initiating or
>>> accepting a connection.
>>>
>>>
>>>
>> I know of no udapl restiction saying a recv must be posted before
>> a send.
>> And yes we do pre post recv buffers but since the BTL creates 2
>> connections per peer, one for eager size messages and one for max
>> size
>> messages the BTL needs to know which connection the current
>> endpoint is
>> to service so that it can post the proper size recv buffer.
>>
>> Also, I agree in theory the btl could potentially post the recv which
>> currently occurs in mca_btl_udapl_sendrecv before the connect or
>> accept
>> but I think in practise we had issue doing this and we had to wait
>> until
>> a DAT_CONNECTION_EVENT_ESTABLISHED was received.
>>
>
> Well I'm trying it now. It should work. If it doesn't, then dapl or
> the underlying providers are broken.
>
>>>
>>>
>>>
>>>>> From what I can tell, the udapl btl exchanges memory info as a
>>>>> first
>>>>>
>>>>>
>>>> order of business after connection establishment
>>>> (mba_btl_udapl_sendrecv(). The RECV buffer post for this
>>>> exchange, however, should really be done _before_ the
>>>> dat_ep_connect() on the active side, and _before_ the
>>>> dat_cr_accept() on the server side.
>>>> Currently its done after the ESTABLISHED event is dequeued,
>>>> thus allowing the race condition.
>>>>
>>>> I believe the rules are the ULP must ensure that a RECV is
>>>> posted before the client can post a SEND for that buffer.
>>>> And further, the ULP must enforce flow control somehow so
>>>> that a SEND never arrives without a RECV buffer being available.
>>>>
>>>>
>> maybe this is a rule iwarp imposes on its ULPs but not uDAPL.
>>
>>>> Perhaps this is just a bug and I opened it up with my sleep()
>>>>
>>>> Or is the uDAPL btl assuming the transport will deal with
>>>> lack of RECV buffer at the time a SEND arrives?
>>>>
>>>>
>> There may be a race condition here but you really have to try hard to
>> see it.
>>
>
> I agree its a small race condition that I made very large by my
> sleep! :-) But I can kill 2 birds with one stone here: I'm
> testing now
> a change to the sendrecv exchange to:
>
> 1) prepost the recvs just after dat_ep_create
>
> 2) force the side that issues the dat_ep_connect to send its addr-data
> first
>
> 3) force the side that issues the dat_cr_accept to wait for the send
> from the peer, then post its addr-data send
>
> This will plug both race conditions. I'll post the patch once I debug
> it and we can discuss if you thinks a good solution and/or if it work
> for solaris udapl.
>
>> From Steve previously.
>> "Also: Given there is a message exchange _always_ after connection
>> setup, then we can change that exchange to support the
>> client-must-send-first issue..."
>>
>> I agree I am sure we can do something but if it includes an
>> additional
>> message we should consider a mca parameter to govern this because the
>> connection wireup is already costly enough.
>>
>
> It won't add an additional message. Stay tuned for a patch!
>
> Steve.
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel