Ok, I think we're mostly converged on a solution. This might not get
implemented immediately (got some other pending v1.3 stuff to bug fix,
etc.), but it'll happen for v1.3.
- endpoint creation will mpool alloc/register a small buffer for the
CTS handshake
- cpc does not need to call _post_recvs(); instead, it can just post
the single small buffer on each BSRQ QP (from the small buffer on the
endpoint)
- cpc will call _connected() (in the main thread, not the CPC progress
thread) when all BSRQ QPs are connected
  - if _post_recvs() was previously called, do the normal "finish
setting up" stuff and declare the endpoint CONNECTED
  - if _post_recvs() was not previously called, then:
    - call _post_recvs()
    - send a short CTS message on the 1st BSRQ QP
    - wait for CTS from peer
    - when the CTS from the peer has arrived *and* we have sent our own
CTS, declare the endpoint CONNECTED (see the sketch below)
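In rough C, the flow could look something like this -- a sketch only,
with made-up names (the real code would live in btl_openib_endpoint.c
and use the BTL's own types):

#include <stdbool.h>

typedef enum { EP_CONNECTING, EP_CONNECTED } ep_state_t;

typedef struct endpoint {
    ep_state_t state;
    bool posted_recvs;    /* did _post_recvs() already run?       */
    bool cts_sent;        /* our CTS went out on the 1st BSRQ QP  */
    bool cts_received;    /* the peer's CTS has arrived           */
} endpoint_t;

static void post_recvs(endpoint_t *ep) { (void) ep; /* post "real" buffers */ }
static void send_cts(endpoint_t *ep)   { (void) ep; /* CTS on 1st BSRQ QP  */ }

static void maybe_finish(endpoint_t *ep)
{
    /* CONNECTED only when our CTS is out *and* the peer's is in */
    if (ep->cts_sent && ep->cts_received) {
        ep->state = EP_CONNECTED;    /* pending messages may now flow */
    }
}

/* CPC calls this in the main thread once all BSRQ QPs are connected */
static void endpoint_connected(endpoint_t *ep)
{
    if (ep->posted_recvs) {
        /* OOB/XOOB case: receives already posted; no extra handshake */
        ep->state = EP_CONNECTED;
        return;
    }
    post_recvs(ep);
    send_cts(ep);
    ep->cts_sent = true;    /* or set this from the send completion */
    maybe_finish(ep);
}

/* receive path: the peer's CTS landed in the small preposted buffer */
static void handle_cts(endpoint_t *ep)
{
    ep->cts_received = true;
    maybe_finish(ep);
}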
Doing it this way adds no overhead to OOB/XOOB (which don't need this
extra handshake). I think the code can be factored nicely so that this
isn't too complicated.
I'll work on this once I figure out the memory corruption I'm seeing
in the receive_queues patch...
Note that this addresses the wireup multi-threading issues -- not
iWARP SRQ issues. We'll tackle those separately, and possibly not for
the initial v1.3.0 release.
On May 20, 2008, at 6:02 AM, Gleb Natapov wrote:
> On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:
>>>> 5. ...?
>>> What about moving the posting of receive buffers into the main
>>> thread? With SRQ it is easy: don't post anything in the CPC thread.
>>> The main thread will prepost buffers automatically after the first
>>> fragment is received on the endpoint (in
>>> btl_openib_handle_incoming()). With PPRQ it's more complicated.
>>> What if we prepost dummy buffers (not from the free list) during
>>> the IBCM connection stage and run another three-way protocol using
>>> those buffers, but from the main thread? We would need to prepost
>>> one buffer on the active side and two buffers on the passive side.
>> This is probably the most viable alternative -- it would be easiest if
>> we did this for all CPCs, not just for IBCM:
>> - for PPRQ: CPCs only post a small number of receive buffers, enough
>> for another handshake that will run in the upper-level openib BTL
>> - for SRQ: CPCs don't post anything (because the SRQ already belongs
>> to the upper-level openib BTL)
>> Do we have a BSRQ restriction that there *must* be at least one PPRQ?
> No. We don't have such a restriction and I wouldn't want to add it.
>> If so, we could always run the upper-level openib BTL really-post-
>> buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e.,
>> have the CPC post a single receive on this QP -- see below), which
>> would make things much easier. If we don't already have this
>> restriction, would we mind adding it? We have one PPRQ in our
>> receive_queues value, anyway.
> If there is no PPRQ then we can rely on RNR/retransmit logic in case
> there are not enough buffers in the SRQ. We do that anyway in the
> openib BTL.
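(For reference, the RNR behavior Gleb mentions is controlled by two
standard verbs QP attributes set during the RTR/RTS transitions.  A
sketch, assuming the caller has already filled in the other mandatory
fields and masks for each transition:)

#include <infiniband/verbs.h>

static int set_rnr_tolerance(struct ibv_qp *qp,
                             struct ibv_qp_attr *rtr, int rtr_mask,
                             struct ibv_qp_attr *rts, int rts_mask)
{
    /* responder: how long the peer backs off after an RNR NAK */
    rtr->min_rnr_timer = 12;                 /* encoding for ~0.64 ms */
    if (ibv_modify_qp(qp, rtr, rtr_mask | IBV_QP_MIN_RNR_TIMER)) {
        return -1;
    }
    /* requester: 7 == "retry forever" on RNR NAK, so a temporarily
       empty SRQ stalls the sender instead of breaking the connection */
    rts->rnr_retry = 7;
    return ibv_modify_qp(qp, rts, rts_mask | IBV_QP_RNR_RETRY);
}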
>> With this rationale, once the CPC says "ok, all BSRQ QPs are
>> connected", then _endpoint.c can run a CTS handshake to post the
>> "real" buffers, where each side does the following:
>> - CPC calls _endpoint_connected() to tell the upper level BTL that it
>> is fully connected (the function is invoked in the main thread)
>> - _endpoint_connected() posts all the "real" buffers to all the BSRQ
>> QPs on the endpoint
>> - _endpoint_connected() then sends a CTS control message to the
>> remote peer via the smallest RC PPRQ
>> - upon receipt of CTS:
>>   - release the buffer (***)
>>   - set the endpoint state to CONNECTED and let all pending messages
>> flow... (as it happens today)
>> So it actually doesn't even have to be a handshake -- it's just an
>> additional CTS sent over the newly-created RC QP. Since it's RC, we
>> don't have to do much -- just wait for the CTS to know that the other
>> side has actually posted all the receives that we expect it to have.
>> Since the CTS flows over a PPRQ, there's no issue about receiving the
>> CTS on an SRQ (because the SRQ may not have any buffers posted at any
>> given time).
> Correct. A full handshake is not needed. The trick is to allocate
> those initial buffers in a smart way. IMO the initial buffer should
> be very small (a couple of bytes only) and be preallocated at
> endpoint creation. This will solve the locking problem.
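Concretely, preposting that tiny per-endpoint buffer is a single
ibv_post_recv() of a small registered region -- a sketch (the wrapper
name is made up, and buf/mr stand in for the endpoint's mpool
allocation):

#include <stdint.h>
#include <infiniband/verbs.h>

static int prepost_cts_buffer(struct ibv_qp *qp, struct ibv_mr *mr,
                              void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,
        .length = len,               /* a couple of bytes is enough */
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = {
        .wr_id   = (uintptr_t) buf,
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad_wr;
    return ibv_post_recv(qp, &wr, &bad_wr);
}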