
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] Threaded progress for CPCs
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-21 09:26:15


One more point that Pasha and I hashed out yesterday in IM...

To avoid the problem of posting a short handshake buffer to already-
existing SRQs, we will only do the extra handshake if there are PPRQs
in receive_queues. The handshake will go across the smallest PPRQ and
will represent all QPs in receive_queues (even the SRQs).

If there are no PPRQs in the receive_queues value, we'll just skip
the handshake and rely on IB's SRQ RNR retry/retransmit behavior to
fix any race conditions.
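The rule above can be sketched in C. This is a minimal illustration only, not Open MPI code: the `qp_desc_t` type, the `QP_PP`/`QP_SRQ` tags, and the `choose_handshake_qp()` name are all invented for this example.

```c
/* Illustrative sketch (not Open MPI code) of the handshake rule:
 * run the extra CTS handshake only if receive_queues contains at
 * least one per-peer QP (PPRQ), and run it over the smallest one;
 * with no PPRQ, skip the handshake and rely on RNR retransmit. */
#include <stddef.h>

typedef enum { QP_PP, QP_SRQ } qp_type_t;

typedef struct {
    qp_type_t type;
    size_t    size;   /* receive buffer size for this QP */
} qp_desc_t;

/* Return the index of the smallest PPRQ, or -1 if there is none
 * (meaning: skip the handshake, rely on SRQ RNR retransmit). */
int choose_handshake_qp(const qp_desc_t *qps, int nqps)
{
    int best = -1;
    for (int i = 0; i < nqps; ++i) {
        if (qps[i].type == QP_PP &&
            (best < 0 || qps[i].size < qps[best].size)) {
            best = i;
        }
    }
    return best;
}
```

The single chosen PPRQ then stands in for all QPs in receive_queues, including the SRQs, since one successful CTS exchange proves the peer has finished wiring up the whole endpoint.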

One point that needs clarification: whether IBCM and RDMACM *require*
posting receive buffers on the new QPs. If so, this scheme will run
into trouble, because we do not want to post any buffers on SRQs; that
gets racy and difficult to synchronize correctly (especially if
multiple remote peers are simultaneously trying to connect to a single
SRQ). I'll check this out today or tomorrow.

We'll have to revisit this when iWARP NICs start supporting SRQ, but
if the above assumption holds (no need to post any receive buffers
for IBCM and RDMACM), it will be good enough for v1.3.
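The CONNECTED condition discussed in the quoted thread below (declare the endpoint CONNECTED only once our CTS has been sent *and* the peer's CTS has arrived) can be sketched as follows. Again, this is illustrative only: the `endpoint_t` struct and the function names are invented, not the actual openib BTL code.

```c
/* Illustrative sketch (not Open MPI code) of the CTS rule: an
 * endpoint becomes CONNECTED only after both our CTS has been sent
 * and the peer's CTS has been received, in either order. */
#include <stdbool.h>

typedef struct {
    bool cts_sent;       /* our CTS send has completed   */
    bool cts_received;   /* peer's CTS has arrived       */
    bool connected;      /* endpoint declared CONNECTED  */
} endpoint_t;

/* Flip the endpoint to CONNECTED once both events have happened. */
static void maybe_connected(endpoint_t *ep)
{
    if (ep->cts_sent && ep->cts_received) {
        ep->connected = true;
    }
}

void on_cts_sent(endpoint_t *ep)
{
    ep->cts_sent = true;
    maybe_connected(ep);
}

void on_cts_received(endpoint_t *ep)
{
    ep->cts_received = true;
    maybe_connected(ep);
}
```

Because the CTS flows over a reliable-connected (RC) PPRQ, receiving it is enough to know the peer has posted its "real" receive buffers; no third handshake leg is needed.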

On May 20, 2008, at 12:37 PM, Jeff Squyres wrote:

> Ok, I think we're mostly converged on a solution. This might not get
> implemented immediately (I've got some other pending v1.3 stuff to
> bug-fix, etc.), but it'll happen for v1.3.
>
> - endpoint creation will mpool alloc/register a small buffer for the
>   handshake
> - the CPC does not need to call _post_recvs(); instead, it can just
>   post the single small buffer on each BSRQ QP (from the small buffer
>   on the endpoint)
> - the CPC will call _connected() (in the main thread, not the CPC
>   progress thread) when all BSRQ QPs are connected
> - if _post_recvs() was previously called, do the normal "finish
>   setting up" stuff and declare the endpoint CONNECTED
> - if _post_recvs() was not previously called, then:
>   - call _post_recvs()
>   - send a short CTS message on the 1st BSRQ QP
>   - wait for the CTS from the peer
>   - when the peer's CTS has arrived *and* we have sent our own CTS,
>     declare the endpoint CONNECTED
>
> Doing it this way adds no overhead for OOB/XOOB (which don't need
> this extra handshake). I think the code can be factored nicely to
> make this not too complicated.
>
> I'll work on this once I figure out the memory corruption I'm seeing
> in the receive_queues patch...
>
> Note that this addresses the wireup multi-threading issues -- not the
> iWARP SRQ issues. We'll tackle those separately, and possibly not for
> the initial v1.3.0 release.
>
>
> On May 20, 2008, at 6:02 AM, Gleb Natapov wrote:
>
>> On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:
>>>>> 5. ...?
>>>> What about moving the posting of receive buffers into the main
>>>> thread? With SRQ it is easy: don't post anything in the CPC
>>>> thread. The main thread will prepost buffers automatically after
>>>> the first fragment is received on the endpoint (in
>>>> btl_openib_handle_incoming()). With PPRQ it's more complicated.
>>>> What if we prepost dummy buffers (not from a free list) during the
>>>> IBCM connection stage and run another three-way handshake protocol
>>>> using those buffers, but from the main thread? We would need to
>>>> prepost one buffer on the active side and two buffers on the
>>>> passive side.
>>>
>>>
>>> This is probably the most viable alternative -- it would be easiest
>>> if we did this for all CPCs, not just for IBCM:
>>>
>>> - for PPRQ: CPCs only post a small number of receive buffers,
>>>   suitable for another handshake that will run in the upper-level
>>>   openib BTL
>>> - for SRQ: CPCs don't post anything (because the SRQ already
>>>   "belongs" to the upper-level openib BTL)
>>>
>>> Do we have a BSRQ restriction that there *must* be at least one
>>> PPRQ?
>> No. We don't have such a restriction, and I wouldn't want to add
>> one.
>>
>>> If so, we could always run the upper-level openib BTL
>>> really-post-the-buffers handshake over the smallest-buffer-size
>>> BSRQ RC PPRQ (i.e., have the CPC post a single receive on this QP
>>> -- see below), which would make things much easier. If we don't
>>> already have this restriction, would we mind adding it? We have one
>>> PPRQ in our default receive_queues value, anyway.
>> If there is no PPRQ then we can rely on the RNR/retransmit logic in
>> case there are not enough buffers in the SRQ. We do that anyway in
>> the openib BTL code.
>>
>>>
>>> With this rationale, once the CPC says "ok, all BSRQ QPs are
>>> connected", then _endpoint.c can run a CTS handshake to post the
>>> "real" buffers, where each side does the following:
>>>
>>> - the CPC calls _endpoint_connected() to tell the upper-level BTL
>>>   that it is fully connected (the function is invoked in the main
>>>   thread)
>>> - _endpoint_connected() posts all the "real" buffers to all the
>>>   BSRQ QPs on the endpoint
>>> - _endpoint_connected() then sends a CTS control message to the
>>>   remote peer via the smallest RC PPRQ
>>> - upon receipt of the CTS:
>>>   - release the buffer (***)
>>>   - set the endpoint state to CONNECTED and let all pending
>>>     messages flow... (as happens today)
>>>
>>> So it actually doesn't even have to be a handshake -- it's just an
>>> additional CTS sent over the newly-created RC QP. Since it's RC, we
>>> don't have to do much -- just wait for the CTS to know that the
>>> remote side has actually posted all the receives that we expect it
>>> to have. Since the CTS flows over a PPRQ, there's no issue about
>>> receiving the CTS on an SRQ (because the SRQ may not have any
>>> buffers posted at any given time).
>> Correct. A full handshake is not needed. The trick is to allocate
>> those initial buffers in a smart way. IMO the initial buffer should
>> be very small (a couple of bytes only) and be preallocated at
>> endpoint creation. This will solve the locking problem.
>>
>> --
>> Gleb.
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> Cisco Systems
>

-- 
Jeff Squyres
Cisco Systems