Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Threaded progress for CPCs
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-20 12:37:10


Ok, I think we're mostly converged on a solution. This might not get
implemented immediately (got some other pending v1.3 stuff to bug fix,
etc.), but it'll happen for v1.3.

- endpoint creation will mpool alloc/register a small buffer for
handshake
- cpc does not need to call _post_recvs(); instead, it can just post
the single small buffer on each BSRQ QP (from the small buffer on the
endpoint)
- cpc will call _connected() (in the main thread, not the CPC progress
thread) when all BSRQ QPs are connected
   - if _post_recvs() was previously called, do the normal "finish
setting up" stuff and declare the endpoint CONNECTED
   - if _post_recvs() was not previously called, then:
     - call _post_recvs()
     - send a short CTS message on the 1st BSRQ QP
     - wait for CTS from peer
     - when the CTS from the peer has arrived *and* we have sent our CTS,
declare endpoint CONNECTED
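
As a rough illustration of the steps above, here is a minimal sketch of the endpoint logic as a small state machine. It is not the openib BTL code: the type, the state names, and the helper functions (endpoint_t, EP_WAIT_CTS, post_recvs(), send_cts(), etc.) are all hypothetical stand-ins, and the actual verbs calls are replaced by flags so that only the ordering is shown.

```c
#include <stdbool.h>

/* Hypothetical endpoint states; not the real openib BTL enum. */
typedef enum { EP_CONNECTING, EP_WAIT_CTS, EP_CONNECTED } ep_state_t;

/* Hypothetical endpoint; the real one carries QPs, buffers, locks, etc. */
typedef struct {
    ep_state_t state;
    bool recvs_posted;  /* has _post_recvs() already been called? */
    bool cts_sent;      /* our CTS went out on the 1st BSRQ QP */
    bool cts_received;  /* the peer's CTS has arrived */
} endpoint_t;

/* Stand-in for _post_recvs(): post the "real" receive buffers. */
static void post_recvs(endpoint_t *ep) { ep->recvs_posted = true; }

/* Stand-in for sending the short CTS message on the 1st BSRQ QP. */
static void send_cts(endpoint_t *ep) { ep->cts_sent = true; }

/* Declare CONNECTED only once both halves of the CTS exchange are done. */
static void maybe_finish(endpoint_t *ep)
{
    if (ep->cts_sent && ep->cts_received) {
        ep->state = EP_CONNECTED;
    }
}

/* Stand-in for _connected(): invoked in the main thread once the CPC
 * reports that all BSRQ QPs are connected. */
void endpoint_connected(endpoint_t *ep)
{
    if (ep->recvs_posted) {
        /* OOB/XOOB-style path: buffers are already posted, no extra CTS. */
        ep->state = EP_CONNECTED;
        return;
    }
    post_recvs(ep);
    send_cts(ep);
    ep->state = EP_WAIT_CTS;
    maybe_finish(ep);  /* the peer's CTS may already have arrived */
}

/* Invoked when the peer's CTS lands in the small handshake buffer. */
void endpoint_cts_arrived(endpoint_t *ep)
{
    ep->cts_received = true;
    maybe_finish(ep);
}
```

In this sketch the peer's CTS can show up before or after endpoint_connected() runs; in both orders the endpoint only becomes CONNECTED after our own CTS has been sent.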

Doing it this way adds no overhead to OOB/XOOB (which don't need this
extra handshake). I think the code can be factored nicely to make
this not too complicated.

I'll work on this once I figure out the memory corruption I'm seeing
in the receive_queues patch...

Note that this addresses the wireup multi-threading issues -- not
iWarp SRQ issues. We'll tackle those separately, and possibly not for
the initial v1.3.0 release.

On May 20, 2008, at 6:02 AM, Gleb Natapov wrote:

> On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:
>>>> 5. ...?
>>> What about moving posting of receive buffers into main thread. With
>>> SRQ it is easy: don't post anything in CPC thread. Main thread will
>>> prepost buffers automatically after first fragment received on the
>>> endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
>>> complicated. What if we'll prepost dummy buffers (not from free list)
>>> during IBCM connection stage and will run another three way handshake
>>> protocol using those buffers, but from the main thread. We will need to
>>> prepost one buffer on the active side and two buffers on the passive
>>> side.
>>
>>
>> This is probably the most viable alternative -- it would be easiest if
>> we did this for all CPC's, not just for IBCM:
>>
>> - for PPRQ: CPCs only post a small number of receive buffers, suitable
>> for another handshake that will run in the upper-level openib BTL
>> - for SRQ: CPCs don't post anything (because the SRQ already "belongs"
>> to the upper level openib BTL)
>>
>> Do we have a BSRQ restriction that there *must* be at least one PPRQ?
> No. We don't have such restriction and I wouldn't want to add it.
>
>> If so, we could always run the upper-level openib BTL
>> really-post-the-buffers handshake over the smallest buffer size BSRQ
>> RC PPRQ (i.e., have the CPC post a single receive on this QP -- see
>> below), which would make things much easier. If we don't already have
>> this restriction, would we mind adding it? We have one PPRQ in our
>> default receive_queues value, anyway.
> If there is no PPRQ then we can rely on RNR/retransmit logic in case
> there are not enough buffers in the SRQ. We do that anyway in openib BTL
> code.
>
>>
>> With this rationale, once the CPC says "ok, all BSRQ QP's are
>> connected", then _endpoint.c can run a CTS handshake to post the
>> "real" buffers, where each side does the following:
>>
>> - CPC calls _endpoint_connected() to tell the upper level BTL that it
>> is fully connected (the function is invoked in the main thread)
>> - _endpoint_connected() posts all the "real" buffers to all the BSRQ
>> QP's on the endpoint
>> - _endpoint_connected() then sends a CTS control message to remote
>> peer via smallest RC PPRQ
>> - upon receipt of CTS:
>>    - release the buffer (***)
>>    - set endpoint state of CONNECTED and let all pending messages
>> flow... (as it happens today)
>>
>> So it actually doesn't even have to be a handshake -- it's just an
>> additional CTS sent over the newly-created RC QP. Since it's RC, we
>> don't have to do much -- just wait for the CTS to know that the remote
>> side has actually posted all the receives that we expect it to have.
>> Since the CTS flows over a PPRQ, there's no issue about receiving the
>> CTS on an SRQ (because the SRQ may not have any buffers posted at any
>> given time).
> Correct. Full handshake is not needed. The trick is to allocate those
> initial buffers in a smart way. IMO initial buffer should be very
> small (a couple of bytes only) and be preallocated on endpoint
> creation. This will solve locking problem.
>
> --
> Gleb.

-- 
Jeff Squyres
Cisco Systems