Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Threaded progress for CPCs
From: Gleb Natapov (glebn_at_[hidden])
Date: 2008-05-20 06:02:21

On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:
> >> 5. ...?
> > What about moving posting of receive buffers into main thread. With
> > SRQ it is easy: don't post anything in CPC thread. Main thread will
> > prepost buffers automatically after first fragment received on the
> > endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
> > complicated. What if we'll prepost dummy buffers (not from free list)
> > during IBCM connection stage and will run another three way handshake
> > protocol using those buffers, but from the main thread. We will need
> > to prepost one buffer on the active side and two buffers on the
> > passive side.
> This is probably the most viable alternative -- it would be easiest if
> we did this for all CPC's, not just for IBCM:
> - for PPRQ: CPCs only post a small number of receive buffers, suitable
> for another handshake that will run in the upper-level openib BTL
> - for SRQ: CPCs don't post anything (because the SRQ already "belongs"
> to the upper level openib BTL)
> Do we have a BSRQ restriction that there *must* be at least one PPRQ?
No. We don't have such a restriction, and I wouldn't want to add one.

> If so, we could always run the upper-level openib BTL really-post-the-
> buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e.,
> have the CPC post a single receive on this QP -- see below), which
> would make things much easier. If we don't already have this
> restriction, would we mind adding it? We have one PPRQ in our default
> receive_queues value, anyway.
If there is no PPRQ, then we can rely on the RNR/retransmit logic in case
there are not enough buffers in the SRQ. We do that anyway in the openib
BTL code.
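
To make the RNR fallback concrete, here is a toy model of it, nothing
more. On real hardware the retry is done by the HCA (as far as I know it
is governed by the QP's rnr_retry and min_rnr_timer attributes), not by
a software loop. All names below (toy_srq, toy_srq_deliver,
toy_send_with_rnr_retry, refill_at) are made up for illustration and do
not exist in libibverbs or Open MPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: none of these names come from libibverbs or Open MPI. */
struct toy_srq {
    int posted;   /* receive buffers currently available in the SRQ */
};

/* Receiver side: consume one buffer if available, otherwise report
 * "receiver not ready" (the analogue of an RNR NAK). */
bool toy_srq_deliver(struct toy_srq *srq)
{
    if (srq->posted == 0) {
        return false;
    }
    srq->posted--;
    return true;
}

/* Sender side: try once, then retry up to max_retries times.
 * refill_at is the (zero-based) attempt on which the receiver is assumed
 * to repost one buffer. Returns the number of attempts used, or -1 if
 * the retries ran out. */
int toy_send_with_rnr_retry(struct toy_srq *srq, int max_retries, int refill_at)
{
    assert(max_retries >= 0);
    for (int attempt = 0; attempt <= max_retries; attempt++) {
        if (attempt == refill_at) {
            srq->posted += 1;   /* receiver's main thread reposts a buffer */
        }
        if (toy_srq_deliver(srq)) {
            return attempt + 1;
        }
    }
    return -1;
}
```

The point is only that a send landing on an empty SRQ is delayed rather
than lost, as long as the receiver reposts before the retries run out.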

> With this rationale, once the CPC says "ok, all BSRQ QP's are
> connected", then _endpoint.c can run a CTS handshake to post the
> "real" buffers, where each side does the following:
> - CPC calls _endpoint_connected() to tell the upper level BTL that it
> is fully connected (the function is invoked in the main thread)
> - _endpoint_connected() posts all the "real" buffers to all the BSRQ
> QP's on the endpoint
> - _endpoint_connected() then sends a CTS control message to remote
> peer via smallest RC PPRQ
> - upon receipt of CTS:
>   - release the buffer (***)
>   - set endpoint state of CONNECTED and let all pending messages
>     flow... (as it happens today)
> So it actually doesn't even have to be a handshake -- it's just an
> additional CTS sent over the newly-created RC QP. Since it's RC, we
> don't have to do much -- just wait for the CTS to know that the remote
> side has actually posted all the receives that we expect it to have.
> Since the CTS flows over a PPRQ, there's no issue about receiving the
> CTS on an SRQ (because the SRQ may not have any buffers posted at any
> given time).
Correct. A full handshake is not needed. The trick is to allocate those
initial buffers in a smart way. IMO the initial buffer should be very
small (only a couple of bytes) and be preallocated at endpoint creation.
This will solve the locking problem.
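
A rough sketch of how the CTS step and the preallocated buffer could fit
together, as a plain state machine with no verbs calls. Everything here
is hypothetical (struct endpoint, cts_buf, cpc_post_cts_buffer,
endpoint_connected, endpoint_recv_cts, endpoint_send); it is not the
actual openib BTL code. One assumption of mine beyond what is described
above: the endpoint only becomes CONNECTED once it has both sent its own
CTS and received the peer's, so that an early CTS from the peer is
handled too.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; not the real openib BTL API. */
enum ep_state { EP_CONNECTING, EP_WAIT_CTS, EP_CONNECTED };

struct endpoint {
    enum ep_state state;
    unsigned char cts_buf[2];  /* tiny CTS receive buffer embedded in the
                                  endpoint, so no free list is touched */
    bool cts_buf_posted;       /* set by the CPC thread */
    bool real_bufs_posted;     /* set by the main thread */
    bool cts_sent;
    bool cts_received;
    size_t pending_sends;      /* queued while not yet CONNECTED */
    size_t flushed_sends;      /* handed to the (simulated) wire */
};

/* Endpoint creation: the CTS buffer comes with the struct itself. */
void endpoint_init(struct endpoint *ep)
{
    memset(ep, 0, sizeof(*ep));
    ep->state = EP_CONNECTING;
}

/* CPC thread: post only the small preallocated buffer on the PPRQ. */
void cpc_post_cts_buffer(struct endpoint *ep)
{
    ep->cts_buf_posted = true;
}

/* Go CONNECTED and release queued messages once both CTSes are done. */
static void maybe_connect(struct endpoint *ep)
{
    if (ep->cts_sent && ep->cts_received) {
        ep->state = EP_CONNECTED;
        ep->flushed_sends += ep->pending_sends;
        ep->pending_sends = 0;
    }
}

/* Main thread, after the CPC reports all QPs connected:
 * post the "real" buffers, then send CTS to the peer. */
void endpoint_connected(struct endpoint *ep)
{
    ep->real_bufs_posted = true;
    ep->cts_sent = true;
    if (ep->state == EP_CONNECTING) {
        ep->state = EP_WAIT_CTS;
    }
    maybe_connect(ep);
}

/* Main thread, on receipt of the peer's CTS: release the small buffer. */
void endpoint_recv_cts(struct endpoint *ep)
{
    assert(ep->cts_buf_posted);
    ep->cts_buf_posted = false;
    ep->cts_received = true;
    maybe_connect(ep);
}

/* Sends are queued until the endpoint is CONNECTED. */
void endpoint_send(struct endpoint *ep)
{
    if (ep->state == EP_CONNECTED) {
        ep->flushed_sends++;
    } else {
        ep->pending_sends++;
    }
}
```

Since the CPC thread only touches a buffer that belongs to the endpoint
itself, it never needs the free list, which is where the locking problem
would otherwise come from.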