Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Threaded progress for CPCs
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-05-19 13:38:53

On May 19, 2008, at 8:25 AM, Gleb Natapov wrote:

> Is it possible to have sane SRQ implementation without HW flow
> control?

It seems pretty unlikely if the only available HW flow control is to
terminate the connection. ;-)

>> Even if we can get the iWARP semantics to work, this feels kinda
>> icky. Perhaps I'm overreacting and this isn't a problem that needs
>> to be fixed -- after all, this situation is no different than what
>> happens after the initial connection, but it still feels icky.
> What is so icky about it? Sender is faster than a receiver so flow
> control kicks in.

My point is that we have no real flow control for SRQ.

>> 2. The CM progress thread posts its own receive buffers when creating
>> a QP (which is a necessary step in both CMs). However, this is
>> problematic in two cases:
> [skip]
> I don't like 1,2 and 3. :(
>> 4. Have a separate mpool for drawing initial receive buffers for the
>> CM-posted RQs. We'd probably want this mpool to be always empty (or
>> close to empty) -- it's ok to be slow to allocate / register more
>> memory when a new connection request arrives. The memory obtained
>> from this mpool should be able to be returned to the "main" mpool
>> after it is consumed.
> This is slightly better, but still...

Agreed; my reactions were pretty much the same as yours.

>> 5. ...?
> What about moving posting of receive buffers into main thread. With
> SRQ it is easy: don't post anything in CPC thread. Main thread will
> prepost buffers automatically after first fragment received on the
> endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
> complicated. What if we'll prepost dummy buffers (not from free list)
> during IBCM connection stage and will run another three way handshake
> protocol using those buffers, but from the main thread. We will need
> to prepost one buffer on the active side and two buffers on the
> passive side.

This is probably the most viable alternative -- it would be easiest if
we did this for all CPCs, not just for IBCM:

- for PPRQ: CPCs only post a small number of receive buffers, suitable
for another handshake that will run in the upper-level openib BTL
- for SRQ: CPCs don't post anything (because the SRQ already "belongs"
to the upper level openib BTL)

Do we have a BSRQ restriction that there *must* be at least one PPRQ?
If so, we could always run the upper-level openib BTL
really-post-the-buffers handshake over the smallest buffer size BSRQ RC
PPRQ (i.e., have the CPC post a single receive on this QP -- see
below), which would make things much easier. If we don't already have
this restriction, would we mind adding it? We have one PPRQ in our
default receive_queues value, anyway.
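
To make the "at least one PPRQ" idea concrete, here is a rough sketch of how the handshake QP could be picked and how many receives the CPC thread would post on each QP. The types and function names (qp_desc_t, find_smallest_pprq(), cpc_recvs_to_post()) are made up for illustration and are not taken from the openib BTL; the sketch also assumes the CPC posts exactly one receive on the chosen PPRQ and nothing anywhere else.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration; not the openib BTL's real structures. */
typedef enum { QP_KIND_PPRQ, QP_KIND_SRQ } qp_kind_t;

typedef struct {
    qp_kind_t kind;
    size_t buf_size;   /* receive buffer size for this BSRQ QP */
} qp_desc_t;

/* Return the index of the PPRQ with the smallest buffer size,
 * or -1 if the configuration has no PPRQ at all. */
static int find_smallest_pprq(const qp_desc_t *qps, int nqps)
{
    int best = -1;
    for (int i = 0; i < nqps; ++i) {
        if (qps[i].kind != QP_KIND_PPRQ) {
            continue;
        }
        if (best < 0 || qps[i].buf_size < qps[best].buf_size) {
            best = i;
        }
    }
    return best;
}

/* Number of receives the CPC thread posts on QP `idx`: a single one on
 * the handshake QP, none anywhere else (SRQs and the remaining PPRQs are
 * left to the upper-level BTL in the main thread). */
static int cpc_recvs_to_post(const qp_desc_t *qps, int nqps, int idx)
{
    assert(idx >= 0 && idx < nqps);
    return (idx == find_smallest_pprq(qps, nqps)) ? 1 : 0;
}
```

A return value of -1 from find_smallest_pprq() is the case the proposed restriction would rule out: with only SRQs configured there is no QP on which the CPC can safely post the handshake receive.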

With this rationale, once the CPC says "ok, all BSRQ QPs are
connected", then _endpoint.c can run a CTS handshake to post the
"real" buffers, where each side does the following:

- CPC calls _endpoint_connected() to tell the upper level BTL that it
  is fully connected (the function is invoked in the main thread)
- _endpoint_connected() posts all the "real" buffers to all the BSRQ
  QPs on the endpoint
- _endpoint_connected() then sends a CTS control message to the remote
  peer via the smallest RC PPRQ
- upon receipt of CTS:
   - release the buffer (***)
   - set the endpoint state to CONNECTED and let all pending messages
     flow... (as happens today)

So it actually doesn't even have to be a handshake -- it's just an
additional CTS sent over the newly-created RC QP. Since it's RC, we
don't have to do much -- just wait for the CTS to know that the remote
side has actually posted all the receives that we expect it to have.
Since the CTS flows over a PPRQ, there's no issue about receiving the
CTS on an SRQ (because the SRQ may not have any buffers posted at any
given time).
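
The CTS scheme described above can be modeled as a toy state machine. This is only a sketch under loud assumptions: endpoint_t and the three functions are hypothetical stand-ins (the real code lives in _endpoint.c and looks different), and "sending" the CTS over the RC QP is modeled as a direct function call on the peer, so delivery is immediate and reliable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical endpoint model; not the openib BTL's real endpoint type. */
typedef enum { EP_CONNECTING, EP_CONNECTED } ep_state_t;

typedef struct endpoint {
    ep_state_t state;
    bool real_recvs_posted;  /* "real" buffers posted on all BSRQ QPs */
    int pending_sends;       /* messages queued while not CONNECTED */
    int sent;                /* messages released to the wire */
    struct endpoint *peer;   /* stand-in for the RC QP to the remote side */
} endpoint_t;

/* Runs when the CTS from the peer arrives on the smallest PPRQ. */
static void endpoint_handle_cts(endpoint_t *ep)
{
    /* The peer only sends CTS after posting its real receives. */
    assert(ep->peer->real_recvs_posted);
    ep->state = EP_CONNECTED;
    ep->sent += ep->pending_sends;  /* let all pending messages flow */
    ep->pending_sends = 0;
}

/* Called (in the main thread) once the CPC reports all QPs connected. */
static void endpoint_connected(endpoint_t *ep)
{
    ep->real_recvs_posted = true;   /* post the "real" buffers */
    endpoint_handle_cts(ep->peer);  /* "send" the CTS to the remote peer */
}

/* Upper-level send: queue until the peer's CTS has been seen. */
static void endpoint_send(endpoint_t *ep)
{
    if (ep->state == EP_CONNECTED) {
        ep->sent++;
    } else {
        ep->pending_sends++;
    }
}
```

Note that in this model a side becomes CONNECTED only when the *peer's* CTS arrives, not when it sends its own. That is the whole point: the CTS is proof that the remote receives are posted, so sends queued before that moment stay pending.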

(***) The CTS can even be a zero byte message (maybe with inline data
if we need it?); we're just waiting for the *first* message to arrive
on the smallest BSRQ PPQP. Here's a dumb question (because I've never
tried it and am on a plane where I can't try it) -- can you post a 0
byte buffer (or NULL) for a receive? This would make returning the
buffer to the CPC much easier (i.e., you won't have to) because the
CPC [thread] will post the receive, but the upper level openib BTL
[main thread] will actually receive it.

We still have to solve what happens with iWARP on SRQs, but that's
really a different issue. I don't know if the iWARP vendors have
thought about this much yet...?

Jeff Squyres
Cisco Systems