Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Threaded progress for CPCs
From: Jon Mason (jon_at_[hidden])
Date: 2008-05-19 15:09:50

On Mon, May 19, 2008 at 01:38:53PM -0400, Jeff Squyres wrote:
> On May 19, 2008, at 8:25 AM, Gleb Natapov wrote:
> > Is it possible to have sane SRQ implementation without HW flow
> > control?
> It seems pretty unlikely if the only available HW flow control is to
> terminate the connection. ;-)
> >> Even if we can get the iWARP semantics to work, this feels kinda
> >> icky. Perhaps I'm overreacting and this isn't a problem that needs
> >> to be fixed -- after all, this situation is no different than what
> >> happens after the initial connection, but it still feels icky.
> > What is so icky about it? Sender is faster than a receiver so flow
> > control kicks in.
> My point is that we have no real flow control for SRQ.
> >> 2. The CM progress thread posts its own receive buffers when creating
> >> a QP (which is a necessary step in both CMs). However, this is
> >> problematic in two cases:
> >>
> > [skip]
> >
> > I don't like 1,2 and 3. :(
> >
> >> 4. Have a separate mpool for drawing initial receive buffers for the
> >> CM-posted RQs. We'd probably want this mpool to be always empty (or
> >> close to empty) -- it's ok to be slow to allocate / register more
> >> memory when a new connection request arrives. The memory obtained
> >> from this mpool should be able to be returned to the "main" mpool
> >> after it is consumed.
> >
> > This is slightly better, but still...
> Agreed; my reactions were pretty much the same as yours.
> >> 5. ...?
> > What about moving posting of receive buffers into main thread. With
> > SRQ it is easy: don't post anything in CPC thread. Main thread will
> > prepost buffers automatically after first fragment received on the
> > endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
> > complicated. What if we'll prepost dummy buffers (not from free list)
> > during IBCM connection stage and will run another three way handshake
> > protocol using those buffers, but from the main thread. We will need
> > to prepost one buffer on the active side and two buffers on the
> > passive side.
> This is probably the most viable alternative -- it would be easiest if
> we did this for all CPC's, not just for IBCM:
> - for PPRQ: CPCs only post a small number of receive buffers, suitable
> for another handshake that will run in the upper-level openib BTL
> - for SRQ: CPCs don't post anything (because the SRQ already "belongs"
> to the upper level openib BTL)
> Do we have a BSRQ restriction that there *must* be at least one PPRQ?
> If so, we could always run the upper-level openib BTL really-post-the-
> buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e.,
> have the CPC post a single receive on this QP -- see below), which
> would make things much easier. If we don't already have this
> restriction, would we mind adding it? We have one PPRQ in our default
> receive_queues value, anyway.
> With this rationale, once the CPC says "ok, all BSRQ QP's are
> connected", then _endpoint.c can run a CTS handshake to post the
> "real" buffers, where each side does the following:
> - CPC calls _endpoint_connected() to tell the upper level BTL that it
> is fully connected (the function is invoked in the main thread)
> - _endpoint_connected() posts all the "real" buffers to all the BSRQ
> QP's on the endpoint
> - _endpoint_connected() then sends a CTS control message to remote
> peer via smallest RC PPRQ
> - upon receipt of CTS:
>   - release the buffer (***)
>   - set endpoint state of CONNECTED and let all pending messages
>     flow... (as it happens today)
> So it actually doesn't even have to be a handshake -- it's just an
> additional CTS sent over the newly-created RC QP. Since it's RC, we
> don't have to do much -- just wait for the CTS to know that the remote
> side has actually posted all the receives that we expect it to have.
> Since the CTS flows over a PPRQ, there's no issue about receiving the
> CTS on an SRQ (because the SRQ may not have any buffers posted at any
> given time).
> (***) The CTS can even be a zero byte message (maybe with inline data
> if we need it?); we're just waiting for the *first* message to arrive
> on the smallest BSRQ PPQP. Here's a dumb question (because I've never
> tried it and am on a plane where I can't try it) -- can you post a 0
> byte buffer (or NULL) for a receive? This would make returning the
> buffer to the CPC much easier (i.e., you won't have to) because the
> CPC [thread] will post the receive, but the upper level openib BTL
> [main thread] will actually receive it.
> We still have to solve what happens with iWARP on SRQ's, but that's
> really a different issue. I don't know if the iWARP vendors have
> thought about this much yet...?

I like the idea of the CPC posting only enough buffers to handle its
own connection setup. This sounds like the best fit for RDMACM, and
there can even be HW-specific chunks if SRQ-enabled iWARP adapters have
needs different from others. It also removes the need to muck with
the credits if a QP gets torn down for a reconnection (or to set up
dummy QPs like we currently do in RDMACM).
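
To make the CTS scheme quoted above concrete, here is a rough sketch in
C of how the two sides might interact. It is only a toy model: every
type and function name in it (struct endpoint, cpc_post_initial(),
endpoint_connected(), endpoint_recv_cts(), and so on) is made up for
illustration and is not the real openib BTL or CPC interface. It does
not use verbs at all; "sending" the CTS is simulated by calling the
peer's handler directly, and the threads are implied rather than
created. It also assumes the restriction discussed above, that at least
one PPRQ exists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types and names for illustration only; none of this is
 * the actual openib BTL or CPC interface. */
enum qp_type  { QP_PPRQ, QP_SRQ };
enum ep_state { EP_CONNECTING, EP_WAIT_CTS, EP_CONNECTED };

struct qp {
    enum qp_type type;
    size_t buf_size;   /* receive buffer size of this BSRQ entry */
    int cpc_recvs;     /* receives posted by the CPC thread */
    int real_recvs;    /* receives posted by the main thread */
};

struct endpoint {
    struct qp *qps;
    size_t nqps;
    enum ep_state state;
};

/* Index of the PPRQ with the smallest buffers, or -1 if there is none. */
static int smallest_pprq(const struct endpoint *ep)
{
    int best = -1;
    for (size_t i = 0; i < ep->nqps; i++) {
        if (ep->qps[i].type != QP_PPRQ)
            continue;
        if (best < 0 || ep->qps[i].buf_size < ep->qps[best].buf_size)
            best = (int)i;
    }
    return best;
}

/* CPC thread: post one receive for the CTS on the smallest PPRQ and
 * nothing on any SRQ. Returns 0 on success, -1 if no PPRQ exists. */
static int cpc_post_initial(struct endpoint *ep)
{
    int i = smallest_pprq(ep);
    if (i < 0)
        return -1;
    ep->qps[i].cpc_recvs = 1;
    ep->state = EP_CONNECTING;
    return 0;
}

/* Main thread, on receipt of the peer's CTS: consume the CPC-posted
 * receive and mark the endpoint connected. */
static void endpoint_recv_cts(struct endpoint *ep)
{
    int i = smallest_pprq(ep);
    assert(i >= 0 && ep->qps[i].cpc_recvs > 0);
    ep->qps[i].cpc_recvs--;
    ep->state = EP_CONNECTED;
}

/* Main thread, after the CPC reports all QPs connected: post the "real"
 * receives on every QP, then send the CTS. Delivery is simulated by
 * calling the peer's handler directly. */
static void endpoint_connected(struct endpoint *ep, struct endpoint *peer,
                               int depth)
{
    for (size_t i = 0; i < ep->nqps; i++)
        ep->qps[i].real_recvs += depth;
    if (ep->state != EP_CONNECTED)
        ep->state = EP_WAIT_CTS;
    endpoint_recv_cts(peer);
}
```

In this model each side calls cpc_post_initial() during connection
setup and endpoint_connected() once its CPC is done. An endpoint only
becomes EP_CONNECTED when the peer's CTS arrives, i.e. once the peer has
posted its real receives, which is the property the scheme relies on.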


> --
> Jeff Squyres
> Cisco Systems