On Sun, May 18, 2008 at 11:38:36AM -0400, Jeff Squyres wrote:
> ==> Remember that the goal for this work was to have a separate
> progress thread *without* all the heavyweight OMPI thread locks.
> Specifically: make it work in a build without --enable-progress-
> threads or --enable-mpi-threads (we did some preliminary testing with
> that stuff enabled and it had a big performance impact).
> 1. When CM progress thread completes an incoming connection, it sends
> a command down a pipe to the main thread indicating that a new
> endpoint is ready to use. The pipe message will be noticed by
> opal_progress() in the main thread and will run a function to do all
> necessary housekeeping (sets the endpoint state to CONNECTED, etc.).
> But it is possible that the receiver process won't dip into the MPI
> layer for a long time (and therefore not call opal_progress and the
> housekeeping function). Therefore, it is possible that with an active
> sender and a slow receiver, the sender can overwhelm an SRQ. On IB,
> this will just generate RNRs and be ok (we configure SRQs to have
> infinite RNRs), but I don't understand the semantics of what will
> happen on iWARP (it may terminate? I sent an off-list question to
> Steve Wise to ask for detail -- we may have other issues with SRQ on
> iWARP if this is the case, but let's skip that discussion for now).
Is it possible to have a sane SRQ implementation without HW flow control?
Anyway, the described problem exists with SRQ right now too: if the
receiver doesn't enter progress for a long time, the sender can overwhelm
an SRQ. I don't see how this can be fixed without a progress thread (and
I am not even sure that this is a problem that has to be fixed).
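
To make the handoff in item 1 concrete, here is a minimal sketch of the
pipe mechanism as I read the description. It is not the actual openib BTL
code: the names cmd_pipe_init(), cm_notify_connected(), progress_poll()
and struct endpoint are hypothetical. The point it illustrates is that
the endpoint stays in its old state until the main thread polls the pipe.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical endpoint type, for illustration only. */
enum ep_state { EP_CONNECTING, EP_CONNECTED };
struct endpoint { int id; enum ep_state state; };

static int cmd_pipe[2];   /* [0] read end (main thread), [1] write end (CM thread) */

/* Create the pipe and make the read end non-blocking so polling never stalls. */
static int cmd_pipe_init(void)
{
    if (pipe(cmd_pipe) != 0) return -1;
    int flags = fcntl(cmd_pipe[0], F_GETFL, 0);
    if (flags < 0) return -1;
    return fcntl(cmd_pipe[0], F_SETFL, flags | O_NONBLOCK);
}

/* Meant to be called from the CM thread: hand the endpoint pointer to the
 * main thread. A pointer-sized write is far below PIPE_BUF, so it is atomic. */
static int cm_notify_connected(struct endpoint *ep)
{
    ssize_t n = write(cmd_pipe[1], &ep, sizeof(ep));
    return n == (ssize_t)sizeof(ep) ? 0 : -1;
}

/* Meant to be called from the main thread's progress loop: drain the pipe
 * and do the housekeeping. Returns the number of endpoints handled. */
static int progress_poll(void)
{
    struct endpoint *ep;
    int handled = 0;
    while (read(cmd_pipe[0], &ep, sizeof(ep)) == (ssize_t)sizeof(ep)) {
        ep->state = EP_CONNECTED;
        handled++;
    }
    return handled;
}
```

If the application never reaches progress_poll(), the endpoint is never
marked connected, which is the same starvation that SRQ already has today.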
> Even if we can get the iWARP semantics to work, this feels kinda
> icky. Perhaps I'm overreacting and this isn't a problem that needs to
> be fixed -- after all, this situation is no different than what
> happens after the initial connection, but it still feels icky.
What is so icky about it? Sender is faster than a receiver so flow control
> 2. The CM progress thread posts its own receive buffers when creating
> a QP (which is a necessary step in both CMs). However, this is
> problematic in two cases:
I don't like 1, 2 and 3. :(
> 4. Have a separate mpool for drawing initial receive buffers for the
> CM-posted RQs. We'd probably want this mpool to be always empty (or
> close to empty) -- it's ok to be slow to allocate / register more
> memory when a new connection request arrives. The memory obtained
> from this mpool should be able to be returned to the "main" mpool
> after it is consumed.
This is slightly better, but still...
> 5. ...?
What about moving the posting of receive buffers into the main thread?
With SRQ it is easy: don't post anything in the CPC thread. The main
thread will prepost buffers automatically after the first fragment is
received on the endpoint (in btl_openib_handle_incoming()). With PPRQ
it's more complicated. What if we prepost dummy buffers (not from the
free list) during the IBCM connection stage and run another three-way
handshake protocol using those buffers, but from the main thread? We
would need to prepost one buffer on the active side and two buffers on
the passive side.
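
The buffer counts can be checked with a toy model. This is only a sketch
under my own assumptions: the handshake is taken to be active -> passive,
passive -> active, active -> passive, and struct peer, deliver() and
run_handshake() are hypothetical names, not openib BTL code. Each message
consumes one preposted dummy buffer on the receiving side.

```c
#include <assert.h>

/* Hypothetical per-side bookkeeping for the toy model. */
struct peer {
    int dummy_recv_posted;   /* dummy buffers preposted during the IBCM stage */
    int msgs_received;       /* handshake messages that found a buffer */
    int unmatched;           /* handshake messages that found no buffer */
};

/* Deliver one handshake message to dst, consuming a dummy buffer if any. */
static void deliver(struct peer *dst)
{
    if (dst->dummy_recv_posted > 0) {
        dst->dummy_recv_posted--;
        dst->msgs_received++;
    } else {
        dst->unmatched++;
    }
}

/* Run the assumed three-way handshake and return how many messages
 * arrived without a preposted buffer (0 means the counts are sufficient). */
static int run_handshake(int active_bufs, int passive_bufs)
{
    struct peer active  = { active_bufs, 0, 0 };
    struct peer passive = { passive_bufs, 0, 0 };

    deliver(&passive);   /* message 1: active -> passive */
    deliver(&active);    /* message 2: passive -> active */
    deliver(&passive);   /* message 3: active -> passive */

    return active.unmatched + passive.unmatched;
}
```

With run_handshake(1, 2) every message finds a buffer; with one buffer on
each side the third message has nowhere to land.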