Sorry for the length of this mail. It's a complex issue. :-\
I did everything needed to enable the IB and RDMA CMs to have their
own progress threads to handle incoming CM traffic (which is important
because both CMs have timeouts for all their communications), and it
seems to be working fine for simple examples. I posted an hg
repository of this work (regularly kept in sync with the trunk):
But in talking to Pasha today, we realized that there are big problems
which will undoubtedly show up when running more than trivial examples.
==> Remember that the goal for this work was to have a separate
progress thread *without* all the heavyweight OMPI thread locks.
Specifically: make it work in a build without
--enable-progress-threads or --enable-mpi-threads (we did some
preliminary testing with that stuff enabled, and it had a big
performance impact).
1. When the CM progress thread completes an incoming connection, it sends
a command down a pipe to the main thread indicating that a new
endpoint is ready to use. The pipe message will be noticed by
opal_progress() in the main thread and will run a function to do all
necessary housekeeping (sets the endpoint state to CONNECTED, etc.).
But it is possible that the receiver process won't dip into the MPI
layer for a long time (and therefore not call opal_progress and the
housekeeping function). Therefore, it is possible that with an active
sender and a slow receiver, the sender can overwhelm an SRQ. On IB,
this will just generate RNRs and be ok (we configure SRQs to have
infinite RNRs), but I don't understand the semantics of what will
happen on iWARP (it may terminate? I sent an off-list question to
Steve Wise to ask for detail -- we may have other issues with SRQ on
iWARP if this is the case, but let's skip that discussion for now).
Even if we can get the iWARP semantics to work, this feels kinda
icky. Perhaps I'm overreacting and this isn't a problem that needs to
be fixed -- after all, this situation is no different than what
happens after the initial connection, but it still feels icky.
2. The CM progress thread posts its own receive buffers when creating
a QP (which is a necessary step in both CMs). However, this is
problematic in two cases:
- If posting to an SRQ, the main thread may also be [re-]posting
to the SRQ at the same time. Those endpoint data structures therefore
need to be protected.
- All receive buffers come from the mpool, and therefore those
data structures need to be protected. Specifically: both threads may
post to the SRQ simultaneously, but the CM will always be the first to
post to a PPRQ. So although there's no race in the PPRQ endpoint data
structures, there is a potential for race issues in the mpool data
structures in both cases.
This is all a problem because we explicitly do not want to enable
*all* the heavyweight threading infrastructure for OMPI. I see a few
options, none of which seem attractive:
1. Somehow make it so that only the mpool and select other portions of OMPI can
have threading/lock support (although this seems like a slippery slope
-- I can foresee implications that would make it completely
meaningless to only have some thread locks enabled and not others).
This is probably the least attractive option.
2. Make the IB and RDMA CM requests be tolerant of timing out (and
just restarting). This is actually a lot of work; for example, the
IBCM CPC would then need to be tolerant of timing out anywhere in its
3-way handshake and starting over again. This could have serious
implications for when a connection will be able to actually complete
if a receiver rarely dips into the MPI layer (much worse than RDMA
CM's 2-way handshake).
3. Have locks around the critical areas described in #2 that can be
enabled without --enable-<foo>-threads support (perhaps disabled at
run time if we're not using a CM progress thread?).
4. Have a separate mpool for drawing initial receive buffers for the
CM-posted RQs. We'd probably want this mpool to be always empty (or
close to empty) -- it's ok to be slow to allocate / register more
memory when a new connection request arrives. The memory obtained
from this mpool should be able to be returned to the "main" mpool
after it is consumed.