Patrick Geoffray wrote:
Not sure I understand the question. So, maybe we start by being
explicit about what we mean by "directed polling".
Eugene Loh wrote:
1) The work is already done.
How do you do "directed polling" with ANY_TAG?
Currently, the sm BTL has connection-based FIFOs. That is, for each
on-node sender/receiver (directed) pair, there is a FIFO. For a
receiver to receive messages, it needs to check its in-bound FIFOs. It
can check all in-bound FIFOs all the time to discover messages. By
"directed polling", I mean that if the user posts a receive from a
specified source, we poll only the FIFO on which that message would
arrive.
With that in mind, let's go back to your question. If a user posts a
receive with a specified source but a wildcard tag, we go to the
specified FIFO. We check the item on the FIFO's tail. We check if
this item is the one we're looking for. The "ANY_TAG" comes into play
only here, on the matching. It's unrelated to "directed polling",
which has to do only with the source process.
Possibly, you meant to ask how one does directed polling with a
wildcard source (MPI_ANY_SOURCE). If that was your question, the
answer is that we punt. We report failure to the ULP, which reverts to the
standard code path.
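In toy form, the scheme might look like the following Python sketch. All the names and data structures here are invented for illustration; this is not the actual sm BTL code, just a model of the logic described above.

```python
from collections import deque

ANY_TAG = -1      # stands in for MPI_ANY_TAG
ANY_SOURCE = -2   # stands in for MPI_ANY_SOURCE

class Receiver:
    """One in-bound FIFO per on-node sender, as in the
    connection-based sm BTL layout described above."""
    def __init__(self, n_senders):
        self.fifos = [deque() for _ in range(n_senders)]

    def deliver(self, source, tag, payload):
        self.fifos[source].append((tag, payload))

    def directed_poll(self, source, tag):
        """Try to complete a receive "immediately".  Returns the
        payload, or None, meaning: fall back to the standard path."""
        if source == ANY_SOURCE:
            return None              # punt: no single FIFO to watch
        fifo = self.fifos[source]
        if not fifo:
            return None              # nothing pending from that sender
        head_tag, payload = fifo[0]
        # ANY_TAG matters only for matching, not for FIFO selection.
        if tag == ANY_TAG or tag == head_tag:
            fifo.popleft()
            return payload
        return None                  # head of the FIFO is not "the one"

r = Receiver(n_senders=4)
r.deliver(source=2, tag=7, payload="hello")
print(r.directed_poll(ANY_SOURCE, 7))   # None: wildcard source punts
print(r.directed_poll(2, ANY_TAG))      # hello
```

Note that the wildcard tag costs nothing extra here: the FIFO choice depends only on the source, and the tag check happens at the head of that one FIFO.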
One alternative is, of course, the single receiver queue. I agree that
that alternative has many merits. To recap, however, the proposed
optimizations are already "in the bag" (implemented in a workspace) and
address some optimizations that are orthogonal to the "directed
polling" (and single receiver queue) approach. I think there are also
some uncertainties about the single recv queue approach, but I guess
I'll just have to prototype that alternative to explore those
uncertainties.
There are a variety of choices here. Further, I'm afraid we ultimately
have to expose some of those choices to the user (MCA parameters or
the like).
How do you ensure you check all incoming queues from time to time to avoid flow-control stalls (especially if the queues are small for scaling)?
Let's say some congestion is starting to build on some internal OMPI
resource. Arguably, we should do something to start relieving that
congestion. What if then the user code posts a rather specific request
(receive a message with a particular tag on a particular communicator
from a particular source) and with high urgency (blocking request... "I
ain't going anywhere until you give me what I'm asking for"). A good
servant would drop whatever else s/he is doing to oblige the boss.
So, let's say there's a standard MPI_Recv. Let's say there's also some
congestion starting to build. What should the MPI implementation do?
A) If the receive can be completed "immediately", then do so and return
control to the user as soon as possible.
B) If the receive cannot be completed "immediately", fill your wait
time with general housekeeping like relieving congested resources.
C) Figure out what's on the critical path and do it.
At least A should be available for the user. Probably also B, and the
RFC proposal allows for that by rolling over to the traditional code
path when the request cannot be satisfied "immediately". (That said,
there are different definitions of "immediately" and different ways of
implementing all this.)
The definitions I've used for "immediately" include:
*) We know which FIFO to check.
*) The message is the next item on that FIFO.
*) The message is being delivered entirely in one chunk.
I am also going to add a time-out.
One could also mix a little bit of general polling in.
(Unfortunately), there is no end to all the artful tuning one could do.
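Those criteria, plus the time-out, can be modeled in a few lines. This is a toy sketch, not the proposed implementation; the names, the chunk-size constant, and the dict-of-FIFOs layout are all assumptions made up for illustration.

```python
import time
from collections import deque

MAX_CHUNK = 4096   # assumed single-chunk message size; illustrative

def try_recv_immediate(fifos, source, tag, timeout=0.0):
    """Poll the one relevant FIFO until a matching, single-chunk
    message sits at its head or the timeout expires.  None means:
    roll over to the traditional code path."""
    if source is None:
        return None                   # wildcard: we don't know which FIFO
    deadline = time.monotonic() + timeout
    while True:
        fifo = fifos[source]
        if fifo:
            msg_tag, nbytes, payload = fifo[0]
            if msg_tag != tag or nbytes > MAX_CHUNK:
                return None           # head is not "the one", or multi-chunk
            fifo.popleft()
            return payload            # completed "immediately"
        if time.monotonic() >= deadline:
            return None               # timed out; fall back

fifos = {0: deque([(7, 100, "small")]),
         1: deque([(7, 100000, "big")])}
print(try_recv_immediate(fifos, 0, 7))      # small
print(try_recv_immediate(fifos, 1, 7))      # None: needs multiple chunks
print(try_recv_immediate(fifos, None, 7))   # None: unknown FIFO
```

The timeout with a zero default makes option A the baseline; raising it trades latency on the miss path for a better hit rate before rolling over.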
I appreciate Jeff's explanation, but I still don't understand this
100%. The receive side looks to see if it can handle the request
"immediately". It checks to see if the next item on the specified FIFO
is "the one". If it is, it completes the request. If not, it returns
control to the ULP, which rolls over to the traditional code path.
What about the one-sided case that Brian mentioned, where there is no corresponding receive to tell you which queue to poll?
I don't 100% know how to handle the concern you/Brian raise, but I have
the PML passing the flag MCA_PML_OB1_HDR_TYPE_MATCH into the BTL,
saying "this is the kind of message to look for". Does this address
the concern? The intent is that if it encounters something it doesn't
know how to handle, it reverts to the traditional receive code path.
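The intended behavior can be sketched as a header-type dispatch. Only the name HDR_TYPE_MATCH below is meant to echo the real MCA_PML_OB1_HDR_TYPE_MATCH constant; everything else is invented for illustration.

```python
HDR_TYPE_MATCH = 1    # an ordinary matched-send fragment
HDR_TYPE_PUT   = 2    # e.g. one-sided traffic (hypothetical value)

def fast_path_poll(fifo, want_tag):
    """Handle only MATCH-type headers; on anything else report
    'revert' so the caller falls back to the traditional path."""
    if not fifo:
        return ("empty", None)
    hdr_type, tag, payload = fifo[0]
    if hdr_type != HDR_TYPE_MATCH or tag != want_tag:
        return ("revert", None)   # not the kind of message we look for
    fifo.pop(0)
    return ("done", payload)

print(fast_path_poll([(HDR_TYPE_PUT, 7, b"x")], 7))    # ('revert', None)
print(fast_path_poll([(HDR_TYPE_MATCH, 7, "m")], 7))   # ('done', 'm')
```

So a one-sided fragment at the head of the FIFO never gets consumed by the fast path; it simply forces the revert, and the traditional path deals with it.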
Again, important speedups appear to be achievable if one bypasses the
PML receive-request data structure. So, we're talking about
optimizations that are orthogonal to the single-queue issue.
If you want to handle all the constraints, a single-queue model is much less work in the end, IMHO.
Right. Very attractive. I'm not ruling out the single-queue model.
2) The single-queue model addresses only one of the RFC's issues.
The single-queue model addresses not only the latency overhead when
scaling, but also the exploding memory footprint.
Yes, and you could toss the receive-side optimizations as well. So,
one could say, "Our np=2 latency remains 2x slower than Scali's, but at
least we no longer have that hideous scaling with large np." Maybe
that's where we want to end up.
In my experience, the linear overhead of polling N queues very quickly
becomes greater than all the optimizations you can do on the send side.
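As a back-of-the-envelope model of that linear overhead (pure toy code, nothing OMPI-specific), count the probes an undirected receiver makes when it must sweep per-peer FIFOs to find the one holding the message:

```python
from collections import deque

def polls_until_found(fifos, msg_fifo):
    """Count FIFO probes a round-robin receiver makes before reaching
    the FIFO (index msg_fifo) that actually holds the message."""
    probes = 0
    for i, _fifo in enumerate(fifos):
        probes += 1
        if i == msg_fifo:
            return probes
    return probes

# With N per-peer FIFOs, an undirected receiver probes O(N) queues per
# message in the worst case; a single shared queue always probes one.
for n in (4, 64, 1024):
    fifos = [deque() for _ in range(n)]
    print(n, polls_until_found(fifos, n - 1))   # worst case: n probes
```

Since each probe touches a separate cache line in shared memory, the sweep cost grows with np no matter how cheap each individual send becomes, which is the scaling argument for the single-queue model.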