Eugene Loh wrote:
> Possibly, you meant to ask how one does directed polling with a wildcard
> source MPI_ANY_SOURCE. If that was your question, the answer is we
> punt. We report failure to the ULP, which reverts to the standard code
Sorry, I meant ANY_SOURCE. If you poll only the queue that corresponds to
a posted receive, you only optimize micro-benchmarks, until they start
using ANY_SOURCE. So, is recvi() a one-time shot? I.e., do you poll
the right queue only once and, if that fails, fall back on polling
all queues? If yes, then it's unobtrusive, but I don't think it would
help much. If you poll the right queue many times, then you have to
decide when to fall back on polling all queues, and that's not trivial.
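
To make the question concrete, here is a minimal single-threaded sketch in C
of "directed polling with a fallback". It is not the actual recvi() code: the
names (fifo_t, poll_directed, NQUEUES, QCAP) and the max_directed knob are
hypothetical. With max_directed = 1 you get the one-shot behaviour; anything
larger is the "when do I give up?" heuristic.

```c
#include <assert.h>
#include <stdbool.h>

#define NQUEUES 4   /* hypothetical: one incoming FIFO per local peer */
#define QCAP    8   /* hypothetical per-queue capacity (power of two) */

typedef struct {
    int      slots[QCAP];
    unsigned head;  /* next slot to read  */
    unsigned tail;  /* next slot to write */
} fifo_t;

/* Sender side: returns false when the queue is full (sender would stall). */
static bool fifo_push(fifo_t *q, int msg) {
    if (q->tail - q->head == QCAP) return false;
    q->slots[q->tail % QCAP] = msg;
    q->tail++;
    return true;
}

/* Receiver side: returns false when the queue is empty. */
static bool fifo_pop(fifo_t *q, int *msg) {
    if (q->head == q->tail) return false;
    *msg = q->slots[q->head % QCAP];
    q->head++;
    return true;
}

/* Poll the expected queue up to max_directed times, then sweep every queue
 * once. Returns the index of the queue that produced a message, or -1. */
static int poll_directed(fifo_t queues[], int expected, int max_directed,
                         int *msg) {
    assert(expected >= 0 && expected < NQUEUES);
    for (int i = 0; i < max_directed; i++)
        if (fifo_pop(&queues[expected], msg)) return expected;
    for (int q = 0; q < NQUEUES; q++)       /* fallback: poll all queues */
        if (fifo_pop(&queues[q], msg)) return q;
    return -1;
}
```

Whatever value max_directed takes, every directed attempt that misses is time
during which the other queues are not being drained.
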
>> How do you ensure you check all incoming queues from time to time to prevent flow control (especially if the queues are small for scaling)?
> There are a variety of choices here. Further, I'm afraid we ultimately
> have to expose some of those choices to the user (MCA parameters or
In the vast majority of cases, users don't know how to turn the knobs.
The problem is that with local np going up, queue sizes will go down
fast (square root), and you will have to poll all queues more often.
Using more memory for queues just pushes the scalability wall a little
> congestion. What if then the user code posts a rather specific request
> (receive a message with a particular tag on a particular communicator
> from a particular source) and with high urgency (blocking request... "I
> ain't going anywhere until you give me what I'm asking for"). A good
> servant would drop whatever else s/he is doing to oblige the boss.
If you poll only one queue, then stuff can pile up on another and a
sender is now blocked. At best, you have a synchronization point. At
worst, a deadlock.
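
A toy illustration of that pile-up, as a single-threaded C sketch under
made-up assumptions: two incoming queues of capacity QCAP, a receiver that is
waiting on queue 0, and a peer that keeps sending on queue 1. All names are
hypothetical, and only the fill counters are modelled, not the payloads.

```c
#include <assert.h>
#include <stdbool.h>

#define QCAP 4  /* hypothetical per-queue capacity */

typedef struct { unsigned head, tail; } ring_t;  /* counters only */

static bool ring_push(ring_t *q) {
    if (q->tail - q->head == QCAP) return false;  /* full */
    q->tail++;
    return true;
}

static bool ring_pop(ring_t *q) {
    if (q->head == q->tail) return false;         /* empty */
    q->head++;
    return true;
}

/* The peer on queue 1 attempts `attempts` sends while the receiver waits
 * on queue 0. Returns how many sends were refused because queue 1 was full.
 * With sweep_all set, the receiver also drains queue 1 on every iteration. */
static int blocked_sends(int attempts, bool sweep_all) {
    ring_t q[2] = { {0, 0}, {0, 0} };
    int blocked = 0;
    assert(attempts >= 0);
    for (int i = 0; i < attempts; i++) {
        if (!ring_push(&q[1])) blocked++;   /* sender on the other queue */
        ring_pop(&q[0]);                    /* directed poll: queue 0 only */
        if (sweep_all) ring_pop(&q[1]);     /* occasional full sweep */
    }
    return blocked;
}
```

With these numbers, blocked_sends(10, false) refuses 6 of the 10 sends (the
queue fills after 4), while blocked_sends(10, true) refuses none. If the
message the receiver is waiting for depends on that stalled sender making
progress, the stall becomes a deadlock.
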
> So, let's say there's a standard MPI_Recv. Let's say there's also some
> congestion starting to build. What should the MPI implementation do?
The MPI implementation cannot trust the user/app to indicate where the
messages will come from. So, if you have N incoming queues, you need to
poll them all eventually. If you do, polling time increases linearly with N. If
you try to limit the polling space with some heuristic (like polling only the
queue corresponding to the current blocking receive), then you run the
risk of not draining another queue fast enough. And usually, the
heuristics quickly fall apart (ANY_SOURCE, multiple asynchronous
Really, only a single queue solves that.
> Yes, and you could toss the receive-side optimizations as well. So, one
> could say, "Our np=2 latency remains 2x slower than Scali's, but at
> least we no longer have that hideous scaling with large np." Maybe
> that's where we want to end up.
I think all optimizations except recvi() are fine and worth using. I am
just saying that the recvi() optimization is dubious as it stands, and that a
single queue is potentially bigger low-hanging fruit on the recv side: it
could still be fast (a spinlock or atomics to manage the shared receive queue)
to get lower np=2 latency, and it would scale well with large np. No
tuning needed, no special cases, smaller memory footprint.
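
For what it's worth, a rough sketch of what I mean by a single shared receive
queue, in C11. This is an illustration under my own assumptions, not existing
Open MPI code: the names (shared_queue_t, shq_send, shq_recv, SHQ_CAP) are
made up, the queue is guarded by a simple atomic_flag spinlock, and every
sender tags its message with its rank so the receiver can still match on
source after dequeuing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SHQ_CAP 64  /* hypothetical capacity, shared by all senders */

typedef struct { int src; int payload; } msg_t;

typedef struct {
    atomic_flag lock;           /* spinlock protecting head/tail/slots */
    msg_t       slots[SHQ_CAP];
    unsigned    head, tail;
} shared_queue_t;

static void shq_init(shared_queue_t *q) {
    atomic_flag_clear(&q->lock);
    q->head = q->tail = 0;
}

static void shq_lock(shared_queue_t *q) {
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;  /* spin until the flag was previously clear */
}

static void shq_unlock(shared_queue_t *q) {
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Any sender enqueues here; returns false if the queue is full. */
static bool shq_send(shared_queue_t *q, int src, int payload) {
    bool ok = false;
    shq_lock(q);
    if (q->tail - q->head < SHQ_CAP) {
        q->slots[q->tail % SHQ_CAP] = (msg_t){ src, payload };
        q->tail++;
        ok = true;
    }
    shq_unlock(q);
    return ok;
}

/* The receiver polls one place, whatever np is; false if empty. */
static bool shq_recv(shared_queue_t *q, msg_t *out) {
    bool ok = false;
    assert(out != NULL);
    shq_lock(q);
    if (q->head != q->tail) {
        *out = q->slots[q->head % SHQ_CAP];
        q->head++;
        ok = true;
    }
    shq_unlock(q);
    return ok;
}
```

The receive path costs one lock acquisition and one check regardless of how
many local peers there are, and ANY_SOURCE needs no special case. The price is
contention on the lock when many senders hit the same receiver at once; how
that compares with polling N queues would have to be measured.
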
I will leave it at that; this is just some input.