Brian W. Barrett wrote:
> All -
> Jeff, Eugene, and I had a long discussion this morning on the sm BTL
> flow management issues and came to a couple of conclusions.
> * Jeff, Eugene, and I are all convinced that Eugene's addition of
> polling the receive queue to drain acks when sends start backing up is
> required for deadlock avoidance.
> * We're also convinced that George's proposal, while a good idea in
> general, is not sufficient. The send path doesn't appear to
> sufficiently progress the btl to avoid the deadlocks we're seeing with
> the SM btl today. Therefore, while I still recommend sizing the fifo
> appropriately and limiting the freelist size, I think it's not
> sufficient to solve all problems.
> * Finally, it took an hour, but we did determine one of the major
> differences between 1.2.8 and 1.3.0 in terms of sm is how messages
> were pulled off the FIFO. In 1.2.8 (and all earlier versions), we
> return from btl_progress after a single message is received (ack or
> message) or the fifo is empty. In 1.3.0 (pre-srq work Eugene did),
> we changed to completely draining all queues before returning from
> btl_progress. This has led to a situation where a single call to
> btl_progress can make a large number of callbacks into the PML
> (900,000 times in one of Eugene's test cases). The change was made to
> resolve an issue Terry was having with performance of a benchmark.
> We've decided that it would be advantageous to try something between
> the two points and drain X number of messages from the queue, then
> return, where X is 100 or so at most. This should cover the
> performance issues Terry saw, but still not cause the huge number of
> messages added to the unexpected queue with a single call to
> MPI_Recv. Since a recv that is matched on the unexpected queue
> doesn't result in a call to opal_progress, this should help balance
> the load a little bit better. Eugene's going to take a stab at
> implementing this short term.
> I think the combination of Eugene's deadlock avoidance fix and the
> careful queue draining should make me comfortable enough to start
> another round of testing, and it at least explains the bottom-line issues.
IMHO, one should never process an unbounded number of elements from any
FIFO/socket/CQ/etc. because doing so risks starving other channels (some
of which might not exist yet at the time the work-without-bound code is
written). So, I think Brian's proposal (drain at most X elements per
call, for 1 < X < inf) is the correct approach, regardless of any of the
other present concerns w.r.t. the sm btl.
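To make the starvation argument concrete, here is a minimal sketch in Python (not Open MPI code). The names `progress`, `progress_all`, and `max_drain` are hypothetical, and a `deque` stands in for a FIFO; the only point is that each call does a bounded amount of work per channel.

```python
from collections import deque


def progress(fifo, handle, max_drain=100):
    """Handle at most max_drain entries from fifo; return how many were handled."""
    handled = 0
    while fifo and handled < max_drain:
        handle(fifo.popleft())
        handled += 1
    return handled


def progress_all(channels, handle, max_drain=100):
    """One progress pass: visit every channel once, with bounded work on each."""
    return [progress(fifo, handle, max_drain) for fifo in channels]


# Illustration only: a backed-up channel next to one holding a single ack.
busy = deque(range(1000))
quiet = deque(["ack"])
seen = []
counts = progress_all([busy, quiet], seen.append, max_drain=100)
```

With the bound in place, the ack on the quiet channel is handled in the same pass, after only 100 entries from the busy channel rather than all 1000.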
In my own non-MPI experience, I have found that selection of such an X
is usually not a big deal - just find a value large enough to
effectively hide the cost of "entry" (analogy: if you hold a mutex the
critical section should be dominated by the work "inside", not the cost
of the lock/unlock operations). Once X is big enough that "entry" is
nominally free, then the type of performance issues I suspect Terry was
seeing will fade away. Beyond that point, further increases in X bring
rapidly diminishing returns in my experience, and risk starving some
other code path.
Crude heuristic: start at X=2 and keep doubling it until the performance
of the benchmark that concerned Terry is within a standard deviation
(the difference is "in the noise") at X and X*2 (or within some other
tolerance of one's choice). Then, of course, use the lower value, X.
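The doubling search can be sketched as follows. This is only an illustration under stated assumptions: `measure` is a hypothetical callable that runs the benchmark with a given drain limit and returns a time (lower is better), `tolerance` stands in for the standard deviation or whatever noise threshold one chooses, and `max_limit` is an added safety cap that is not part of the heuristic itself.

```python
def pick_drain_limit(measure, tolerance, start=2, max_limit=1 << 20):
    """Double X until measure(X) and measure(2*X) differ by at most tolerance.

    Returns the lower of the two values, X. If no such pair is found before
    2*X would exceed max_limit, returns the last X tried.
    """
    x = start
    t_x = measure(x)
    while x * 2 <= max_limit:
        t_2x = measure(x * 2)
        if abs(t_x - t_2x) <= tolerance:
            return x
        x, t_x = x * 2, t_2x
    return x


def toy_cost(x, entry_cost=50.0, per_message=1.0):
    """Assumed cost model, for illustration only: the per-message time is a
    fixed "entry" cost amortized over x messages plus a constant per-message
    cost."""
    return entry_cost / x + per_message


chosen = pick_drain_limit(toy_cost, tolerance=0.1)
```

Under this toy model each doubling halves the remaining entry overhead, which is the diminishing-returns behavior described above. A real benchmark would supply its own measurements and noise estimate.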
P.S. If there are other code paths that process elements without bound,
they probably deserve some scrutiny while this idea is fresh on people's
minds.
Paul H. Hargrove PHHargrove_at_[hidden]
Future Technologies Group Tel: +1-510-495-2352
HPC Research Department Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory