First, the performance improvements look really nice.
A few questions:
  - How much of an abstraction violation does this introduce?  This looks like the BTL needs to start "knowing" about MPI-level semantics.  Currently, the BTL is purposefully ULP-agnostic.  I ask for two reasons:
       - You mention having the BTL look at the match header (if I understood correctly).
       - It is not clear to me what you mean by returning the header to the list if the irecv does not complete.  If it does not complete, why not just pass the header back for further processing, since all of this is happening at the PML level?
  - The measurements seem to be very specific to two-process runs.  Have you looked at the impact of these changes on other applications at the same process count?  "Real" apps would be interesting, but even HPL would be a good start.
  The current sm implementation is aimed only at small SMP node counts, which were really the only relevant systems when this code was written five years ago.  For large core counts there is a rather simple change, easy to implement, that will give you flat scaling for the sort of tests you are running.  If you replace the FIFOs with a single linked list per process in shared memory, with senders adding match envelopes to that process's list atomically and each process reading its own list (multiple writers and a single reader in the non-threaded situation), there will be only one place to poll, regardless of the number of procs involved in the run; a sketch of such a list follows below.  One still needs other optimizations to lower the absolute latency, perhaps what you have suggested.  If one really has all N procs trying to write to the same list at once, performance will stink because of contention, but most apps don't have that behaviour.
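
  For concreteness, here is a rough sketch of what I mean.  It is illustrative only: the names are made up, C11 atomics stand in for the OPAL atomic primitives, and a real shared-memory version would use offsets rather than raw pointers.

    #include <stdatomic.h>
    #include <stddef.h>

    /* One match envelope, pushed by a sender into the receiver's list. */
    typedef struct match_envelope {
        struct match_envelope *next;  /* link, used only inside the list */
        int                    src;   /* sender rank                     */
        int                    tag;   /* MPI tag from the match header   */
        size_t                 len;   /* payload length                  */
    } match_envelope_t;

    /* One list per process in shared memory: many writers, one reader. */
    typedef struct {
        _Atomic(match_envelope_t *) head;  /* newest envelope, or NULL */
    } recv_list_t;

    /* Sender side: atomically push one envelope (lock-free, LIFO order). */
    static void recv_list_push(recv_list_t *list, match_envelope_t *env)
    {
        match_envelope_t *old = atomic_load_explicit(&list->head,
                                                     memory_order_relaxed);
        do {
            env->next = old;
        } while (!atomic_compare_exchange_weak_explicit(
                     &list->head, &old, env,
                     memory_order_release, memory_order_relaxed));
    }

    /* Receiver side: detach everything pushed so far in one shot, so there
     * is a single place to poll no matter how many senders exist.  The
     * returned chain is newest-first; reverse it if arrival order matters. */
    static match_envelope_t *recv_list_pop_all(recv_list_t *list)
    {
        return atomic_exchange_explicit(&list->head, NULL,
                                        memory_order_acquire);
    }

  Each push is a single compare-and-swap, and the receiver polls one location instead of one FIFO per peer.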

Rich


On 1/17/09 1:48 AM, "Eugene Loh" <Eugene.Loh@sun.com> wrote:

RFC: sm Latency

WHAT:  Introducing optimizations to reduce ping-pong latencies over the sm BTL.

WHY:  Ping-pong latency is a highly visible benchmark of MPI performance.  We can improve shared-memory latencies by anywhere from 30% (if hardware latency is the limiting factor) to 2× or more (if MPI software overhead is the limiting factor).  At high process counts, the improvement can be 10× or more.

WHERE:  Somewhat in the sm BTL, but very importantly also in the PML.  Changes can be seen in ssh://www.open-mpi.org/~tdd/hg/fastpath.

WHEN:  Upon acceptance.  In time for OMPI 1.4.

TIMEOUT:  February 6, 2009.

This RFC is being submitted by eugene.loh@sun.com.
WHY (details)

The sm BTL typically has the lowest hardware latencies of any BTL.  Therefore, any OMPI software overhead we otherwise tolerate becomes glaringly obvious in sm latency measurements.

In particular, MPI pingpong latencies are oft-cited performance benchmarks, popular indications of the quality of an MPI implementation.  Competitive vendor MPIs optimize this metric aggressively, both for np=2 pingpongs and for pairwise pingpongs at high np (as in the popular HPCC performance test suite).

Performance reported by HPCC includes such latency measurements.  The slowdown of latency as the number of sm connections grows becomes increasingly important on large SMPs and ever more prevalent many-core nodes.

Other MPI implementations, such as Scali and Sun HPC ClusterTools 6, introduced such optimizations years ago.

Performance measurements indicate that the speedups we can expect in OMPI with these optimizations range from 30% (np=2 measurements where hardware is the bottleneck) to 2× (np=2 measurements where software is the bottleneck) to over 10× (large np).
WHAT (details)

Introduce an optimized "fast path" for "immediate" sends and receives.  Several actions are recommended here.

1.  Invoke the sm BTL sendi (send-immediate) function

Each BTL is allowed to define a "send immediate" (sendi) function.  A BTL is not required to do so, however, in which case the PML calls the standard BTL send function.
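
For orientation, a minimal self-contained sketch of this dispatch follows.  The types and the sendi argument list here are simplified stand-ins, not the actual OMPI prototypes:

    #include <stddef.h>

    #define SKETCH_SUCCESS  0   /* send completed immediately            */
    #define SKETCH_FALLBACK 1   /* BTL could not complete it; use send() */

    /* Simplified stand-in for a BTL module: sendi is optional, send is not. */
    typedef struct btl_module {
        int (*btl_sendi)(struct btl_module *btl, const void *hdr, size_t hdr_len,
                         const void *payload, size_t payload_len);
        int (*btl_send) (struct btl_module *btl, const void *hdr, size_t hdr_len,
                         const void *payload, size_t payload_len);
    } btl_module_t;

    /* PML-side dispatch: try the immediate path when the BTL provides one,
     * otherwise (or if it cannot complete) fall back to the standard send. */
    static int pml_start_send(btl_module_t *btl,
                              const void *hdr, size_t hdr_len,
                              const void *payload, size_t payload_len)
    {
        if (NULL != btl->btl_sendi) {
            int rc = btl->btl_sendi(btl, hdr, hdr_len, payload, payload_len);
            if (SKETCH_SUCCESS == rc) {
                return rc;               /* fast path: done, nothing queued */
            }
            /* otherwise fall through to the regular send path */
        }
        return btl->btl_send(btl, hdr, hdr_len, payload, payload_len);
    }

The actual BTL interface carries more arguments (convertor, tag, ordering, flags, and an output descriptor for the fallback path), but the control flow is the same idea.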

A sendi function has already been written for sm, but it has not been used due to insufficient testing.

The function should be reviewed, commented back in, tested, and used.

The changes are: