RFC: sm Latency
WHAT: Introducing optimizations to reduce ping-pong
latencies over the sm BTL.
WHY: This is a visible benchmark of MPI performance.
We can improve shared-memory latencies by anywhere from 30% (if hardware
latency is the limiting factor) to 2× or more (if MPI
software overhead is the limiting factor). At high process
counts, the improvement can be 10× or more.
WHERE: Somewhat in the sm BTL, but very importantly
also in the PML. Changes can be seen in
ssh://www.open-mpi.org/~tdd/hg/fastpath.
WHEN: Upon acceptance. In time for OMPI 1.4.
TIMEOUT: February 6, 2009.
This RFC is being submitted by
eugene.loh@sun.com.
WHY (details)
The sm BTL typically has the lowest hardware latencies
of any BTL. Therefore, any OMPI software overhead we otherwise
tolerate becomes glaringly obvious in sm latency measurements.
In particular, MPI pingpong latencies are oft-cited performance
benchmarks, popular indications of the quality of an MPI implementation.
Competitive vendor MPIs optimize this metric aggressively, both
for np=2 pingpongs and for pairwise pingpongs for high
np (like the popular HPCC performance test suite).
Performance metrics reported by HPCC include:
- MPI_Send()/MPI_Recv() pingpong latency.
- MPI_Send()/MPI_Recv() pingpong latency
as the number of connections grows.
- MPI_Sendrecv() latency.
The slowdown of latency as the number of sm connections
grows becomes increasingly important on large SMPs and ever more
prevalent many-core nodes.
Other MPI implementations, such as Scali and Sun HPC ClusterTools 6,
introduced such optimizations years ago.
Performance measurements indicate that the speedups we can expect
in OMPI with these optimizations range from 30% (np=2
measurements where hardware is the bottleneck) to 2×
(np=2 measurements where software is the bottleneck) to
over 10× (large np).
WHAT (details)
Introduce an optimized "fast path" for "immediate" sends and receives.
Several actions are recommended here.
1. Invoke the sm BTL sendi (send-immediate) function
Each BTL is allowed to define a "send immediate" (sendi)
function. A BTL is not required to do so, however, in which case
the PML calls the standard BTL send function.
A sendi function has already been written for sm,
but it has not been used due to insufficient testing.
The function should be reviewed, uncommented, tested, and used.
The changes are:
- File: ompi/mca/btl/sm/btl_sm.c
Declaration/Definition: mca_btl_sm
Uncomment the mca_btl_sm_sendi symbol
in place of the NULL placeholder so that the
already existing sendi function will be discovered
and used by the PML.
Function: mca_btl_sm_sendi()
Review the existing sm sendi code.
My suggestions include:
- Drop the test against the eager limit since the PML
calls this function only when the eager limit is
respected.
- Make sure the function has no side effects in the
case where it does not complete. See
Open Issues, the final
section of this document, for further discussion
of "side effects".
Mostly, I have reviewed the code and believe it's
already suitable for use.
2. Move the sendi call up higher in the PML
Profiling pingpong tests, we find that not so much time is spent
in the sm BTL. Rather, the PML consumes a lot of time
preparing a "send request". While these complex data structures
are needed to track progress of a long message that will be sent
in multiple chunks and progressed over multiple entries to and
exits from the MPI library, managing this large data structure
for an "immediate" send (one chunk, one call) is overkill. Latency
can be reduced noticeably if one bypasses this data structure.
This means invoking the sendi function as early as
possible in the PML.
The changes are:
- File: ompi/mca/pml/ob1/pml_ob1_isend.c
Function: mca_pml_ob1_send()
As soon as we enter the PML send function, try to call
the BTL sendi function. If this fails for whatever
reason, continue with the traditional PML send code path.
If it succeeds, then exit the PML and return up to the calling
layer without having to have wrestled with the PML send-request
data structure.
For better software management, the attempt to find and
use a BTL sendi function can be organized into
a new mca_pml_ob1_sendi() function.
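The "try sendi first, otherwise fall back" control flow can be sketched as follows. This is a minimal illustration, not the actual ob1 code: every toy_* name, the struct layout, and the convention that 0 means "sent in full" are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real BTL types. */
typedef struct toy_btl toy_btl_t;
typedef int (*toy_sendi_fn_t)(toy_btl_t *btl, const void *buf, size_t len);

struct toy_btl {
    toy_sendi_fn_t btl_sendi;  /* NULL when the BTL defines no sendi */
    size_t eager_limit;        /* largest message the fast path accepts */
};

enum toy_path { TOY_FAST_PATH, TOY_SLOW_PATH };

/* Toy sendi implementations: one always completes, one never does. */
static int toy_sendi_ok(toy_btl_t *btl, const void *buf, size_t len)
{
    (void)btl; (void)buf; (void)len;
    return 0;   /* 0 = message sent in full */
}

static int toy_sendi_busy(toy_btl_t *btl, const void *buf, size_t len)
{
    (void)btl; (void)buf; (void)len;
    return -1;  /* nonzero = nothing sent, no side effects */
}

/* Stand-in for the traditional path that builds a full send request. */
static int toy_send_with_request(toy_btl_t *btl, const void *buf, size_t len)
{
    (void)btl; (void)buf; (void)len;
    return 0;
}

/* Try sendi on entry; fall back to the request-based path if the BTL
 * has no sendi, the message exceeds the eager limit, or sendi fails. */
static int toy_pml_send(toy_btl_t *btl, const void *buf, size_t len,
                        enum toy_path *path)
{
    assert(path != NULL);
    if (btl->btl_sendi != NULL && len <= btl->eager_limit &&
        btl->btl_sendi(btl, buf, len) == 0) {
        *path = TOY_FAST_PATH;   /* no send request was ever built */
        return 0;
    }
    *path = TOY_SLOW_PATH;
    return toy_send_with_request(btl, buf, len);
}
```

The path output parameter exists only so the example can show which branch ran; a real PML would simply return.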
- File: ompi/mca/pml/ob1/pml_ob1_sendreq.c
Function: mca_pml_ob1_send_request_start_copy()
Remove this attempt to call the BTL sendi function,
since we've already tried to do so higher up in the PML.
3. Introduce a BTL recvi call
While optimizing the send side of a pingpong operation is helpful,
it is less than half the job. At least as many savings are possible
on the receive side.
Corresponding to what we've done on the send side, on the receive
side we can attempt, as soon as we've entered the PML, to call a
BTL recvi (receive-immediate) function, bypassing the
creation of a complex "receive request" data structure that is
not needed if the receive can be completed immediately.
Further, we can perform directed polling. OMPI pingpong latencies
grow significantly as the number of sm connections increases,
while competitors (Scali, in any case) show absolutely flat latencies
with increasing np. The recvi function could check
one connection for the specified receive and exit quickly if that
message is found.
A BTL is granted considerable latitude in the proposed recvi
functions. The principal requirement is that the recvi
either completes the specified receive completely or else
behaves as if the function had not been called at all. (That is, one should
be able to revert to the traditional code path without having to worry
about any recvi side effects. So, for example, if the
recvi function encounters any fragments being returned to
the process, it is permitted to return those fragments to the free list.)
While those are the "hard requirements" for recvi, there
are also some loose guidelines. Mostly, it is understood that
recvi should return "quickly" (a loose term to be interpreted
by the BTL). If recvi can quickly complete the specified
receive, great! If not, it should return control to the PML, which
can then execute the traditional code path, which can handle long
messages (multiple chunks, multiple entries into the MPI library)
and execute other "progress" functions.
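The all-or-nothing contract and directed polling can be illustrated with a small sketch. It assumes a deliberately simplified model (one pending fragment per peer, matching on tag only); the toy_* names and layout are hypothetical and are not the sm BTL's real data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_PEERS 4

/* Hypothetical fragment: at most one pending message per peer. */
typedef struct {
    int    valid;        /* nonzero if a fragment is waiting */
    int    tag;
    size_t len;
    char   payload[64];
} toy_frag_t;

typedef struct {
    toy_frag_t inbox[TOY_MAX_PEERS];
} toy_btl_t;

/* Returns 1 if the receive completed in full, 0 if nothing happened.
 * On a 0 return, no state has been modified, so the caller can take
 * the traditional receive path as if recvi had never been called. */
static int toy_recvi(toy_btl_t *btl, int src, int tag,
                     void *buf, size_t buflen, size_t *received)
{
    toy_frag_t *frag;

    assert(received != NULL);
    if (src < 0 || src >= TOY_MAX_PEERS) {
        return 0;
    }
    frag = &btl->inbox[src];   /* directed polling: only this peer */
    if (!frag->valid || frag->tag != tag || frag->len > buflen) {
        return 0;              /* not ours: leave everything untouched */
    }
    memcpy(buf, frag->payload, frag->len);
    *received = frag->len;
    frag->valid = 0;           /* consume only once the copy is done */
    return 1;
}
```

Because only one peer's slot is examined, the cost of a successful call in this model does not grow with the number of connections.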
The changes are:
- File: ompi/mca/btl/btl.h
In this file, we add a typedef declaration
for what a generic recvi should look like:
typedef int (*mca_btl_base_module_recvi_fn_t)();
We also add a btl_recvi field so that a BTL
can register its recvi function, if any.
- File:
ompi/mca/btl/elan/btl_elan.c
ompi/mca/btl/gm/btl_gm.c
ompi/mca/btl/mx/btl_mx.c
ompi/mca/btl/ofud/btl_ofud.c
ompi/mca/btl/openib/btl_openib.c
ompi/mca/btl/portals/btl_portals.c
ompi/mca/btl/sctp/btl_sctp.c
ompi/mca/btl/self/btl_self.c
ompi/mca/btl/sm/btl_sm.c
ompi/mca/btl/tcp/btl_tcp.c
ompi/mca/btl/template/btl_template.c
ompi/mca/btl/udapl/btl_udapl.c
Each BTL must add a recvi field to its module.
In most cases, BTLs will not define a
recvi function, and the field will be set to
NULL.
- File: ompi/mca/btl/sm/btl_sm.c
Function: mca_btl_sm_recvi()
For the sm BTL, we set the
field to the name of the BTL's recvi
function: mca_btl_sm_recvi.
We also add code to define the behavior of the function.
- File: ompi/mca/btl/sm/btl_sm.h
Prototype: mca_btl_sm_recvi()
We also add a prototype for the new function.
- File: ompi/mca/pml/ob1/pml_ob1_irecv.c
Function: mca_pml_ob1_recv()
As soon as we enter the PML, we try to find and use
a BTL's recvi function. If we succeed, we
can exit the PML without having invoked the heavy-duty
PML receive-request data structure. If we fail, we
simply revert to the traditional PML receive code path,
without having to worry about any side effects that the
failed recvi might have left.
It is helpful to contain the recvi attempt
in a new mca_pml_ob1_recvi() function, which
we add.
- File: ompi/class/ompi_fifo.h
Function: ompi_fifo_probe_tail()
We don't want recvi to leave any side effects
if it encounters a message it is not prepared to handle.
We therefore need to be able to see what is on a FIFO
without popping that entry off the FIFO, so we
add this new function, which probes the FIFO without
disturbing it.
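The difference between probing and popping can be shown with a toy ring buffer. This is a single-threaded sketch under assumed names (toy_fifo_*); it is not the ompi_fifo implementation and ignores the shared-memory synchronization a real FIFO needs.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FIFO_SIZE 8

/* Hypothetical fixed-size FIFO of pointers. */
typedef struct {
    void  *slots[TOY_FIFO_SIZE];
    size_t head;   /* count of items ever pushed */
    size_t tail;   /* count of items ever popped */
} toy_fifo_t;

static void toy_fifo_init(toy_fifo_t *f)
{
    f->head = 0;
    f->tail = 0;
}

/* Returns 0 on success, -1 if the FIFO is full. */
static int toy_fifo_push(toy_fifo_t *f, void *item)
{
    assert(item != NULL);   /* NULL is reserved to mean "empty" */
    if (f->head - f->tail == TOY_FIFO_SIZE) {
        return -1;
    }
    f->slots[f->head % TOY_FIFO_SIZE] = item;
    f->head++;
    return 0;
}

/* Look at the oldest entry without removing it; NULL if empty. */
static void *toy_fifo_probe_tail(const toy_fifo_t *f)
{
    if (f->head == f->tail) {
        return NULL;
    }
    return f->slots[f->tail % TOY_FIFO_SIZE];
}

/* Remove and return the oldest entry; NULL if empty. */
static void *toy_fifo_pop_tail(toy_fifo_t *f)
{
    void *item = toy_fifo_probe_tail(f);
    if (item != NULL) {
        f->tail++;
    }
    return item;
}
```

A recvi built on such a FIFO would probe first, inspect the entry, and pop only if it can complete the receive.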
4. Introduce an "immediate" data convertor
One of our aims here is to reduce
latency by bypassing expensive PML send and receive request data
structures. Again, these structures are useful when we intend
to complete a message over multiple chunks and multiple MPI
library invocations, but are overkill for a message that can be
completed all at once.
The same is true of data convertors. Convertors pack user data
into shared-memory buffers or unpack them on the receive side.
Convertors allow a message to be sent in multiple chunks, over
the course of multiple unrelated MPI calls, and for noncontiguous
datatypes. These sophisticated data structures are overkill in
some important cases, such as messages that are handled in a
single chunk and in a single MPI call and consist of a single
contiguous block of data.
While data convertors are not typically too expensive, for
shared-memory latency, where all other costs have been pared back
to a minimum, convertors become noticeable -- around 10%.
Therefore, we recognize special cases where we can have barebones,
minimal, data convertors. In these cases, we initialize the
convertor structure minimally -- e.g., a buffer address, a
number of bytes to copy, and a flag indicating that all other
fields are uninitialized. If this is not possible (e.g., because
a non-contiguous user-derived datatype is being used), the
"immediate" send or receive uses data convertors normally.
The changes are:
- File: ompi/datatype/convertor.h
First, we add to the convertor flags a new flag
#define CONVERTOR_IMMEDIATE 0x10000000
to identify a data convertor that has been initialized
only minimally.
Further, we add three new functions:
- ompi_convertor_immediate():
try to form an "immediate" convertor
- ompi_convertor_immediate_pack():
use an "immediate" convertor to pack
- ompi_convertor_immediate_unpack():
use an "immediate" convertor to unpack
- File: ompi/mca/btl/sm/btl_sm.c
Function: mca_btl_sm_sendi and
mca_btl_sm_recvi
Use the "immediate" convertor routines to pack/unpack.
- File: ompi/mca/pml/ob1/pml_ob1_isend.c and
ompi/mca/pml/ob1/pml_ob1_irecv.c
Have the PML fast path try to construct an "immediate"
convertor.
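A minimally initialized convertor might look like the sketch below. Only the CONVERTOR_IMMEDIATE flag value comes from this RFC (an unsigned suffix is added here); the struct layout, the toy_* function names, their signatures, and the return conventions are assumptions for illustration and may differ from the proposed ompi_convertor_immediate*() routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CONVERTOR_IMMEDIATE 0x10000000u  /* flag value from this RFC */

/* Hypothetical convertor: only these fields are valid when the
 * CONVERTOR_IMMEDIATE flag is set. */
typedef struct {
    uint32_t    flags;
    const void *base;     /* start of the contiguous user buffer */
    size_t      nbytes;   /* number of bytes to copy */
} toy_convertor_t;

/* Try to form an "immediate" convertor. Returns 1 on success, 0 if the
 * datatype is not contiguous and the normal convertor must be used. */
static int toy_convertor_immediate(toy_convertor_t *c, const void *buf,
                                   size_t count, size_t elem_size,
                                   int contiguous)
{
    assert(c != NULL);
    if (!contiguous) {
        c->flags = 0;
        return 0;
    }
    c->flags  = CONVERTOR_IMMEDIATE;
    c->base   = buf;
    c->nbytes = count * elem_size;
    return 1;
}

/* Pack with a single memcpy. Returns 0 on success, -1 if the convertor
 * is not "immediate" or the destination is too small. */
static int toy_convertor_immediate_pack(const toy_convertor_t *c,
                                        void *dst, size_t dst_len,
                                        size_t *packed)
{
    assert(packed != NULL);
    if (!(c->flags & CONVERTOR_IMMEDIATE) || c->nbytes > dst_len) {
        return -1;
    }
    memcpy(dst, c->base, c->nbytes);
    *packed = c->nbytes;
    return 0;
}
```

An unpack routine would be the mirror image, copying from the shared-memory buffer into the user buffer.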
5. Introduce an "immediate" MPI_Sendrecv()
The optimizations described here should be extended to
MPI_Sendrecv() operations. In particular, while
MPI_Send() and MPI_Recv() optimizations
improve HPCC "pingpong" latencies, we need MPI_Sendrecv()
optimizations to improve HPCC "ring" latencies.
One challenge is the current OMPI MPI/PML interface. Today,
the OMPI MPI layer breaks a Sendrecv call up into
Irecv/Send/Wait. This would seem
to defeat fast-path optimizations at least for the receive.
Some options include:
- allow the MPI layer to call "fast path" operations
- have the PML layer provide a Sendrecv interface
- have the MPI layer emit Isend/Recv/Wait
and see how effectively one can optimize the Isend
operation in the PML for the "immediate" case
Performance Measurements: Before Optimization
More measurements are desirable, but here is a sampling of data
from the platforms that I happened to have access to.
These data characterize OMPI today, without fast-path optimizations.
OMPI versus Other MPIs
Here are pingpong latencies, in µsec, measured with
the OSU latency test for 0 and 8 bytes.
         0-byte   8-byte
OMPI      0.74     0.84    µsec
MPICH     0.70     0.77
We see OMPI lagging MPICH.
Scali and HP MPI are presumably considerably faster,
but I have no recent data.
Among other things, one can see that there is about a 10%
penalty for invoking data convertors.
Scaling with Process Count
Here are HPCC pingpong latencies from a different, older,
platform. Though only two processes participate in the pingpong,
the HPCC test reports that latency for different numbers of
processes in the job. We see that OMPI performance slows
dramatically as the number of processes is increased.
Scali (data not available) does not show such a slowdown.
np      min      avg      max
 2    2.688    2.719    2.750   µsec
 4    2.812    2.875    3.000
 6    2.875    3.050    3.250
 8    2.875    3.299    3.625
10    2.875    3.447    3.812
12    3.063    3.687    4.375
16    2.687    4.093    5.063
20    2.812    4.492    6.000
24    3.125    5.026    6.562
28    3.250    5.326    7.250
32    3.500    5.830    8.375
36    3.750    6.199    8.938
40    4.062    6.753   10.187
The data show large min-max variations in latency. These variations
happen to depend on sender and receiver ranks. Here are latencies
(rounded down to the nearest µsec) for the np=40 case
as a function of sender and receiver rank:
--------- rank of one process ----------->
- 9 9 9 9 9 9 9 9 9 9 9 9 9 8 8 7 7 7 7 7 6 7 8 7 7 7 7 7 6 7 7 7 6 7 7 7 7 6 7
9 - 9 9 9 9 9 9 9 9 8 8 8 8 8 8 7 7 7 7 8 7 7 7 7 7 6 7 7 7 7 7 6 7 6 7 7 7 7 7
9 9 - 9 9 9 9 9 9 9 8 9 7 7 7 8 9 7 7 7 7 7 7 7 7 7 6 7 8 6 7 7 7 7 7 7 6 7 7 6
9 9 10 - 9 9 9 8 8 8 7 9 7 8 7 7 7 8 8 7 7 8 7 7 6 7 7 7 7 7 6 6 7 6 7 7 7 7 7 7
9 9 9 9 - 9 9 9 8 8 8 7 7 8 7 8 8 8 7 7 7 8 8 7 6 6 7 8 7 7 6 6 7 7 6 7 7 6 7 7
9 9 9 9 9 - 9 9 9 8 7 7 8 8 8 7 8 7 7 8 8 6 7 7 6 7 7 7 7 6 6 6 7 7 7 7 6 6 6 6
9 9 9 9 9 9 - 9 9 8 9 8 8 8 7 8 8 7 8 6 7 7 7 7 7 7 6 6 7 7 6 7 6 7 6 7 7 6 7 6
9 9 9 9 9 9 9 - 9 8 8 8 8 9 8 7 8 7 8 7 7 6 7 7 7 7 7 6 7 7 7 7 7 7 7 7 6 7 7 7
9 9 8 9 9 8 8 9 - 7 9 9 9 7 7 7 8 8 8 7 7 7 6 7 7 7 6 7 6 6 6 6 7 6 7 6 6 6 7 6
9 9 9 9 7 7 8 8 8 - 8 9 8 7 7 7 8 7 7 7 7 7 7 7 7 7 6 6 7 6 7 6 7 7 6 7 7 6 6 6
9 9 9 9 9 8 9 9 7 9 - 8 7 8 7 7 6 8 7 7 7 6 7 7 7 7 7 7 6 6 6 6 7 7 7 6 6 7 7 6
| 9 8 8 9 8 7 8 8 8 8 7 - 9 7 7 8 7 7 7 7 7 7 7 6 6 6 7 6 7 6 6 6 7 7 6 6 7 6 7 5
| 8 8 9 8 9 7 7 8 8 7 7 8 - 7 8 9 8 7 7 7 6 6 7 7 7 7 7 6 7 6 7 7 7 6 7 6 6 6 6 6
| 8 8 8 8 8 9 7 8 8 7 7 7 7 - 8 8 8 8 7 7 7 6 7 7 7 6 6 6 6 7 7 7 7 6 6 6 6 6 5 6
| 6 7 9 9 9 7 7 8 7 7 8 7 8 7 - 6 8 7 7 7 8 7 7 7 7 6 6 7 7 7 6 7 6 7 7 6 6 6 4 5
| 7 7 6 8 7 8 8 8 7 7 8 7 8 9 7 - 7 7 7 8 7 7 6 7 7 7 7 6 7 6 7 6 6 6 6 6 6 5 5 5
7 9 7 8 9 7 8 7 8 8 8 7 7 7 7 7 - 7 8 7 8 7 7 7 7 7 6 7 7 6 6 7 6 6 6 4 5 5 5 5
rank 8 8 8 7 9 7 8 7 7 7 8 7 7 7 7 7 8 - 7 7 7 7 7 7 7 6 7 7 7 6 6 7 7 6 6 6 6 5 4 5
of 7 7 7 8 6 8 6 7 8 7 6 7 7 7 7 7 7 7 - 7 7 7 7 7 7 6 6 7 6 6 6 6 6 6 6 6 6 5 5 4
the 8 7 8 8 7 8 8 7 7 7 7 7 7 7 7 7 7 7 8 - 7 7 7 7 7 7 7 6 7 6 6 6 6 5 5 5 5 5 4 4
other 8 7 7 8 7 7 7 7 8 7 7 7 8 7 7 7 7 7 7 7 - 7 7 6 7 7 7 7 6 6 7 6 6 6 5 5 5 5 5 5
process 7 6 6 7 7 7 8 7 7 6 6 7 6 7 6 7 8 7 7 8 7 - 7 7 7 7 7 7 6 6 6 6 6 5 5 5 4 4 4 4
7 8 7 7 7 7 7 7 8 8 7 7 7 7 7 6 7 6 7 7 7 7 - 7 7 7 7 6 6 6 4 5 5 6 4 4 4 6 5 5
| 7 6 7 7 7 6 6 7 6 8 7 8 7 7 7 7 7 7 7 7 7 7 7 - 7 6 6 6 6 5 5 5 6 5 4 4 5 5 4 4
| 7 7 7 6 7 7 7 7 8 7 6 7 6 6 7 6 6 6 6 7 6 7 7 7 - 6 6 6 5 5 5 5 5 4 4 5 6 4 5 4
| 6 7 7 7 7 7 7 7 8 8 8 7 7 7 6 7 7 7 6 6 7 7 7 6 5 - 6 5 6 6 5 5 5 4 5 5 5 4 4 4
| 7 7 6 7 7 7 7 8 7 7 7 7 6 7 7 7 7 7 6 7 6 6 6 5 5 4 - 5 5 5 4 5 5 5 4 5 5 4 4 4
| 7 7 7 8 7 6 7 6 7 7 7 7 7 6 7 7 7 7 6 6 6 6 6 4 6 4 5 - 5 4 4 5 4 4 5 5 5 4 4 4
V 7 6 8 7 7 6 6 7 6 7 7 7 7 7 6 7 7 6 6 6 7 6 6 5 6 5 5 4 - 4 5 5 4 4 4 4 4 4 4 5
6 6 6 6 6 6 7 8 7 6 7 7 7 7 6 6 7 6 6 5 5 6 6 5 5 6 5 5 4 - 5 4 4 4 4 4 4 6 4 4
6 6 6 7 6 7 7 7 7 6 7 7 6 6 7 7 7 6 6 6 6 6 5 4 4 4 5 4 4 4 - 5 5 4 4 4 4 4 4 4
7 6 7 6 6 6 7 7 7 6 7 7 6 6 6 7 6 6 6 5 6 5 5 5 5 4 4 4 5 5 6 - 4 4 4 4 4 4 4 4
7 7 6 6 6 6 6 7 7 7 6 7 6 7 7 7 6 6 5 5 4 5 5 4 4 4 4 5 4 4 5 4 - 4 4 4 5 4 4 4
7 6 7 6 6 6 6 6 7 7 7 7 6 7 6 6 6 6 6 5 5 4 5 4 4 4 4 4 4 4 4 4 4 - 5 4 4 4 4 5
7 6 7 7 7 8 7 7 6 6 6 7 6 6 6 6 5 5 4 5 5 5 4 4 5 4 4 4 4 4 4 4 4 4 - 4 4 4 4 4
7 6 7 6 7 6 6 6 6 6 7 7 6 6 6 6 5 5 5 4 4 4 4 4 5 4 4 4 4 4 4 4 4 4 4 - 4 4 4 4
8 6 6 7 7 7 7 8 7 6 6 7 6 6 6 6 5 4 5 4 5 5 4 5 4 4 5 4 4 4 4 5 5 4 4 4 - 4 4 4
7 7 7 6 7 7 6 7 6 6 7 6 6 6 6 5 4 5 4 5 4 4 4 4 4 4 4 4 4 5 4 4 4 4 4 4 4 - 4 4
7 7 7 7 7 6 7 7 6 7 7 7 7 5 4 5 5 4 5 5 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 - 4
7 6 7 7 6 7 6 6 6 6 6 6 6 6 5 5 6 4 4 5 4 5 4 5 4 4 4 4 5 4 4 4 5 4 4 4 4 4 4 -
We see that there is a strong dependence on process rank.
Presumably, this is due to our polling loop. That is, even
if we receive our message, we still have to poll the higher
numbered ranks before we complete the receive operation.
Performance Measurements: After Optimization
We consider three metrics:
- HPCC "pingpong" latency
- OSU latency (0 bytes)
- OSU latency (8 bytes)
We report data for:
- OMPI "out of the box"
- after implementation of steps 1-2 (send side)
- after implementation of steps 1-3 (send and receive sides)
- after implementation of steps 1-4 (send and receive sides, plus data convertor)
The data are from machines that I just happened to have
available.
There is a bit of noise in these results, but the implications,
based on these and other measurements, are:
- There is some improvement from the send side.
- There is more improvement from the receive side.
- The data convertor improvements help a little more (a few percent)
for non-null messages.
- The degree of improvement depends on how fast the CPU is relative
to the memory -- that is, how important software overheads are
versus hardware latency.
- If the CPU is fast (and hardware latency is the bottleneck),
these improvements are less -- say, 20-30%.
- If the CPU is slow (and software costs are the bottleneck),
the improvements are more dramatic --
nearly a factor of 2 for non-null messages.
- As np is increased, latency stays flat. This can
represent a 10× or more improvement over out-of-the-box OMPI.
V20z
Here are results for a V20z (burl-ct-v20z-11):
              HPCC   OSU0   OSU8
out of box     838    770    850   nsec
Steps 1-2      862    770    860
Steps 1-3      670    610    670
Steps 1-4      642    580    610
F6900
Here are np=2 results from a 1.05-GHz (1.2?) UltraSPARC-IV F6900 server:
              HPCC   OSU0   OSU8
out of box    3430   2770   3340   nsec
Steps 1-2     2940   2660   3090
Steps 1-3     1854   1650   1880
Steps 1-4     1660   1640   1750
Here is the dependence on process count using HPCC:
             OMPI
       "out of the box"          optimized
comm  -------------------   -------------------
size    min    avg    max     min    avg    max
   2   2688   2719   2750    1750   1781   1812   nsec
   4   2812   2875   3000    1750   1802   1812
   6   2875   3050   3250    1687   1777   1812
   8   2875   3299   3625    1687   1773   1812
  10   2875   3447   3812    1687   1789   1812
  12   3063   3687   4375    1687   1796   1813
  16   2687   4093   5063    1500   1784   1875
  20   2812   4492   6000    1687   1788   1875
  24   3125   5026   6562    1562   1776   1875
  28   3250   5326   7250    1500   1764   1813
  32   3500   5830   8375    1562   1755   1875
  36   3750   6199   8938    1562   1755   1875
  40   4062   6753  10187    1500   1742   1812
Note:
- At np=2, these optimizations lead to a 2×
improvement in shared-memory latency.
- Non-null messages incur more than a 10% penalty,
which is largely addressed by our data-convertor
optimization.
- At larger np, we maintain our fast performance
while OMPI "out of the box" keeps slowing down more and
more and more.
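To make the quoted factors easy to check against the tables, here is a small helper sketch. The function names are hypothetical, and the input values in the usage note below are taken directly from the V20z and F6900 tables.

```c
#include <assert.h>

/* Ratio of old latency to new latency (2.0 means "twice as fast"). */
static double speedup(double before, double after)
{
    assert(after > 0.0);
    return before / after;
}

/* Latency reduction as a percentage of the original latency. */
static double percent_reduction(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

For example, speedup(3430, 1660) is about 2.07 for the F6900 np=2 HPCC numbers (out of box versus steps 1-4), speedup(6753, 1742) is about 3.88 for the np=40 average, speedup(10187, 1812) is about 5.62 for the np=40 maximum, and percent_reduction(838, 642) is about 23.4% for the V20z HPCC numbers.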
M9000
Here are results for a 128-core M9000. I think the system has:
- 2 hardware threads per core (but we only use one hardware thread per core)
- 4 cores per socket
- 4 sockets per board
- 4 boards per (half?)
- 2 (halves?) per system
As one separates the sender and receiver, hardware latency increases.
Here is the hierarchy:
                  latency (nsec)        bandwidth (Mbyte/sec)
               out-of-box  fastpath    out-of-box  fastpath
(on-socket?)       810        480         2000       2000
(on-board?)       2050       1820         1900       1900
(half?)           3030       2840         1680       1680
                  3150       2960         1660       1660
Note:
- Fastpath reduces latency by a few hundred nsec.
- This latency improvement is striking when the hardware
latency is small, but less noticeable as the hardware
latency increases.
- Bandwidth is not very sensitive to hardware latency
(due to prefetch) and not at all to fast-path optimizations.
Here are HPCC pingpong latencies for increasing process counts:
            out-of-box                fastpath
 np   ----------------------   --------------------
        min     avg      max     min    avg     max
  2     812     812      812     499    499     499
  4     874     921      999     437    494     562
  8     937    1847     2624     437   1249    1874
 16    1062    2430     2937     437   1557    1937
 32    1562    3850     5437     375   2211    2875
 64    2687    8329    15874     437   2535    3062
 80    3499   16854    41749     374   2647    3437
 96    3812   31159   100812     374   2717    3437
128    5187  125774   335187     437   2793    3499
The improvements are tremendous:
- At low np, latencies are low since sender and
receiver can be colocated. Nevertheless, fast-path
optimizations provided a nearly 2× improvement.
- As np increases, fast-path latency also increases,
but this is due to higher hardware latencies. Indeed,
the "min" numbers even drop a little. The "max" fast-path
numbers basically only represent the increase in hardware
latency.
- As np increases, OMPI "out of the box" latency
suffers catastrophically. Not only is there the issue
of more connections to poll, but the polling behaviors
of non-participating processes wreak havoc on the performance
of measured processes.
We can separate the two sources of latency degradation by
putting the np-2 non-participating processes to sleep.
In that case, latency only rises to about 10-20 µsec.
So, polling of many connections causes a substantial rise
in latency, while the disturbance of hard-poll loops on
system performance is responsible for even more degradation.
Actually, even bandwidth benefits:
          out-of-box              fastpath
 np   ------------------    ------------------
       min    avg    max     min    avg    max
  2   2015   2034   2053    2028   2039   2051
  4   2002   2043   2077    1993   2032   2065
  8   1888   1959   2035    1897   1969   2088
 16   1863   1934   2046    1856   1937   2066
 32   1626   1796   2038    1581   1798   2068
 64   1557   1709   1969    1591   1729   2084
 80   1439   1619   1902    1561   1706   2059
 96   1281   1452   1722    1500   1689   2005
128    677    835   1276     893   1671   1906
Here, we see that even bandwidth suffers "out of the box"
as the number of hard-spinning processes increases. Note
the degradation in "out-of-box" average bandwidths as np
increases. In contrast, the "fastpath" average holds up well.
(The np=128 min fastpath number 893 Mbyte/sec is poor,
but analysis shows it to be a measurement outlier.)
MPI_Sendrecv()
We should also get these optimizations into MPI_Sendrecv() in
order to speed up the HPCC "ring" results. E.g., here are latencies
in µsecs for a performance measurement based on HPCC "ring" tests.
==================================================
np=64
                             natural   random
"out of box"                   11.7     10.9
fast path                       8.3      6.2
fast path and 100 warmups       3.5      3.6
==================================================
np=128 latency
                             natural   random
"out of box"                  242.9    226.1
fast path                      56.6     37.0
fast path and 100 warmups       4.2      4.1
==================================================
There happen to be two problems here:
- We need fast-path optimizations in MPI_Sendrecv()
for improved performance.
- The MPI collective operation preceding the "ring" measurement
has "ragged" exit times. So, the "ring" timing starts well
before all of the processes have entered that measurement.
This is a separate OMPI performance problem that must be
handled as well for good HPCC results.
Open Issues
Here are some open issues:
- Side effects. Should the sendi and
recvi functions leave any side effects if they
do not complete the specified operation?
To my taste, they should not.
Currently, however, the sendi function is expected
to allocate a descriptor if it can, even if it cannot
complete the entire send operation.
- recvi: BTL and match header.
An incoming message starts
with a "match header", with such data as MPI source rank,
MPI communicator, and MPI tag for performing MPI message
matching. Presumably, the BTL knows nothing about this
header. Message matching is performed, for example, via
PML callback functions. We are aggressively trying to
optimize this code path, however, so we should consider
alternatives to that approach.
One alternative is simply for the BTL to perform a byte-by-byte
comparison between the received header and the specified
header. The PML already tells the BTL how many bytes are
in the header.
One problem with this approach is that the fast path would
not be able to support the wildcard tag MPI_ANY_TAG.
Further, it leaves open the question of how one extracts information
(such as source or tag) from this header for the MPI_Status
structure.
We can imagine a variety of solutions here, but so far
we've implemented a very simple (even if architecturally
distasteful) solution: we hardwire information (previously
private to the PML) about the match header into the BTL.
That approach can be replaced with other solutions.
- MPI_Sendrecv() support. As discussed
earlier, we should support fast-path optimizations for
"immediate" send-receive operations. Again, this may
entail some movement of current OMPI architectural
boundaries.
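The trade-off raised above under "recvi: BTL and match header" can be made concrete with a sketch. The header layout, the toy_* names, and the wildcard value TOY_ANY_TAG are assumptions for illustration; they are not the real PML match header or the real MPI_ANY_TAG constant.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_ANY_TAG (-1)   /* hypothetical wildcard value */

/* Hypothetical match header; three 32-bit fields, so no padding bytes
 * can disturb a byte-by-byte comparison. */
typedef struct {
    int32_t ctx;   /* communicator context id */
    int32_t src;   /* source rank */
    int32_t tag;   /* message tag */
} toy_match_hdr_t;

/* Opaque comparison: the BTL needs no knowledge of the fields,
 * but a wildcard in the posted header can never match. */
static int toy_match_bytes(const toy_match_hdr_t *posted,
                           const toy_match_hdr_t *incoming)
{
    return memcmp(posted, incoming, sizeof *posted) == 0;
}

/* Field-aware comparison: supports the wildcard tag, at the cost of
 * hardwiring the header layout into the caller. */
static int toy_match_fields(const toy_match_hdr_t *posted,
                            const toy_match_hdr_t *incoming)
{
    assert(posted != NULL && incoming != NULL);
    return posted->ctx == incoming->ctx &&
           posted->src == incoming->src &&
           (posted->tag == TOY_ANY_TAG || posted->tag == incoming->tag);
}
```

With the field-aware version, the source and tag needed for the MPI_Status structure can also be read directly from the incoming header, which the opaque comparison does not allow.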
Other optimizations that are needed for good HPCC results
include:
- reducing the degradation due to hard spin waits
- improving the performance of collective operations
(which "artificially" degrade HPCC "ring" test results)