
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: sm Latency
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-01-20 06:56:53


Richard Graham wrote:
> First, the performance improvements look really nice.
> A few questions:
> - How much of an abstraction violation does this introduce? This
> looks like the btl needs to start "knowing" about MPI level semantics.
> Currently, the btl purposefully is ulp agnostic. I ask for 2 reasons
> - you mention having the btl look at the match header (if I understood
> correctly)
> - not clear to me what you mean by returning the header to the list if
> the irecv does not complete. If it does not complete, why not just
> pass the header back for further processing, if all this is happening
> at the pml level ?
> - The measurements seem to be very dual process specific. Have you
> looked at the impact of these changes on other applications at the
> same process count? "Real" apps would be interesting, but even hpl
> would be a good start.
> The current sm implementation is aimed only at small smp node counts,
> which were really the only relevant type of system when this code was
> written 5 years ago. For large core counts there is a rather simple
> change that could be put in, and it will give
> you flat scaling for the sort of tests you are running. If you replace
> the fifos with a single linked list per process in shared memory, with
> senders to this process adding match envelopes atomically, and with each
> process reading its own linked list (multiple writers and a single reader
> in the non-threaded situation), there will be only one place to poll,
> regardless of the number of procs involved in the run. One still needs
> other optimizations to lower the absolute latency -- perhaps what you
> have suggested. If one really has all N procs trying to write to the
> same fifo at once, performance will stink because of contention, but
> most apps don't have that behaviour.
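>
> For concreteness, a minimal sketch of this kind of structure with
> C11 atomics and made-up names (not OMPI code): each receiver owns
> one list head in shared memory, any sender pushes a match envelope
> with a compare-and-swap, and only the owner drains the list, so
> there is a single place to poll no matter how many senders exist:
>
>     #include <stdatomic.h>
>     #include <stddef.h>
>
>     /* Hypothetical match envelope; a real one would live in the shared
>      * segment (so "next" would be an offset, not a raw pointer). */
>     typedef struct envelope {
>         struct envelope *next;
>         int src, tag, comm;
>     } envelope_t;
>
>     /* One list head per receiving process: many writers, one reader. */
>     typedef struct {
>         _Atomic(envelope_t *) head;
>     } recv_list_t;
>
>     /* Any sender: push an envelope with a lock-free CAS loop. */
>     static void recv_list_push(recv_list_t *list, envelope_t *env)
>     {
>         envelope_t *old = atomic_load_explicit(&list->head,
>                                                memory_order_relaxed);
>         do {
>             env->next = old;
>         } while (!atomic_compare_exchange_weak_explicit(
>                      &list->head, &old, env,
>                      memory_order_release, memory_order_relaxed));
>     }
>
>     /* Owner only: detach everything posted so far in one shot
>      * (returned most-recent-first) and then walk it for matches. */
>     static envelope_t *recv_list_drain(recv_list_t *list)
>     {
>         return atomic_exchange_explicit(&list->head, NULL,
>                                         memory_order_acquire);
>     }
>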
If I remember correctly, you can get a slowdown with the method you mention
above even with a handful (4-6) of processes writing to the same destination.

--td

> Rich
>
>
> On 1/17/09 1:48 AM, "Eugene Loh" <Eugene.Loh_at_[hidden]> wrote:
>
>
>
> ------------------------------------------------------------------------
> *RFC: sm Latency*
>
> *WHAT:* Introducing optimizations to reduce ping-pong latencies
> over the sm BTL.
>
> *WHY:* This is a visible benchmark of MPI performance. We can
> improve shared-memory latencies by anywhere from 30% (if hardware
> latency is the limiting factor) to 2x or more (if MPI software
> overhead is the limiting factor). At high process counts, the
> improvement can be 10x or more.
>
> *WHERE:* Somewhat in the sm BTL, but very importantly also in the
> PML. Changes can be seen in ssh://www.open-mpi.org/~tdd/hg/fastpath.
>
> *WHEN:* Upon acceptance. In time for OMPI 1.4.
>
> *TIMEOUT:* February 6, 2009.
> ------------------------------------------------------------------------
> This RFC is being submitted by eugene.loh_at_sun.com.
> *WHY (details)*
>
> The sm BTL typically has the lowest hardware latencies of any
> BTL. Therefore, any OMPI software overhead we otherwise tolerate
> becomes glaringly obvious in sm latency measurements.
>
> In particular, MPI pingpong latencies are oft-cited performance
> benchmarks, popular indications of the quality of an MPI
> implementation. Competitive vendor MPIs optimize this metric
> aggressively, both for np=2 pingpongs and for pairwise pingpongs
> for high np (like the popular HPCC performance test suite).
>
> Performance metrics reported by HPCC include:
>
> * MPI_Send()/MPI_Recv() pingpong latency.
> * MPI_Send()/MPI_Recv() pingpong latency as the number of
> connections grows.
> * MPI_Sendrecv() latency.
>
> The slowdown of latency as the number of sm connections grows
> becomes increasingly important on large SMPs and ever more
> prevalent many-core nodes.
>
> Other MPI implementations, such as Scali and Sun HPC ClusterTools
> 6, introduced such optimizations years ago.
>
> Performance measurements indicate that the speedups we can expect
> in OMPI with these optimizations range from 30% (np=2 measurements
> where hardware is the bottleneck) to 2x (np=2 measurements where
> software is the bottleneck) to over 10x (large np).
> *WHAT (details)*
>
> Introduce an optimized "fast path" for "immediate" sends and
> receives. Several actions are recommended here.
>
> *1. Invoke the sm BTL sendi (send-immediate) function*
>
> Each BTL is allowed to define a "send immediate" (sendi)
> function. A BTL is not required to do so, however, in which case
> the PML calls the standard BTL send function.
>
> A sendi function has already been written for sm, but it has not
> been used due to insufficient testing.
>
> The function should be reviewed, commented in, tested, and used.
>
> The changes are:
>
> * *File*: ompi/mca/btl/sm/btl_sm.c
> * *Declaration/Definition*: mca_btl_sm
>
> Comment in the mca_btl_sm_sendi symbol instead of the NULL
> placeholder so that the already existing sendi function will
> be discovered and used by the PML.
>
> * *Function*: mca_btl_sm_sendi()
>
> Review the existing sm sendi code. My suggestions include:
> o Drop the test against the eager limit, since the PML
> calls this function only when the eager limit is
> respected.
> o Make sure the function has no side effects in the case
> where it does not complete. See Open Issues, the final
> section of this document, for further discussion of
> "side effects".
>
> Mostly, I have reviewed the code and believe it's already
> suitable for use.
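>
> For reference, a toy illustration of the contract the review should
> verify, with made-up types (this is not the actual mca_btl_sm_sendi):
> either the whole send completes inside the one call, or the function
> backs out having modified nothing, so the PML can safely fall back
> to the regular send path:
>
>     #include <stdbool.h>
>     #include <stddef.h>
>     #include <string.h>
>
>     typedef struct {
>         char   buf[256];   /* stand-in for an eager shared-memory slot */
>         size_t len;
>         bool   full;
>     } toy_slot_t;
>
>     static int toy_sendi(toy_slot_t *slot, const void *data, size_t len)
>     {
>         if (slot->full || len > sizeof(slot->buf)) {
>             return -1;                /* not sent; nothing was modified */
>         }
>         memcpy(slot->buf, data, len); /* copy the eager-sized payload   */
>         slot->len  = len;
>         slot->full = true;            /* publish: the send is complete  */
>         return 0;
>     }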
>
> *2. Move the sendi call up higher in the PML*
>
> Profiling pingpong tests, we find that not so much time is
> spent in the sm BTL. Rather, the PML consumes a lot of time
> preparing a "send request". While these complex data
> structures are needed to track progress of a long message
> that will be sent in multiple chunks and progressed over
> multiple entries to and exits from the MPI library, managing
> this large data structure for an "immediate" send (one
> chunk, one call) is overkill. Latency can be reduced
> noticeably if one bypasses this data structure. This means
> invoking the sendi function as early as possible in the PML.
>
> The changes are:
> o *File*: ompi/mca/pml/ob1/pml_ob1_isend.c
> o *Function*: mca_pml_ob1_send()
>
> As soon as we enter the PML send function, try to call
> the BTL sendi function. If this fails for whatever
> reason, continue with the traditional PML send code
> path. If it succeeds, then exit the PML and return up
> to the calling layer without having wrestled with the
> PML send-request data structure.
>
> For better software management, the attempt to find
> and use a BTL sendi function can be organized into a
> new mca_pml_ob1_sendi() function.
>
> o *File*: ompi/mca/pml/ob1/pml_ob1_sendreq.c
> o *Function*: mca_pml_ob1_send_request_start_copy()
>
> Remove this function's attempt to call the BTL sendi function,
> since we've already tried to do so higher up in the PML.
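>
> The shape of the change (a sketch with placeholder names; ob1's real
> entry points take many more arguments):
>
>     #include <stddef.h>
>
>     #define TOY_SUCCESS 0
>
>     /* Placeholders for the real entry points. */
>     int toy_btl_sendi(const void *buf, size_t len);         /* fast path    */
>     int toy_send_request_path(const void *buf, size_t len); /* regular path */
>
>     /* Proposed mca_pml_ob1_send() structure: attempt the immediate
>      * send before any send request is built, and fall back to the
>      * traditional code path only if the attempt does not complete. */
>     int toy_pml_send(const void *buf, size_t len)
>     {
>         if (TOY_SUCCESS == toy_btl_sendi(buf, len)) {
>             return TOY_SUCCESS;   /* done: no send request was allocated */
>         }
>         return toy_send_request_path(buf, len);
>     }
>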
> *3. Introduce a BTL recvi call*
>
> While optimizing the send side of a pingpong
> operation is helpful, it is less than half the job. At
> least as many savings are possible on the receive side.
>
> Corresponding to what we've done on the send side, on
> the receive side we can attempt, as soon as we've
> entered the PML, to call a BTL recvi
> (receive-immediate) function, bypassing the creation
> of a complex "receive request" data structure that is
> not needed if the receive can be completed immediately.
>
> Further, we can perform directed polling. OMPI
> pingpong latencies grow significantly as the number of
> sm connections increases, while competitors (Scali, in
> any case) show absolutely flat latencies with
> increasing np. The recvi function could check one
> connection for the specified receive and exit quickly
> if that message is found.
>
> A BTL is granted considerable latitude in the proposed
> recvi function. The principal requirement is that
> recvi /either/ completes the specified receive
> completely /or else/ behaves as if the function was
> not called at all. (That is, one should be able to
> revert to the traditional code path without having to
> worry about any recvi side effects. So, for example,
> if the recvi function encounters any fragments being
> returned to the process, it is permitted to return
> those fragments to the free list.)
>
> While those are the "hard requirements" for recvi,
> there are also some loose guidelines. Mostly, it is
> understood that recvi should return "quickly" (a loose
> term to be interpreted by the BTL). If recvi can
> quickly complete the specified receive, great! If not,
> it should return control to the PML, which can then
> execute the traditional code path, which can handle
> long messages (multiple chunks, multiple entries into
> the MPI library) and execute other "progress" functions.
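>
> A minimal sketch of these semantics with made-up types (not the
> actual OMPI structures): the receive names its source, so recvi
> polls exactly one connection, and it either completes the receive
> there or backs out without touching anything:
>
>     #include <stdbool.h>
>     #include <stddef.h>
>     #include <string.h>
>
>     /* Toy stand-ins for a per-connection queue and an incoming
>      * fragment; the real match header is private to the PML. */
>     typedef struct {
>         int         tag;
>         const void *payload;
>         size_t      len;
>     } toy_frag_t;
>
>     typedef struct {
>         toy_frag_t *head;   /* NULL when nothing is pending */
>     } toy_queue_t;
>
>     /* Directed polling: check only the connection the receive names.
>      * Return false (with no side effects) if we cannot finish here. */
>     static bool toy_recvi(toy_queue_t *queues, int src, int tag,
>                           void *dst, size_t max_len)
>     {
>         toy_frag_t *frag = queues[src].head;     /* one place to poll */
>         if (NULL == frag || frag->tag != tag || frag->len > max_len) {
>             return false;                        /* fall back to the PML */
>         }
>         memcpy(dst, frag->payload, frag->len);   /* complete the receive */
>         queues[src].head = NULL;                 /* consume the fragment */
>         return true;
>     }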
>
> The changes are:
> + *File*: ompi/mca/btl/btl.h
>
> In this file, we add a typedef declaration for
> what a generic recvi should look like:
>
> typedef int (*mca_btl_base_module_recvi_fn_t)();
>
> We also add a btl_recvi field so that a BTL can
> register its recvi function, if any.
>
> + *File*:
> + ompi/mca/btl/elan/btl_elan.c
> + ompi/mca/btl/gm/btl_gm.c
> + ompi/mca/btl/mx/btl_mx.c
> + ompi/mca/btl/ofud/btl_ofud.c
> + ompi/mca/btl/openib/btl_openib.c
> + ompi/mca/btl/portals/btl_portals.c
> + ompi/mca/btl/sctp/btl_sctp.c
> + ompi/mca/btl/self/btl_self.c
> + ompi/mca/btl/sm/btl_sm.c
> + ompi/mca/btl/tcp/btl_tcp.c
> + ompi/mca/btl/template/btl_template.c
> + ompi/mca/btl/udapl/btl_udapl.c
>
> Each BTL must add a recvi field to its module.
> In most cases, BTLs will not define a recvi
> function, and the field will be set to NULL.
> + *File*: ompi/mca/btl/sm/btl_sm.c
> + *Function*: mca_btl_sm_recvi()
>
> For the sm BTL, we set the field to the name of
> the BTL's recvi function: mca_btl_sm_recvi. We
> also add code to define the behavior of the
> function.
>
> + *File*: ompi/mca/btl/sm/btl_sm.h
> + *Prototype*: mca_btl_sm_recvi()
>
> We also add a prototype for the new function.
>
> + *File*: ompi/mca/pml/ob1/pml_ob1_irecv.c
> + *Function*: mca_pml_ob1_recv()
>
> As soon as we enter the PML, we try to find and
> use a BTL's recvi function. If we succeed, we
> can exit the PML without having invoked the
> heavy-duty PML receive-request data structure.
> If we fail, we simply revert to the traditional
> PML receive code path, without having to worry
> about any side effects that the failed recvi
> might have left.
>
> It is helpful to contain the recvi attempt in a
> new mca_pml_ob1_recvi() function, which we add.
> + *File*: ompi/class/ompi_fifo.h
> + *Function*: ompi_fifo_probe_tail()
>
> We don't want recvi to leave any side effects if
> it encounters a message it is not prepared to
> handle. Therefore, we need to be able to see
> what is on a FIFO without popping that entry off
> the FIFO. So we add this new function, which
> probes the FIFO without disturbing it.
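>
> A sketch of the probe idea on a toy circular FIFO (field names made
> up; not the actual ompi_fifo_t): the probe looks at the tail slot
> but, unlike a pop, leaves the index and the slot untouched, so a
> recvi that declines the message has nothing to undo:
>
>     #include <stddef.h>
>
>     #define TOY_FIFO_SIZE 64
>     #define TOY_FIFO_FREE NULL
>
>     typedef struct {
>         void        *queue[TOY_FIFO_SIZE];
>         unsigned int tail;                  /* next slot to consume */
>     } toy_fifo_t;
>
>     /* Probe: return the tail entry without consuming it, or
>      * TOY_FIFO_FREE if the FIFO is empty at the tail. */
>     static void *toy_fifo_probe_tail(toy_fifo_t *fifo)
>     {
>         return fifo->queue[fifo->tail];
>     }
>
>     /* Pop: consume the tail entry and advance the tail index. */
>     static void *toy_fifo_pop_tail(toy_fifo_t *fifo)
>     {
>         void *entry = fifo->queue[fifo->tail];
>         if (TOY_FIFO_FREE != entry) {
>             fifo->queue[fifo->tail] = TOY_FIFO_FREE;
>             fifo->tail = (fifo->tail + 1) % TOY_FIFO_SIZE;
>         }
>         return entry;
>     }
>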
> *4. Introduce an "immediate" data convertor*
>
> One of our aims here is to reduce latency by
> bypassing expensive PML send and receive request
> data structures. Again, these structures are
> useful when we intend to complete a message over
> multiple chunks and multiple MPI library
> invocations, but are overkill for a message that
> can be completed all at once.
>
> The same is true of data convertors. Convertors
> pack user data into shared-memory buffers or
> unpack them on the receive side. Convertors
> allow a message to be sent in multiple chunks,
> over the course of multiple unrelated MPI calls,
> and for noncontiguous datatypes. These
> sophisticated data structures are overkill in
> some important cases, such as messages that are
> handled in a single chunk and in a single MPI
> call and consist of a single contiguous block of
> data.
>
> While data convertors are not typically too
> expensive, for shared-memory latency, where all
> other costs have been pared back to a minimum,
> convertors become noticeable -- around 10%.
>
> Therefore, we recognize special cases where we
> can have barebones, minimal, data convertors. In
> these cases, we initialize the convertor
> structure minimally -- e.g., a buffer address, a
> number of bytes to copy, and a flag indicating
> that all other fields are uninitialized. If this
> is not possible (e.g., because a non-contiguous
> user-derived datatype is being used), the
> "immediate" send or receive uses data convertors
> normally.
>
> The changes are:
> # *File*: ompi/datatype/convertor.h
>
> First, we add to the convertor flags a new
> flag
>
> #define CONVERTOR_IMMEDIATE 0x10000000
>
> to identify a data convertor that has been
> initialized only minimally.
>
> Further, we add three new functions:
> * ompi_convertor_immediate(): try to
> form an "immediate" convertor
> * ompi_convertor_immediate_pack(): use
> an "immediate" convertor to pack
> * ompi_convertor_immediate_unpack():
> use an "immediate" convertor to unpack
>
> # *File*: ompi/mca/btl/sm/btl_sm.c
> # *Function*: mca_btl_sm_sendi and
> mca_btl_sm_recvi
>
> Use the "immediate" convertor routines to
> pack/unpack.
>
> # *File*: ompi/mca/pml/ob1/pml_ob1_isend.c
> and ompi/mca/pml/ob1/pml_ob1_irecv.c
>
> Have the PML fast path try to construct an
> "immediate" convertor.
> *5. Introduce an "immediate" MPI_Sendrecv()*
>
> The optimizations described here should
> be extended to MPI_Sendrecv() operations.
> In particular, while MPI_Send() and
> MPI_Recv() optimizations improve HPCC
> "pingpong" latencies, we need
> MPI_Sendrecv() optimizations to improve
> HPCC "ring" latencies.
>
> One challenge is the current OMPI MPI/PML
> interface. Today, the OMPI MPI layer
> breaks a Sendrecv call up into
> Irecv/Send/Wait (sketched after the list
> below). This would seem to defeat
> fast-path optimizations, at least for the
> receive. Some options include:
> * allow the MPI layer to call "fast
> path" operations
> * have the PML layer provide a
> Sendrecv interface
> * have the MPI layer emit
> Isend/Recv/Wait and see how
> effectively one can optimize the
> Isend operation in the PML for the
> "immediate" case
> *Performance Measurements: Before Optimization*
>
> More measurements are desirable, but here
> is a sampling of data from platforms that
> I happened to have access to. This data
> characterizes OMPI today, without
> fast-path optimizations.
>
> *OMPI versus Other MPIs*
>
> Here are pingpong latencies, in μsec,
> measured with the OSU latency test for 0
> and 8 bytes.
>
> 0-byte 8-byte
>
> OMPI 0.74 0.84 μsec
> MPICH 0.70 0.77
> We see OMPI lagging MPICH.
>
> Scali and HP MPI are presumably
> /considerably/ faster, but I have no
> recent data.
>
> Among other things, one can see that there
> is about a 10% penalty for invoking data
> convertors.
> *Scaling with Process Count*
>
> Here are HPCC pingpong latencies from a
> different, older platform. Though only
> two processes participate in the pingpong,
> the HPCC test reports that latency for
> different numbers of processes in the job.
> We see that OMPI performance slows
> dramatically as the number of processes is
> increased. Scali (data not available) does
> not show such a slowdown.
>
> np min avg max
>
> 2 2.688 2.719 2.750 usec
> 4 2.812 2.875 3.000
> 6 2.875 3.050 3.250
> 8 2.875 3.299 3.625
> 10 2.875 3.447 3.812
> 12 3.063 3.687 4.375
> 16 2.687 4.093 5.063
> 20 2.812 4.492 6.000
> 24 3.125 5.026 6.562
> 28 3.250 5.326 7.250
> 32 3.500 5.830 8.375
> 36 3.750 6.199 8.938
> 40 4.062 6.753 10.187
> The data show large min-max variations in
> latency. These variations happen to depend
> on sender and receiver ranks. Here are
> latencies (rounded down to the nearest
> μsec) for the np=40 case as a function of
> sender and receiver rank:
>
> [40x40 table of np=40 pingpong latencies in μsec (rounded down),
> indexed by the rank of one process (columns) and the rank of the
> other process (rows); values fall from about 9-10 μsec when both
> ranks are low to about 4 μsec when both ranks are high.]
> We see that there is a strong dependence
> on process rank. Presumably, this is due
> to our polling loop. That is, even if we
> receive our message, we still have to poll
> the higher numbered ranks before we
> complete the receive operation.
> *Performance Measurements: After Optimization*
>
> We consider three metrics:
> * HPCC "pingpong" latency
> * OSU latency (0 bytes)
> * OSU latency (8 bytes)
> We report data for:
> * OMPI "out of the box"
> * after implementation of steps 1-2
> (send side)
> * after implementation of steps 1-3
> (send and receive sides)
> * after implementation of steps 1-4
> (send and receive sides, plus data
> convertor)
> The data are from machines that I just
> happened to have available.
>
> There is a bit of noise in these results,
> but the implications, based on these and
> other measurements, are:
> * There is some improvement from the
> send side.
> * There is more improvement from the
> receive side.
> * The data convertor improvements help
> a little more (a few percent) for
> non-null messages.
> * The degree of improvement depends on
> how fast the CPU is relative to the
> memory -- that is, how important
> software overheads are versus
> hardware latency.
> o If the CPU is fast (and
> hardware latency is the
> bottleneck), these
> improvements are less -- say,
> 20-30%.
> o If the CPU is slow (and
> software costs are the
> bottleneck), the improvements
> are more dramatic -- nearly a
> factor of 2 for non-null
> messages.
> * As np is increased, latency stays
> flat. This can represent a 10x or
> more improvement over out-of-the-box
> OMPI.
> *V20z*
>
> Here are results for a V20z
> (burl-ct-v20z-11):
>
> HPCC OSU0 OSU8
>
> out of box 838 770 850 nsec
> Steps 1-2 862 770 860
> Steps 1-3 670 610 670
> Steps 1-4 642 580 610
> *F6900*
>
> Here are np=2 results from a 1.05-GHz
> (1.2?) UltraSPARC-IV F6900 server:
>
> HPCC OSU0 OSU8
>
> out of box 3430 2770 3340 nsec
> Steps 1-2 2940 2660 3090
> Steps 1-3 1854 1650 1880
> Steps 1-4 1660 1640 1750
> Here is the dependence on process count
> using HPCC:
>
> OMPI
> "out of the box" optimized
> comm ----------------- -----------------
> size min avg max min avg max
>
> 2 2688 2719 2750 1750 1781 1812 nsec
> 4 2812 2875 3000 1750 1802 1812
> 6 2875 3050 3250 1687 1777 1812
> 8 2875 3299 3625 1687 1773 1812
> 10 2875 3447 3812 1687 1789 1812
> 12 3063 3687 4375 1687 1796 1813
> 16 2687 4093 5063 1500 1784 1875
> 20 2812 4492 6000 1687 1788 1875
> 24 3125 5026 6562 1562 1776 1875
> 28 3250 5326 7250 1500 1764 1813
> 32 3500 5830 8375 1562 1755 1875
> 36 3750 6199 8938 1562 1755 1875
> 40 4062 6753 10187 1500 1742 1812
> Note:
> * At np=2, these optimizations lead to
> a 2x improvement in shared-memory
> latency.
> * Non-null messages incur more than a
> 10% penalty, which is largely
> addressed by our data-convertor
> optimization.
> * At larger np, we maintain our fast
> performance while OMPI "out of the
> box" keeps slowing down more and more.
> *M9000*
>
> Here are results for a 128-core M9000. I
> think the system has:
> * 2 hardware threads per core (but we
> only use one hardware thread per core)
> * 4 cores per socket
> * 4 sockets per board
> * 4 boards per (half?)
> * 2 (halves?) per system
> As one separates the sender and receiver,
> hardware latency increases. Here is the
> hierarchy:
>
> latency (nsec) bandwidth (Mbyte/sec)
> out-of-box fastpath out-of-box fastpath
> (on-socket?) 810 480 2000 2000
> (on-board?) 2050 1820 1900 1900
> (half?) 3030 2840 1680 1680
> 3150 2960 1660 1660
> Note:
> * Latency benefits some hundreds of
> nsecs with fastpath.
> * This latency improvement is striking
> when the hardware latency is small,
> but less noticeable as the
> hardware latency increases.
> * Bandwidth is not very sensitive to
> hardware latency (due to prefetch)
> and not at all to fast-path
> optimizations.
> Here are HPCC pingpong latencies for
> increasing process counts:
>
> out-of-box fastpath
> np ----------------- -----------------
> min avg max min avg max
>
> 2 812 812 812 499 499 499
> 4 874 921 999 437 494 562
> 8 937 1847 2624 437 1249 1874
> 16 1062 2430 2937 437 1557 1937
> 32 1562 3850 5437 375 2211 2875
> 64 2687 8329 15874 437 2535 3062
> 80 3499 16854 41749 374 2647 3437
> 96 3812 31159 100812 374 2717 3437
> 128 5187 125774 335187 437 2793 3499
> The improvements are tremendous:
> * At low np, latencies are low since
> sender and receiver can be
> colocated. Nevertheless, fast-path
> optimizations provided a nearly 2x
> improvement.
> * As np increases, fast-path latency
> also increases, but this is due to
> higher hardware latencies. Indeed,
> the "min" numbers even drop a
> little. The "max" fast-path numbers
> basically only represent the
> increase in hardware latency.
> * As np increases, OMPI "out of the
> box" latency suffers
> catastrophically. Not only is there
> the issue of more connections to
> poll, but the polling behaviors of
> non-participating processes wreak
> havoc on the performance of measured
> processes.
> * We can separate the two sources of
> latency degradation by putting the
> np-2 non-participating processes to
> sleep. In that case, latency only
> rises to about 10-20 μsec. So,
> polling of many connections causes a
> substantial rise in latency, while
> the disturbance of hard-poll loops
> on system performance is responsible
> for even more degradation.
> Actually, even bandwidth benefits:
>
> out-of-box fastpath
> np -------------- -------------
> min avg max min avg max
>
> 2 2015 2034 2053 2028 2039 2051
> 4 2002 2043 2077 1993 2032 2065
> 8 1888 1959 2035 1897 1969 2088
> 16 1863 1934 2046 1856 1937 2066
> 32 1626 1796 2038 1581 1798 2068
> 64 1557 1709 1969 1591 1729 2084
> 80 1439 1619 1902 1561 1706 2059
> 96 1281 1452 1722 1500 1689 2005
> 128 677 835 1276 893 1671 1906
> Here, we see that even bandwidth
> suffers "out of the box" as the
> number of hard-spinning processes
> increases. Note the degradation in
> "out-of-box" average bandwidths as
> np increases. In contrast, the
> "fastpath" average holds up well.
> (The np=128 min fastpath number 893
> Mbyte/sec is poor, but analysis
> shows it to be a measurement outlier.)
> *MPI_Sendrecv()*
>
> We should also get these
> optimizations into MPI_Sendrecv() in
> order to speed up the HPCC "ring"
> results. E.g., here are latencies in
> μsec for a performance measurement
> based on HPCC "ring" tests.
>
> ==================================================
> np=64
> natural random
>
> "out of box" 11.7 10.9
> fast path 8.3 6.2
> fast path and 100 warmups 3.5 3.6
> ==================================================
> np=128 latency
> natural random
>
> "out of box" 242.9 226.1
> fast path 56.6 37.0
> fast path and 100 warmups 4.2 4.1
> ==================================================
> There happen to be two problems here:
> o We need fast-path
> optimizations in
> MPI_Sendrecv() for improved
> performance.
> o The MPI collective operation
> preceding the "ring"
> measurement has "ragged" exit
> times. So, the "ring" timing
> starts well before all of the
> processes have entered that
> measurement. This is a
> separate OMPI performance
> problem that must be handled
> as well for good HPCC results.
> *Open Issues*
>
> Here are some open issues:
> o *Side effects*. Should the
> sendi and recvi functions
> leave any side effects if they
> do not complete the specified
> operation?
>
> To my taste, they should not.
>
> Currently, however, the sendi
> function is expected to
> allocate a descriptor if it
> can, even if it cannot
> complete the entire send
> operation.
> o *recvi: BTL and match
> header*. An incoming message
> starts with a "match header",
> carrying such data as the MPI
> source rank, MPI communicator,
> and MPI tag for performing MPI
> message matching. Presumably,
> the BTL knows nothing about
> this header. Message matching
> is performed, for example, via
> PML callback functions. We are
> aggressively trying to
> optimize this code path,
> however, so we should consider
> alternatives to that approach.
>
> One alternative is simply for
> the BTL to perform a
> byte-by-byte comparison
> between the received header
> and the specified header (a
> sketch appears at the end of
> this section). The PML already
> tells the BTL how many bytes
> are in the header.
>
> One problem with this approach
> is that the fast path would
> not be able to support the
> wildcard tag MPI_ANY_TAG.
>
> Further, it leaves open the
> question of how one extracts
> information (such as source or
> tag) from this header for the
> MPI_Status structure.
>
> We can imagine a variety of
> solutions here, but so far
> we've implemented a very
> simple (even if
> architecturally distasteful)
> solution: we hardwire
> information (previously
> private to the PML) about the
> match header into the BTL.
>
> That approach can be replaced
> with other solutions.
> o *MPI_Sendrecv() support*. As
> discussed earlier, we should
> support fast-path
> optimizations for "immediate"
> send-receive operations.
> Again, this may entail some
> movement of current OMPI
> architectural boundaries.
> Other optimizations that are
> needed for good HPCC results
> include:
> + reducing the degradation
> due to hard spin waits
> + improving the
> performance of
> collective operations
> (which "artificially"
> degrade HPCC "ring" test
> results)
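>
> Returning to the match-header question above, here is a minimal
> sketch of the byte-by-byte alternative, with a made-up header layout
> (the real ob1 match header stays private to the PML). The PML hands
> the BTL the expected header bytes and their length; the BTL only
> compares, so it needs to know nothing about the fields:
>
>     #include <stdbool.h>
>     #include <stddef.h>
>     #include <string.h>
>
>     /* Illustrative header layout only, not the actual ob1 header. */
>     typedef struct {
>         int ctx;   /* communicator context id */
>         int src;   /* source rank             */
>         int tag;   /* message tag             */
>     } toy_match_hdr_t;
>
>     /* The BTL-side test: a blind byte comparison of the incoming
>      * header against the header the PML said to expect. */
>     static bool toy_header_matches(const void *incoming,
>                                    const void *expected, size_t hdr_len)
>     {
>         return 0 == memcmp(incoming, expected, hdr_len);
>     }
>
> As noted above, a blind comparison like this cannot express
> MPI_ANY_TAG, nor does it by itself recover the source and tag for
> the MPI_Status structure, which is why the current implementation
> hardwires knowledge of the header layout into the BTL instead.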
>
> ------------------------------------------------------------------------