
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] knem/openmpi performance?
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2013-07-12 06:55:43


FWIW: a long time ago (read: many Open MPI / knem versions ago), I did a few benchmarks comparing knem and non-knem Open MPI installations. IIRC, I used the usual suspects like NetPIPE, the NAS Parallel Benchmarks (NPBs), etc. There was a modest performance improvement (I don't remember the numbers offhand); it was a smaller improvement than I had hoped for -- particularly in point-to-point message passing latency (e.g., via NetPIPE).

Let me digress into a little background...

The normal non-knem shared memory pattern is to copy a message from the source buffer in the source process to an area in shared memory. The receiver then copies from shared memory to the target buffer in its process. For large messages, this transfer is pipelined so that the receiver doesn't have to wait for the whole buffer to be copied to shared memory before it starts copying out to the target buffer. This is what's known as a "copy-in/copy-out" scheme -- you can think of it as 2 overlapping mem copies.

The knem shared memory implementation still uses the shared memory block for short messages, coordination, and rendezvous. But for large messages, the pipelined copy-in/copy-out is replaced with a direct copy from the source buffer in the source process to the target buffer in the receiver process (no pipelining is necessary, of course). So there's only 1 mem copy for the bulk of the large message.

There's an obvious difference here: the knem version uses 1 mem copy for the bulk of a large message, and the non-knem version uses 2 mem copies. So why wouldn't the knem version kick the non-knem version's butt?

I didn't dig deeply into it, but I rationalized that Open MPI's pipelined shared memory copies must be pretty good. If you view this on a timeline, it might look like this (skipping lots of details about the initial rendezvous, etc.):

Non-knem / copy-in/copy-out scheme
==================================

Sender copying to shmem T=N
   |----------------------------------------------------|
        |----------------------------------------------------|
     Receiver copying from shmem T=N+x

You can see that the completion time is T=N+x, where x is some small number.

Knem scheme
===========

Sender copying to receiver T=N
   |----------------------------------------------------|

The completion time here is T=N -- not T=N+x.

(disclaimer: it's been a loooong time since I've looked at the code; I don't remember if, in OMPI's knem scheme, the sender or the receiver does the copy).

From these timelines, you can see that if OMPI's pipelining is good, the overall performance win of an individual send/receive of knem vs. no-knem is not that huge.

Huh. Disappointing. :-(

BUT.

Then I expanded my benchmarking to scale up the number of MPI processes on each server. *This* is where the real win is.

As you increase the number of MPI processes that are concurrently sending/receiving to/from each other, the "win" of knem becomes (much) more evident.

In short: doing 1 memcopy consumes half the memory bandwidth of 2 mem copies. So when you have lots of MPI processes competing for memory bandwidth, it turns out that having each MPI process use half the bandwidth is a Really Good Idea. :-) This allows more MPI processes to do shared memory communications before you hit the memory bandwidth bottleneck.

Darius Buntinas, Brice Goglin, et al. wrote an excellent paper about exactly this set of issues; see http://runtime.bordeaux.inria.fr/knem/. IIRC, it was the "Cache-Efficient, Intranode Large-Message MPI Communication with MPICH2-Nemesis" paper (but that was only after a quick glance at the titles this morning -- it might not be exactly that paper).

On Jul 12, 2013, at 5:07 AM, Mark Dixon <m.c.dixon_at_[hidden]> wrote:

> Hi,
>
> I'm taking a look at knem, to see if it improves the performance of any applications on our QDR InfiniBand cluster, so I'm eager to hear about other people's experiences. This doesn't appear to have been discussed on this list before.
>
> I appreciate that any effect that knem will have is entirely dependent on the application, scale and input data, but:
>
> * Does anyone know of any examples of popular software packages that benefit particularly from the knem support in openmpi?
>
> * Has anyone noticed any downsides to using knem?
>
> Thanks,
>
> Mark
> --
> -----------------------------------------------------------------
> Mark Dixon Email : m.c.dixon_at_[hidden]
> HPC/Grid Systems Support Tel (int): 35429
> Information Systems Services Tel (ext): +44(0)113 343 5429
> University of Leeds, LS2 9JT, UK
> -----------------------------------------------------------------
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/