Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] efficient strategy with temporary message copy
From: Jeff Squyres (jsquyres) (jsquyres_at_[hidden])
Date: 2014-03-17 11:31:46

On Mar 16, 2014, at 10:24 PM, christophe petit <christophe.petit09_at_[hidden]> wrote:

> I am studying the optimization strategy when the number of communication functions in a code is high.
> My courses on MPI say two things for optimization which are contradictory :
> 1*) You have to use temporary message copy to allow non-blocking sending and uncouple the sending and receiving

There are many schools of thought here, and the real answer depends on your application.

If the message is "short" (and the exact definition of "short" depends on your platform -- it varies depending on your CPU, your memory, your CPU/memory interconnect, ...etc.), then copying to a pre-allocated bounce buffer is typically a good idea. That lets you keep using your "real" buffer and not have to wait until communication is done.

For "long" messages, the equation is a bit different. If "long" isn't "enormous", you might be able to have N buffers available, and simply work on 1 of them at a time in your main application while the others are used for ongoing non-blocking communication. These are sometimes called "shadow" copies, or "ghost" copies.

Such shadow copies are most useful when, for example, you receive something each iteration. Something like this:

  buffer[0] = malloc(...);
  buffer[1] = malloc(...);
  current = 0;
  while (still_doing_iterations) {
      /* start receiving the next message into buffer[current] */
      MPI_Irecv(buffer[current], ..., &req);
      /* meanwhile, work on buffer[1 - current] (received last iteration) */
      MPI_Wait(&req, MPI_STATUS_IGNORE);
      current = 1 - current;
  }

You get the idea.

> 2*) Avoid using temporary message copy because the copy will add extra cost on execution time.

It will, if the memcpy cost is significant (especially compared to the network time to send it). If the memcpy is small/insignificant, then don't worry about it.

You'll need to determine where this crossover point is, however.
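The memcpy half of that comparison can be measured without any MPI at all. Here is a rough sketch; memcpy_seconds is a made-up helper name, and it uses clock() from standard C, which reports CPU time (inside an MPI program, MPI_Wtime() would be the more natural timer). The network half needs a separate measurement on your own system, e.g. a ping-pong test at the same message sizes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Average CPU seconds per memcpy of `bytes` bytes over `reps` repetitions,
 * or -1.0 if the buffers cannot be allocated. */
double memcpy_seconds(size_t bytes, int reps)
{
    assert(bytes > 0 && reps > 0);
    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    /* Touch both buffers first so page faults are not part of the timing. */
    memset(src, 1, bytes);
    memset(dst, 0, bytes);

    volatile char sink = 0;
    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        memcpy(dst, src, bytes);
        /* Volatile read of the destination, intended to keep the
         * compiler from optimizing the copy away. */
        sink = dst[(size_t)i % bytes];
    }
    clock_t stop = clock();
    (void)sink;

    free(src);
    free(dst);
    return (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

Calling this for a range of sizes (say 64 bytes up to a few megabytes) and comparing against the measured send time for the same sizes gives a first estimate of where copying stops being "free". The numbers will vary with cache state and compiler flags, so treat them as a guide rather than a precise cutoff.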

Also keep in mind that MPI and/or the underlying network stack will likely be doing these kinds of things under the covers for you. Indeed, if you send short messages -- even via MPI_SEND -- it may return "immediately", indicating that MPI says it's safe for you to use the send buffer. But that doesn't mean that the message has even actually left the current server and gone out onto the network yet (i.e., some other layer below you may have just done a memcpy because it was a short message, and the processing/sending of that message is still ongoing).

> And then, we are advised to do :
> - replace MPI_SEND by MPI_SSEND (synchronous blocking sending) : it is said that execution time is divided by a factor of 2

This very, very much depends on your application.

MPI_SSEND won't return until the receiver has started to receive the message.

For some communication patterns, putting in this additional level of synchronization is helpful -- it keeps all MPI processes in tighter synchronization and you might experience less jitter, etc., so overall execution time is shorter.

But for others, it adds unnecessary delay.

I'd say it's an over-generalization that simply replacing MPI_SEND with MPI_SSEND always reduces execution time by a factor of 2.

> - use MPI_ISSEND and MPI_IRECV with MPI_WAIT function to synchronize (synchronous non-blocking sending) : it is said that execution time is divided by a factor of 3

Again, it depends on the app. Generally, non-blocking communication is better -- *if your app can effectively overlap communication and computation*.

If your app doesn't take advantage of this overlap, then you won't see such performance benefits. For example:

   MPI_Isend(buffer, ..., &req);
   MPI_Wait(&req, ...);

Technically, the above uses ISEND and WAIT... but it's actually probably going to be *slower* than using MPI_SEND because you've made multiple function calls with no additional work between the two -- so the app didn't effectively overlap the communication with any local computation. Hence: no performance benefit.

> So what's the best optimization ? Do we have to use temporary message copy or not and if yes, what's the case for ?

As you can probably see from my text above, the answer is: it depends. :-)

Jeff Squyres