Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] MPI Persistent Communication Question
From: Eugene Loh (eugene.loh_at_[hidden])
Date: 2010-06-28 15:33:51

amjad ali wrote:
You would break the MPI_Irecv and MPI_Isend calls up into two parts:  MPI_Send_init and MPI_Recv_init in the first part and MPI_Start[all] in the second part.  The first part needs to be moved out of the subroutine... at least outside of the loop in sub1() and maybe even outside the 10000-iteration loop in the main program.  (There would also be MPI_Request_free calls that would similarly have to be moved out.)  If the overheads are small compared to the other work you're doing per message, the savings would be small.  (And, I'm guessing this is the case for you.)  Further, the code refactoring might not be simple.  So, persistent communications *might* not be a fruitful optimization strategy for you.  Just a warning.

Well! If I follow this strategy then the picture should be as follows. Correct??
Yes, I think that's right.
Obviously sub1 and sub2 exist separately, outside the main program. The following is just for understanding.

Main program starts------@@@@@@@@@@@@@@@@@@@@@@@.

CALL MPI_RECV_INIT for each neighboring process 
CALL MPI_SEND_INIT for each neighboring process

Loop Calling the subroutine1--------------------(10000 times in the main program).

Call subroutine1

Subroutine1 starts===================================
   Loop A starts here >>>>>>>>>>>>>>>>>>>> (three passes)
   Call subroutine2

   Subroutine2 starts----------------------------
         Pick local data from array U in separate arrays for each neighboring processor
         CALL MPI_STARTALL( )
         -------perform work that could be done with local data
         CALL MPI_WAITALL( )
         -------perform work using the received data
   Subroutine2 ends----------------------------

         -------perform work to update array U
   Loop A ends here >>>>>>>>>>>>>>>>>>>>
Subroutine1 ends====================================

Loop Calling the subroutine1 ends------------(10000 times in the main program).

CALL MPI_Request_free( )

Main program ends------@@@@@@@@@@@@@@@@@@@@@@@.

But I think in the above case the send and receive buffers would need to be created in a global module, or passed in the subroutine argument lists.
Right.  The buffer information is needed both outside of all the loops (in MAIN, where the persistent channels are created) and in the innermost loop (in subroutine 2, where the buffers are loaded and used).
In the outline above, one thing confuses me. The send buffer appears in the argument list of MPI_SEND_INIT(), but it only gets the values to be sent in sub2. Is that possible/correct?
Yes.  The buffer needs to be used by the user program to set the send message up and to use the data that has been received.  The buffer also needs to be specified to the MPI implementation so that MPI knows which buffers to send/receive.  With a persistent communication, you specify the buffer in the "init" call and thereafter refer to it opaquely with the "request" handle.  Incidentally, this can cause problems for optimizing compilers, which may not recognize there is a relationship between a buffer and the opaque request handle.  Consider the "extreme possibility" described in
The question is whether the above will actually be communication-efficient and overlap communication with computation.
There are two issues, I think.

One is whether persistent communications will help you reduce overheads.  It depends, but if for each message you do a bunch of work (packing buffers, computing on data, or even just having lots of data per message), then the amount of overhead you're saving may be relatively small.

Another is whether you can overlap communications and computation.  This does not require persistent channels, but only nonblocking communications (MPI_Isend/MPI_Irecv).  Again, there are no MPI guarantees here, so you may have to break your computation up and insert MPI_Test calls.

You may want to get the basic functionality working first and then run performance experiments to decide whether these really are areas that warrant such optimizations.