You would break the MPI_Irecv and MPI_Isend calls up into two parts:  MPI_Send_init and MPI_Recv_init in the first part and MPI_Start[all] in the second part.  The first part needs to be moved out of the subroutine... at least outside of the loop in sub1() and maybe even outside the 10000-iteration loop in the main program.  (There would also be MPI_Request_free calls that would similarly have to be moved out.)  If the overheads are small compared to the other work you're doing per message, the savings would be small.  (And, I'm guessing this is the case for you.)  Further, the code refactoring might not be simple.  So, persistent communications *might* not be a fruitful optimization strategy for you.  Just a warning.


Well! If I follow this strategy then the picture should be as follows. Correct??
Obviously the sub1 and sub2 exists outside separately. Following is just for understanding.


Main program starts------@@@@@@@@@@@@@@@@@@@@@@@.

CALL MPI_RECV_INIT for each neighboring process 
CALL MPI_SEND_INIT for each neighboring process

Loop Calling the subroutine1--------------------(10000 times in the main program).

Call subroutine1

Subroutine1 starts===================================
   Loop A starts here >>>>>>>>>>>>>>>>>>>> (three passes)
   Call subroutine2

   Subroutine2 starts----------------------------
         Pick local data from array U in separate arrays for each neighboring processor
         CALL MPI_STARTALL
         -------perform work that could be done with local data
         CALL MPI_WAITALL( )
         -------perform work using the received data
   Subroutine
2 ends----------------------------

         -------perform work to update array U
   Loop A ends here >>>>>>>>>>>>>>>>>>>>
Subroutine1 ends====================================

Loop Calling the subroutine1 ends------------(10000 times in the main program).

CALL MPI_Request_free( )

Main program ends------@@@@@@@@@@@@@@@@@@@@@@@.


But I think in the above case sending and receiving buffers would need to be create in GLOBAL Module , or need to be passed in the subroutine headers. In above there is one confusion. The sending buffer will be present in the argument list of the MPI_SEND_INIT() but it will get the values to be sent in the sub2? Is it possible/correct?

The question is that, will above actually be communication efficient and over-lapping communication-computation.  

best regards,
AA