Hi Jeff S.,
Thank you very much for your reply.
I am still somewhat confused and would appreciate your guidance.

 The idea is to do this:

   MPI_Recv_init()
   MPI_Send_init()
   for (i = 0; i < 1000; ++i) {
       MPI_Startall()
       /* do whatever */
       MPI_Waitall()
   }
   for (i = 0; i < nreqs; ++i) {   /* nreqs: number of persistent requests created above */
       MPI_Request_free()
   }

So in your inner loop, you just call MPI_Startall() and a corresponding MPI_Test* / MPI_Wait* call to complete those requests.

The idea is that the MPI_*_init() functions do some one-time setup on the requests and then you just start and complete those same requests over and over and over.  When you're done, you free them.

Actually, what my code currently does is this:

The main program calls subroutine-(1) 10000 times.

Subroutine-(1) starts===================================

   Loop A starts here >>>>>>>>>>>>>>>>>>>> (three passes)
   Call subroutine-(2)

   Subroutine-(2) starts----------------------------
         Copy local data from array U into a separate send array for each neighboring process
         CALL MPI_IRECV for each neighboring process
         CALL MPI_ISEND for each neighboring process

         -------perform work that could be done with local data
         CALL MPI_WAITALL( )
         -------perform work using the received data
   Subroutine-(2) ends----------------------------

         -------perform work to update array U
   Loop A ends here >>>>>>>>>>>>>>>>>>>>

Subroutine-(1) ends====================================

I assume that the above setup also overlaps computation with communication (hiding the communication behind the computation).
Now my intention is to use persistent communication to gain further efficiency, but I am unsure how to apply your proposed model to this structure. Please advise.

best regards,
AA.