From: Eugene Loh <Eugene.Loh@sun.com>
To: Open MPI Users <users@open-mpi.org>
Sent: Thursday, 6 November, 2008 18:08:26
Subject: Re: [OMPI users] Progress of the asynchronous messages

vladimir marjanovic wrote:
I am a new user of Open MPI; I've used MPICH before.

There is a performance bug in the following scenario:

proc_B:  MPI_Isend(...,proc_A,..,&request);
                do{
                  sleep(1);                           /* stands in for computation */
                  MPI_Test(&request,&flag,&status);   /* request comes first, then flag */
                  count++;
                }while(!flag);

proc_A: MPI_Recv(...,proc_B);

For a message size of 8 MB, proc_B calls MPI_Test 88 times. That means the point-to-point communication takes about 88 seconds.
Btw, bandwidth isn't the problem (the interconnect is InfiniBand).

Obviously, there is a problem with the progress of asynchronous messages.

How can I avoid this problem?
I'm no expert, but I think the problem is that the send is being "progressed" (advanced) only during MPI calls and MPI_Test doesn't progress/advance the message very aggressively.  The message is probably being decomposed into chunks and MPI_Test will advance the message at most one chunk at a time.  So:

1) You could decrease the time between MPI_Test calls.
2) You could block (e.g., with MPI_Wait).

It's a tough tradeoff to make.  That's bad news... but do you want OMPI to be making the tough choices here for you?  Let's say the sending process sends a chunk and it takes a little while for the receiver to process data and make room for you to send some more.  During that waiting time, should the sender return control to the user application, or stay blocked inside of MPI_Test?

Anyhow, I believe that's the issue here.

In order to overlap communication and computation, I don't want to use MPI_Wait. The message is surely being decomposed into chunks, and the chunk size is probably defined by an environment variable.
Do you know how I can control the chunk size?
Thanks

Vladimir