I've been trying to get overlapping computation and data transfer to
work, without much success so far. What I'm trying to achieve is:
* Post a nonblocking send (30 MB of data)
1) post a nonblocking receive
2) do local work while the data is being received
3) complete the transfer posted in 1) (MPI_Wait)
4) use the received data
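In code, the receiving side of this pattern looks roughly like the
sketch below (the buffer size and the do_local_work() helper are just
illustrative, not actual code from my application):

#include <mpi.h>
#include <stdlib.h>

#define MSG_BYTES (30 * 1024 * 1024)

void do_local_work(void);   /* placeholder for the local computation in 2) */

void receive_with_overlap(int src_rank)
{
    char *buf = malloc(MSG_BYTES);
    MPI_Request req;

    /* 1) post the nonblocking receive */
    MPI_Irecv(buf, MSG_BYTES, MPI_BYTE, src_rank, 0, MPI_COMM_WORLD, &req);

    /* 2) local work, hoping the transfer makes progress in the background */
    do_local_work();

    /* 3) complete the transfer */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    /* 4) use the received data, then clean up */
    free(buf);
}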
So, in my first test with a 30 MB message, if I did nothing at
point 2) above, completing the transfer in 3) took about 0.8 s.
In my second test, I simply put a sleep(3) at point 2), and expected
the MPI_Wait() call at 3) to finish almost instantly, since I assumed
that the message would have been transferred during the sleep. To my
disappointment though, the MPI_Wait took more or less the same time to
finish as it did without the sleep.
After browsing the forums, I realized that to make any communication
progress for this kind of large message, I usually need to block in
MPI_Wait or repeatedly call MPI_Test. I guess that makes sense.
So, my question is, how would you get around this and achieve optimal
overlap of computation and communication?
Would you try to intersperse the local work code in 2) with calls to
MPI_Test()? If yes, how frequently would these calls have to be made?
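For concreteness, here is one way I imagine the MPI_Test variant could
look, assuming the local work can be split into chunks (do_work_chunk()
is a hypothetical helper, and the right chunk size is exactly what I'm
unsure about):

#include <mpi.h>

void do_work_chunk(int i);   /* hypothetical: one slice of the local work */

void work_while_testing(MPI_Request *req, int nchunks)
{
    int done = 0;

    for (int i = 0; i < nchunks; i++) {
        do_work_chunk(i);
        if (!done) {
            /* each MPI_Test call gives the library a chance to progress
               the pending transfer */
            MPI_Test(req, &done, MPI_STATUS_IGNORE);
        }
    }

    /* make sure the transfer really is complete before using the data */
    MPI_Wait(req, MPI_STATUS_IGNORE);
}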
Another possible solution that comes to mind is to spawn a separate
thread that does an MPI_Wait(). With Open MPI over Ethernet, would
that mean that the MPI_Wait thread would busy-loop, and thus steal up
to 50% of the CPU from the main thread doing the local computation?
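Something like the following is what I have in mind for the thread-based
approach (just a sketch; it assumes MPI was initialized with
MPI_Init_thread requesting MPI_THREAD_MULTIPLE, and whether the waiting
thread busy-polls is exactly my question):

#include <mpi.h>
#include <pthread.h>

struct wait_arg { MPI_Request *req; };

static void *wait_thread(void *p)
{
    struct wait_arg *a = p;
    MPI_Wait(a->req, MPI_STATUS_IGNORE);   /* blocks until the transfer completes */
    return NULL;
}

void overlap_with_thread(MPI_Request *req, void (*local_work)(void))
{
    pthread_t tid;
    struct wait_arg a = { req };

    pthread_create(&tid, NULL, wait_thread, &a);
    local_work();               /* main thread computes in the meantime */
    pthread_join(&tid, NULL);   /* transfer is guaranteed complete here */
}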
Lots of questions, but I think this is a pretty common scenario.
Still, after a lot of browsing, I haven't been able to find any clear
answers.