Yes, you're seeing more-or-less the expected behavior. It's a complicated issue.
Short version: you might want to sprinkle MPI_Test calls throughout your compute stage to get true overlap.
More detail: MPI implementations typically use a "rendezvous" protocol for large messages, meaning that the sender first sends only a small fragment to the peer announcing the message (its communicator, tag, and source peer). When the receiver actually posts a matching receive, it sends back an ACK to the sender saying, "Ok, I have the buffer available now -- send the rest of the message".
So when you initiate a large send, the receiver still has to match that short initial frag, send back the ACK, and then the sender has to send the rest of the message. I.e., the MPI layer has to be involved on both sides a few more times. With a single-threaded MPI implementation like Open MPI, this means you need to dip into the MPI layer to keep the progress going.
This is currently true even with RDMA/hardware offload technologies. So even though the bulk of the message transfer is offloaded to the NIC hardware, OMPI won't even initiate that bulk transfer until the ACK has been received.
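To make the handshake concrete, here is a toy model in plain C. It is only a sketch and makes no MPI calls: toy_isend, toy_test, toy_wait and run_overlap are names I made up for illustration, and the three states are a simplification, not Open MPI's real state machine. It also assumes the peer always answers promptly, so the only thing holding up the transfer is how often the sender enters the library. toy_test stands in for MPI_Test and toy_wait for MPI_Wait.

```c
#include <stdbool.h>

/* Toy model of a rendezvous send (hypothetical; not Open MPI internals). */
typedef enum { RTS_SENT, ACK_RECEIVED, BULK_DONE } send_state;

typedef struct {
    send_state state;
} toy_request;

/* Stands in for MPI_Isend of a large message:
 * only the small announcement fragment goes out. */
void toy_isend(toy_request *req) {
    req->state = RTS_SENT;
}

/* Stands in for MPI_Test: one trip into the library advances the
 * protocol by at most one step. Returns true once the send is complete. */
bool toy_test(toy_request *req) {
    if (req->state == RTS_SENT) {
        req->state = ACK_RECEIVED;   /* notice the receiver's ACK */
    } else if (req->state == ACK_RECEIVED) {
        req->state = BULK_DONE;      /* start and finish the bulk transfer */
    }
    return req->state == BULK_DONE;
}

/* Stands in for MPI_Wait: keep progressing until the send is complete.
 * Returns how many protocol steps were still left to do. */
int toy_wait(toy_request *req) {
    int steps = 0;
    while (req->state != BULK_DONE) {
        toy_test(req);
        steps++;
    }
    return steps;
}

/* Compute in `chunks` pieces, entering the library every `poll_every`
 * chunks (0 means never). Returns the work left over for the final wait. */
int run_overlap(int chunks, int poll_every) {
    toy_request req;
    toy_isend(&req);
    for (int i = 1; i <= chunks; i++) {
        /* ... one chunk of real computation would go here ... */
        if (poll_every > 0 && i % poll_every == 0) {
            toy_test(&req);
        }
    }
    return toy_wait(&req);
}
```

In this model, run_overlap(100, 0) leaves both protocol steps to the final wait, which matches the profile you're seeing, while run_overlap(100, 10) leaves none.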
In a perfect MPI implementation, you can do exactly what you said -- MPI_Isend a large message and eventually an MPI_Wait, and the MPI_Wait basically does very little except notice that the transfer is already done.
However, this is engineering/reality -- there's always a tradeoff.
You can, for example, increase OMPI's threshold between "small" and "large" and consider everything to be a "small" message -- meaning that they would be sent eagerly, and not via a rendezvous protocol (and therefore you have a much better chance of MPI_Isend/MPI_Wait doing more of what you expect). But this tends to consume more buffering at the receiver.
On Mar 7, 2014, at 9:49 AM, Velickovic Nikola <nikola.velickovic_at_[hidden]> wrote:
> Dear all,
> I have a simple MPI program with two processes using non-blocking communication illustrated below:
> process 0:       process 1:
> MPI_Isend        MPI_Irecv
> compute stage    compute stage
> MPI_Wait         MPI_Wait
> Actual communication is performed by offloading it to another thread, or using DMA (KNEM module is used for this).
> Ideally what should happen is that process 0 issues a non-blocking send, process 1 receives the data
> and in the meantime (in parallel) the CPU cores where the processes run are doing the compute stage.
> When compute stage is completed calling MPI_Wait wraps up the communication.
> When I profile my application it turns out that actual communication is initiated with MPI_Wait (significant amount of time is spent there) and hence disables overlapping
> communication and computation since MPI_Wait is called after the compute stage.
> Computation in my test case takes more time than communication so MPI_Wait should not be consuming significant amount of time since the communication should be over by then.
> This I confirmed also by using MPI_Test instead of MPI_Wait.
> MPI_Test has the same effect as MPI_Wait (to the best of my knowledge) but is non-blocking.
> When placing MPI_Test strategically in the compute stage it initiates the communication and a certain communication-computation overlap is achieved.
> Could you please shed some light for me if I am doing something wrong with the library?
> Is it the way it should behave (MPI_Wait initiates the actual transfer)?
> How to achieve communication-computation overlap?