Thanks for the suggestion. I've been thinking along those lines, but
it seems to have drawbacks. Consider the following MPI conversation:
Time  NODE 1               NODE 2
0     local work           local work
1     post n-b recv        local work
2     local work           post n-b send
3     complete recv in 1   local work
In an ideal implementation, NODE 1 would be able to go back to local
work immediately after posting a non-blocking receive at t=1.
With a blocking receive for the initial header, NODE 1 would
have to block at least until t=2, when NODE 2 sends the corresponding
message header. NODE 1 can then go on doing local work while the main
message data is being transferred, but it still wastes 1 time unit
waiting for the message header to arrive.
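To make the cost concrete, here is a toy model of the timeline above. It is
plain Python with no MPI calls; the function name idle_time and its
parameters are only my illustration, and it assumes the header arrives the
instant it is sent (no network latency).

```python
def idle_time(recv_posted_at, header_sent_at, blocking_header):
    """Time units NODE 1 spends waiting instead of doing local work.

    Hypothetical helper: models only the header exchange, with zero
    network latency assumed.
    """
    if not blocking_header:
        # Ideal non-blocking receive: posting returns immediately.
        return 0
    # A blocking header receive cannot return before the header is sent.
    return max(0, header_sent_at - recv_posted_at)


# Timeline from the table: receive posted at t=1, send posted at t=2.
print(idle_time(1, 2, blocking_header=False))  # ideal non-blocking case
print(idle_time(1, 2, blocking_header=True))   # blocking header receive
```

The wasted time grows with the gap between the two posts, so the penalty
is worst exactly when the sender is busy with its own local work.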
Is there some clever way around this? Am I missing something?
On Thu, Jun 4, 2009 at 2:34 PM, Lars Andersson <larsand_at_[hidden]> wrote:
> Hi Lars
> I wonder if you could always use blocking message passing on the
> preliminary send/receive pair that transmits the message size/header,
> then use non-blocking mode for the actual message.
> If the "message size/header" part transmits a small buffer,
> the preliminary send/recv pair will use the "eager" communication mode,
> return quickly, and may not reduce performance, I would guess.
> For a group of several messages the preliminary
> send/recv pair could transmit a small (to ensure "eager mode")
> array of message sizes,
> maybe along with the message tags and sender ranks,
> instead of only one size.
> Just a thought.
> Gus Correa
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> Lars Andersson wrote:
>> I'm trying to solve a problem of passing serializable, arbitrarily
>> sized objects around using MPI and non-blocking communication. The
>> problem I'm facing is what to do at the receiving end when expecting
>> an object of unknown size, but at the same time not block on waiting
>> for it.
>> When using blocking message passing, I have simply solved the problem
>> by first sending a small, fixed size header containing the size of
>> the rest of the data, sent in the following MPI message. When using
>> non-blocking message passing, this doesn't seem to be such a good
>> idea, since we can't post the main data transfer until we have received
>> the message header... It seems to take away most of the advantages of
>> non-blocking I/O in the first place.
>> I've been thinking about solving this using MPI_Probe / MPI_Iprobe,
>> but I'm worried about performance.
>> Question 1:
>> Will MPI_Probe or the underlying MPI implementation actually receive
>> the full message data (assuming a reasonably sized message, like less
>> than 10MB) before MPI_Probe returns? Or will there be a significant
>> data transfer delay (for large messages) when calling MPI_Recv after a
>> successful MPI_Probe?
>> What I want is something like this:
>> 1) post one or several non-blocking, variable sized message receives
>> 2) do other, non-MPI work, while any incoming messages will be fully
>> received into buffers on the local machine.
>> 3) perform completion of the receives posted in 1). I don't want to
>> wait here for data transfers that could have taken place during 2).
>> I can't post non-blocking MPI_Irecv() calls in 1, because I don't know
>> the sizes of incoming messages.
>> If I simply do nothing in 1, and call MPI_Probe in 3, I'm worried that
>> I won't get nice compute/transfer overlap because the messages won't
>> actually be received locally until I post a Probe or Recv in 3.
>> Question 2:
>> How can I achieve the communication sequence described in 1,2,3 above,
>> with overlapping data transfer and local computation during 2?
>> Question 3:
>> A temporary kludge solution to the problem above might be to allocate
>> a temporary receive buffer of some arbitrary, constant maximum size
>> BUFSIZE in 1 for each non-blocking receive operation, make sure
>> messages sent are not larger than BUFSIZE, and post MPI_Irecv(buffer,
>> BUFSIZE,...) calls in 1. I haven't been able to figure out if it's
>> actually correct and portable to receive less data than specified in
>> the count argument to MPI_Irecv.
>> What if the message sent from the other end is 10 bytes and
>> BUFSIZE=count=20? Would that be OK?
>> If anyone can shed any light on this, I'd be grateful. FYI, we're
>> using a cluster of 2-8 core x86-64 machines running Linux and
>> connected using ordinary 1 Gbit Ethernet.
>> Best regards,
>> Lars Andersson