On Thu, Jun 4, 2009 at 2:54 PM, Lars Andersson <larsand_at_[hidden]> wrote:
> Hi Gus,
> Thanks for the suggestion. I've been thinking along those lines, but
> it seems to have drawbacks. Consider the following MPI conversation:
> Time NODE 1 NODE 2
> 0 local work local work
> 1 post n-b recv local work
> 2 local work post n-b send
> 3 complete recv in 1 local work
Sorry, that formatting didn't come out very well. Another attempt:
Time  NODE 1               NODE 2
0     local work           local work
1     post n-b recv        local work
2     local work           post n-b send
3     complete recv in 1   local work
Hopefully you get the idea...
> In an ideal implementation, NODE 1 would be able to go back to local
> work immediately after posting a non-blocking receive at t=1.
> If using blocking message passing for the initial header, NODE 1 would
> have to block at least until t=2, when NODE 2 sends the corresponding
> message header. NODE 1 can then go on doing local work while the main
> message data is being transferred, but it still wastes 1 time unit
> waiting for a message header to arrive.
> Is there some clever way around this? Am I missing something?
> On Thu, Jun 4, 2009 at 2:34 PM, Lars Andersson <larsand_at_[hidden]> wrote:
>> Hi Lars
>> I wonder if you could always use blocking message passing on the
>> preliminary send/receive pair that transmits the message size/header,
>> then use non-blocking mode for the actual message.
>> If the "message size/header" part transmits a small buffer,
>> the preliminary send/recv pair will use the "eager" communication mode,
>> return quickly, and may not reduce performance, I would guess.
>> For a group of several messages the preliminary
>> send/recv pair could transmit a small (to ensure "eager mode")
>> array of message sizes,
>> maybe along with the message tags and sender ranks,
>> instead of only one size.
>> Just a thought.
>> Gus Correa
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> Lars Andersson wrote:
>>> I'm trying to solve a problem of passing serializable, arbitrarily
>>> sized objects around using MPI and non-blocking communication. The
>>> problem I'm facing is what to do at the receiving end when expecting
>>> an object of unknown size, but at the same time not block on waiting
>>> for it.
>>> When using blocking message passing, I have simply solved the problem
>>> by first sending a small, fixed-size header containing the size of
>>> the rest of the data, which is sent in the following MPI message. When
>>> using non-blocking message passing, this doesn't seem to be such a
>>> good idea, since we can't post the main data transfer until we have
>>> received the message header... It seems to take away most of the
>>> advantages of non-blocking I/O in the first place.
>>> I've been thinking about solving this using MPI_Probe / MPI_Iprobe,
>>> but I'm worried about performance.
>>> Question 1:
>>> Will MPI_Probe or the underlying MPI implementation actually receive
>>> the full message data (assuming a reasonably sized message, like less
>>> than 10MB) before MPI_Probe returns? Or will there be a significant
>>> data transfer delay (for large messages) when calling MPI_Recv after a
>>> successful MPI_Probe?
>>> What I want is something like this:
>>> 1) post one or several non-blocking, variable-sized message receives
>>> 2) do other, non-MPI work, while any incoming messages will be fully
>>>    received into buffers on the local machine.
>>> 3) perform completion of the receives posted in 1). I don't want to
>>>    wait here for data transfers that could have taken place during 2).
>>> I can't post non-blocking MPI_Irecv() calls in 1, because I don't know
>>> the sizes of incoming messages.
>>> If I simply do nothing in 1, and call MPI_Probe in 3, I'm worried that
>>> I won't get nice compute/transfer overlap because the messages won't
>>> actually be received locally until I post a Probe or Recv in 3.
>>> Question 2:
>>> How can I achieve the communication sequence described in 1,2,3 above,
>>> with overlapping data transfer and local computation during 2?
>>> Question 3:
>>> A temporary kludge solution to the problem above might be to allocate
>>> a temporary receive buffer of some arbitrary, constant maximum size
>>> BUFSIZE in 1 for each non-blocking receive operation, make sure
>>> messages sent are not larger than BUFSIZE, and post MPI_Irecv(buffer,
>>> BUFSIZE,...) calls in 1. I haven't been able to figure out if it's
>>> actually correct and portable to receive less data than specified in
>>> the count argument to MPI_Irecv.
>>> What if the message sent on the other end is 10 bytes, and
>>> BUFSIZE=count=20? Would that be OK?
>>> If anyone can shed any light on this, I'd be grateful. FYI, we're
>>> using a cluster of 2-8 core x86-64 machines running Linux and
>>> connected using ordinary 1 Gbit Ethernet.
>>> Best regards,
>>> Lars Andersson