On Mon, 2014-03-24 at 07:59 -0700, Ralph Castain wrote:
> I suspect the root cause of the problem here lies in how MPI messages are progressed. OMPI doesn't have an async progress method (yet), and so messaging on both send and recv ends is only progressed when the app calls the MPI library. It sounds like your app issues an isend or recv, and then spends a bunch of time computing before calling back into the MPI library again. If so, then the messaging can't progress during the time you are computing.
I switched from Isend to Send, but only for the messages that were
hanging up. This performs well, which is a little surprising since the
original implementation, which used Send everywhere, performed poorly.
I think the key is that the processes that are mostly busy computing
now do the blocking Send, while messages inbound to them still arrive
via Isend, so the sender doesn't have to wait for the target to become
available.
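For the archives, here is the effect in a toy model (plain Python, no
real MPI; the class and timings are my own invention): delivery of an
isend stalls until the next call into the library, while a blocking
send completes before it returns.

```python
import time

class ToyMPI:
    """Toy model of the lack of async progress: an isend's data moves
    only while the application is inside the 'library'. Not real MPI;
    names and numbers are made up for illustration."""
    def __init__(self):
        self.pending = []    # (payload, submit_time) awaiting progress
        self.latency = {}    # payload -> delivery delay in seconds

    def isend(self, payload):
        # returns immediately; the transfer waits for the next library call
        self.pending.append((payload, time.monotonic()))

    def send(self, payload):
        # blocking send: we stay in the library until the data has moved
        self.latency[payload] = 0.0

    def progress(self):
        # models any MPI call (Recv, Testsome, ...) re-entering the library
        now = time.monotonic()
        for payload, t in self.pending:
            self.latency[payload] = now - t
        self.pending.clear()

lib = ToyMPI()
lib.isend("result-A")   # returns instantly...
time.sleep(0.2)         # ...but while the simulator computes, nothing moves
lib.progress()          # next MPI call: result-A finally goes out
lib.send("result-B")    # blocking Send: delivered before returning
print(lib.latency)      # result-A delayed ~0.2 s, result-B not at all
```

This is why moving the busy processes to blocking Send helped: the
delay stopped depending on how long they computed between MPI calls.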
Thanks for your help.
> On Mar 22, 2014, at 2:44 PM, Bennet Fauber <bennet_at_[hidden]> wrote:
> > Hi, Ross,
> > Just out of curiosity, is Rmpi required for some package that you're
> > using? I only ask because, if you're mostly writing your own MPI
> > calls, you might want to look at pbdR/pbdMPI, if you haven't already.
> > They also have pbdPROF, which should be able to do some profiling
> > of the MPI calls.
> > http://rbigdata.github.io/packages.html
> > I wasn't sure whether this was really on topic for the list, so I sent
> > it privately. Sorry for the extra noise if you've already eliminated
> > pbdR as a possibility.
> > -- bennet
> > On Sat, Mar 22, 2014 at 3:46 PM, Ross Boylan <ross_at_[hidden]> wrote:
> >> I have a bunch of simulators communicating results to a single
> >> assembler. The results seem to take a long time to be received, and the
> >> delay increases as the system runs. Here are some results:
> >>      sent  received    delay
> >>    70.679    94.776   24.097
> >>    94.677   144.906   50.229
> >>   122.082   238.713  116.631
> >>   144.785   313.101  168.316
> >>   167.918   350.037  182.119
> >>   190.709   384.342  193.633
> >> Times are wall clock times in seconds since process launch, so there
> >> may be some skew between sender and receiver, but it will be
> >> consistent. (This tracks only sends from one simulator and ignores
> >> later sends that never arrived; my completion logic needs work.)
> >> The results are typically 500kB. Sending is via Isend (non-blocking)
> >> and receiving via Recv (blocking). The simulators spend most of their
> >> time computing; in particular there may be significant delays, e.g.,
> >> from 10 seconds to a minute, between calls to mpi (typically a mix of
> >> Isend, Recv, and Testsome). All processes are on the same machine (for
> >> now).
> >> The interval between the assembler's receives (from all sources) is
> >> sometimes quite brief, under 2 seconds, and quite variable. Neither
> >> observation is consistent with the theory that the receiver is
> >> saturated receiving messages, each of which takes a long time to
> >> transmit (I mean the active part of the transmission, when bits are
> >> flowing). I infer from this that actually transmitting the message does
> >> not take long, and that the longer gaps between receives have some other
> >> cause.
> >> This is all from R, and the problem might lie with higher level code.
> >> Can anyone explain what is going on, and what I might do to alleviate
> >> it?
> >> My speculation is that the necessary handshaking can only take place
> >> while both processes have called MPI, or perhaps some particular calls
> >> are required. The assembler spends most of its time executing a
> >> receive, but the simulators are mostly busy with other stuff. And so I
> >> suspect the delay is with the simulators, though I'm not sure what to do
> >> about it. I could wait on completion from the sender, but that kind of
> >> defeats the purpose of doing an isend.
> >> In a related thread about a similar issue, Jeff Squyres wrote
> >> (http://www.open-mpi.org/community/lists/users/2011/07/16928.php)
> >> ----------------------------------------------------
> >> If so, it's because Open MPI does not do background progress on
> >> non-blocking sends in all cases. Specifically, if you're sending over
> >> TCP and the message is "long", the OMPI layer in the master doesn't
> >> actually send the whole message immediately because it doesn't want to
> >> unexpectedly consume a lot of resources in the slave. So the master
> >> only sends a small fragment of the message and the communicator,tag
> >> tuple suitable for matching at the receiver. When the receiver posts a
> >> corresponding MPI_Recv (time=C), it sends back an ACK to the master,
> >> enabling the master to send the rest of the message.
> >> However, since OMPI doesn't support background progress in all
> >> situations, the master doesn't see this ACK until it goes into the MPI
> >> progression engine -- i.e., when you call MPI_Recv() at Time=E. Then
> >> the OMPI layer in the master sees the ACK and sends the rest of the
> >> message.
> >> ----------------------------------------------------------------
> >> I'm not sending over tcp (yet) but maybe I'm running into something
> >> similar.
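If it helps later readers: the handshake Jeff describes can be
condensed into a toy model (my own sketch, plain Python with no real
MPI; the 4 kB eager cutoff is an invented stand-in, and this is not
how OMPI is actually structured internally).

```python
# Toy sketch of the eager/rendezvous handshake: short messages go out
# immediately; long ones send only an envelope, and the body waits for
# an ACK that the sender notices only on its next library call.
EAGER_LIMIT = 4096  # invented cutoff for illustration

class Sender:
    def __init__(self):
        self.body = None       # long-message payload held back at the sender
        self.ack_seen = False

    def isend(self, payload, receiver):
        if len(payload) <= EAGER_LIMIT:
            receiver.deliver(payload)          # short: sent eagerly, done
        else:
            receiver.envelope = len(payload)   # fragment + matching info only
            self.body = payload

    def enter_library(self, receiver):
        # progress happens only inside MPI calls: see the ACK, ship the body
        if self.ack_seen and self.body is not None:
            receiver.deliver(self.body)
            self.body = None

class Receiver:
    def __init__(self):
        self.envelope = None
        self.inbox = []

    def recv(self, sender):
        # posting the matching Recv sends the ACK back to the sender
        if self.envelope is not None:
            sender.ack_seen = True

    def deliver(self, payload):
        self.inbox.append(payload)

s, r = Sender(), Receiver()
s.isend("x" * 500_000, r)  # ~500 kB result: takes the rendezvous path
r.recv(s)                  # assembler posts Recv; ACK reaches the sender...
assert r.inbox == []       # ...but the body still hasn't moved
s.enter_library(r)         # simulator's next MPI call: body finally flows
print(len(r.inbox[0]))     # 500000
```

In this model the message sits at the sender for however long the
simulator computes between MPI calls, which matches the growing delays
in the table above.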
> >> I had thought the MPI stuff was handled in a separate layer or
> >> thread that would magically do all the work of moving messages
> >> around; the fact that top shows all the CPU going to the R processes
> >> suggests that's not the case.
> >> Running OMPI 1.7.4.
> >> Thanks for any help.
> >> Ross Boylan
> >> _______________________________________________
> >> users mailing list
> >> users_at_[hidden]
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users