Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] trouble using --mca mpi_yield_when_idle 1
From: douglas.guptill_at_[hidden]
Date: 2008-12-18 14:37:06


Hello Jeff, Eugene:

On Fri, Dec 12, 2008 at 04:47:11PM -0500, Jeff Squyres wrote:

...<snip>...

> The "P" is MPI's profiling interface. See chapter 14 in the MPI-2.1
> doc.

Ah...Thank you, both Jeff and Eugene, for pointing that out.

I think there is a typo in chapter 14 - the first sentence isn't a
sentence - but that's another story.

> >Based on my re-read of the MPI standard, it appears that I may have
> >slightly mis-stated my issue. The spin is probably taking place in
> >"mpi_send". "mpi_send", according to my understanding of the MPI
> >standard, may not exit until a matching "mpi_recv" has been initiated,
> >or completed. At least that is the conclusion I came to.
>
> Perhaps something like this:
>
> int MPI_Send(...) {
>     MPI_Request req;
>     int flag;
>     PMPI_Isend(..., &req);
>     do {
>         nanosleep(short);
>         PMPI_Test(&req, &flag, MPI_STATUS_IGNORE);
>     } while (!flag);
> }
>
> That is, *you* provide MPI_Send and intercept all your app's calls to
> MPI_Send. But you implement it by doing a non-blocking send and
> sleeping and polling MPI to know when it's done. Of course, you don't
> have to implement this as MPI_Send -- you could always have
> your_func_prefix_send(...) instead of explicitly using the MPI
> profiling interface. But using the profiling interface allows you to
> swap in/out different implementations of MPI_Send (etc.) at link time,
> if that's desirable to you.
>
> Looping over sleep/test is not the most efficient way of doing it, but
> it may be suitable for your purposes.
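
Spelled out, a wrapper along those lines might look roughly like the
sketch below (a minimal sketch, assuming the standard MPI_Send
arguments and a fixed 1 ms poll interval; note that MPI-2-era headers
declare buf as plain void * rather than const void *):

  #include <mpi.h>
  #include <time.h>

  /* Intercept MPI_Send via the profiling interface: post a
   * non-blocking send with PMPI_Isend, then sleep/poll until it
   * completes instead of letting the library spin. */
  int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm)
  {
      MPI_Request req;
      int flag = 0;
      struct timespec ts = { 0, 1000000L };   /* 1 ms between polls */
      int rc = PMPI_Isend(buf, count, datatype, dest, tag, comm, &req);
      if (rc != MPI_SUCCESS)
          return rc;
      do {
          nanosleep(&ts, NULL);
          rc = PMPI_Test(&req, &flag, MPI_STATUS_IGNORE);
      } while (rc == MPI_SUCCESS && !flag);
      return rc;
  }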

Indeed, it is very suitable. Thank you, both Jeff and Eugene, for
pointing the way. That solution changes the load for my job from 2.0
to 1.0, as indicated by "xload" over a 40-minute run.

That means I can *double* the throughput of my machine.

Some gory details:

I ignored the suggestion to use MPI_STATUS_IGNORE, which got me into
some trouble, as you may not be surprised to hear. The fix was to use
MPI_Request_get_status instead of MPI_Test.

Since some of my waits (in both MPI_SEND and MPI_RECV) will be very
short, and some will be up to 4 minutes, I implemented a graduated
sleep time: it starts at 1 millisecond and doubles after each sleep,
up to a maximum of 100 milliseconds. Interestingly, when I left the
sleep time at a constant 1 millisecond, the run load went up
significantly, varying over the range 1.3 to 1.7.

I have attached my MPI_Send.c and MPI_Recv.c. Comments are welcome
and appreciated.
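
For reference, the send side boils down to something like this (an
illustrative sketch of the approach described above, not a copy of the
attached MPI_Send.c; the receive side is analogous, built on
PMPI_Irecv):

  #include <mpi.h>
  #include <time.h>

  /* Post the send with PMPI_Isend, then poll with
   * PMPI_Request_get_status, sleeping between polls.  The sleep
   * starts at 1 ms and doubles after each poll, capped at 100 ms,
   * so short waits stay responsive and long waits stay cheap. */
  int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm)
  {
      MPI_Request req;
      int flag = 0;
      long sleep_ns = 1000000L;          /* start at 1 ms */
      const long max_ns = 100000000L;    /* cap at 100 ms */
      int rc = PMPI_Isend(buf, count, datatype, dest, tag, comm, &req);
      if (rc != MPI_SUCCESS)
          return rc;
      for (;;) {
          rc = PMPI_Request_get_status(req, &flag, MPI_STATUS_IGNORE);
          if (rc != MPI_SUCCESS || flag)
              break;
          struct timespec ts = { 0, sleep_ns };
          nanosleep(&ts, NULL);
          if (sleep_ns < max_ns) {
              sleep_ns *= 2;
              if (sleep_ns > max_ns)
                  sleep_ns = max_ns;
          }
      }
      /* PMPI_Request_get_status does not free the request, so
       * complete it with PMPI_Wait once flag is set (it returns
       * immediately here because the send has already finished). */
      if (rc == MPI_SUCCESS)
          rc = PMPI_Wait(&req, MPI_STATUS_IGNORE);
      return rc;
  }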

Regards,
Douglas.

-- 
  Douglas Guptill                       
  Research Assistant, LSC 4640          email: douglas.guptill_at_[hidden]
  Oceanography Department               fax:   902-494-3877
  Dalhousie University
  Halifax, NS, B3H 4J1, Canada