Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] sm BTL flow management
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-06-25 16:10:29


Eugene Loh wrote:

> If you look in mca_btl_sm_component_progress, when a process receives
> a message fragment and returns it to the sender, it executes code like
> this:
>
>     goto recheck_peer;
>     break;
>
> Okay, the reason I show you that code is because a static code checker
> should easily identify the break statement as dead code. It'll never
> be reached. Anyhow, in English, what's happening is if you receive a
> message fragment, you keep polling your FIFO. So, consider the case
> of half-duplex point-to-point traffic: one process only sends and the
> other process only receives. Previously, this would eventually hang.
> Now, it won't. But (I haven't confirmed 100% yet), I don't think it
> executes very pleasantly. E.g., if you have
>
> for ( i = 0; i < N; i++ ) {
>     if ( me == 0 ) MPI_Send(...);
>     if ( me == 1 ) MPI_Recv(...);
> }
>
> At some point, the receiver falls hopelessly behind. The sender keeps
> pumping messages and the receiver keeps polling its FIFO, pulling in
> messages and returning fragments to the sender so that the sender can
> keep on going. Problem is, all that is happening within one MPI_Recv
> call... which in a test code might be pulling in 100Ks of messages.
> The MPI_Recv call won't return until the sender lets up. Then, the
> rest of the MPI_Recv calls will execute, all pulling messages out of
> the local unexpected-message queue.
>
> Not sure yet how I want to manage this. The bottom line might be that
> if the MPI application has no flow control, the underlying MPI
> implementation is going to have to do something that won't make
> everyone happy. Oh well. At least the program makes progress and
> completes in reasonable time.
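
(For concreteness, here is the loop above fleshed out into a complete
program. The message count, payload size, and tag are placeholder
choices rather than values from any particular test, and exactly two
ranks are assumed.)

    /* Hypothetical reconstruction of the half-duplex pattern above;
     * N, MSG_LEN, and the tag are placeholders.  Run with two ranks. */
    #include <mpi.h>

    #define N       100000   /* number of messages (placeholder) */
    #define MSG_LEN 64       /* payload size in ints (placeholder) */

    int main(int argc, char **argv)
    {
        int me, i;
        int buf[MSG_LEN] = {0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);

        for (i = 0; i < N; i++) {
            if (me == 0)
                MPI_Send(buf, MSG_LEN, MPI_INT, 1, 0, MPI_COMM_WORLD);
            if (me == 1)
                MPI_Recv(buf, MSG_LEN, MPI_INT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

Since a small send returns as soon as its fragment is queued, rank 0
keeps pumping messages while rank 1 falls behind, which is exactly the
scenario described above.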

I spoke with Brian and Jeff about this earlier today. Presumably, up
through 1.2, mca_btl_sm_component_progress would poll and, if it received
a message fragment, would return. Then, presumably in 1.3.0, the behavior was
changed to keep polling until the FIFO was empty. Brian said this was
based on Terry's desire to keep latency as low as possible in
benchmarks. Namely, reaching down into a progress call was a long code
path. It would be better to pick up multiple messages, if available on
the FIFO, and queue extras up in the unexpected queue. Then, a
subsequent call could more efficiently find the anticipated message
fragment.
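
Schematically (with made-up helper names, not the actual sm BTL
symbols), the difference between the two behaviors is something like:

    /* Sketch only: sm_fifo_read() and process_fragment() stand in for
     * the real internals.  process_fragment() either matches a pending
     * receive or queues the fragment on the unexpected-message queue. */
    typedef struct frag frag_t;
    extern frag_t *sm_fifo_read(void);            /* NULL when FIFO is empty */
    extern void    process_fragment(frag_t *frag);

    /* 1.2-era behavior: return after the first fragment found. */
    int progress_return_on_first(void)
    {
        frag_t *frag = sm_fifo_read();
        if (NULL == frag) return 0;
        process_fragment(frag);
        return 1;
    }

    /* 1.3.x behavior: drain the FIFO before returning; any fragments
     * beyond the one a pending receive wants go to the unexpected queue. */
    int progress_drain_fifo(void)
    {
        frag_t *frag;
        int count = 0;
        while (NULL != (frag = sm_fifo_read())) {
            process_fragment(frag);
            count++;
        }
        return count;
    }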

I don't see how the behavior would impact short-message pingpongs (the
typical way to measure latency) one way or the other.

I asked Terry, who struggled to remember the issue and pointed me at
this thread:
http://www.open-mpi.org/community/lists/devel/2008/06/4158.php . But
that is related to an issue that's solved if one keeps polling as long
as one gets ACKs (but returns as soon as a real message fragment is found).
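
In other words (again with invented names), the polling loop that would
address that issue looks roughly like:

    /* Sketch only: frag_is_ack() flags a fragment the peer is
     * returning to us; all helper names are made up. */
    typedef struct frag frag_t;
    extern frag_t *sm_fifo_read(void);
    extern int     frag_is_ack(frag_t *frag);
    extern void    reclaim_fragment(frag_t *frag);   /* back to free list */
    extern void    process_fragment(frag_t *frag);

    int progress_ack_aware(void)
    {
        frag_t *frag;
        int count = 0;
        while (NULL != (frag = sm_fifo_read())) {
            count++;
            if (frag_is_ack(frag)) {
                /* Our own fragment coming back: reclaim it and keep
                 * polling so a pure sender never starves for fragments. */
                reclaim_fragment(frag);
                continue;
            }
            /* Real message fragment: handle it and stop polling. */
            process_fragment(frag);
            break;
        }
        return count;
    }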

Can anyone shed some light on the history here? Why keep polling even
when a message fragment has been found? The downside of polling too
aggressively is that the unexpected queue can grow (without bounds).

Brian's proposal is to set some variable that determines how many
message fragments a single mca_btl_sm_component_progress call can drain
from the FIFO before returning.
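
Something like the following, where btl_sm_fifo_drain_limit is just a
hypothetical name for that variable (and the helpers are the same
invented stand-ins as above):

    /* Sketch only: a bounded drain of the FIFO per progress call. */
    typedef struct frag frag_t;
    extern frag_t *sm_fifo_read(void);
    extern void    process_fragment(frag_t *frag);

    static int btl_sm_fifo_drain_limit = 16;   /* placeholder default */

    int progress_bounded_drain(void)
    {
        frag_t *frag;
        int count = 0;
        while (count < btl_sm_fifo_drain_limit &&
               NULL != (frag = sm_fifo_read())) {
            process_fragment(frag);
            count++;
        }
        return count;   /* anything left waits for the next progress call */
    }

That would cap how much the unexpected queue can grow per call while
still letting a single call pick up several fragments when they are
available.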

Thanks for any discussion, insight, or historical recollections.