
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] sm BTL flow management
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-06-25 03:06:23


Bryan Lally wrote:

> Ralph Castain wrote:
>
>> Be happy to put it through the wringer... :-)
>
> My wringer is available, too.

'kay. Try

hg clone ssh://www.open-mpi.org/~eloh/hg/pending_sends

which is r21498 but with changes to poll one's own FIFO more regularly
(e.g., even when just performing sends) and to retry pending sends more
aggressively (e.g., whenever about to try a send or whenever one calls
sm progress). I maintain a count of outstanding fragments (sent but not
yet returned to free list) and of pending sends (total over all queues)
to keep overheads down.
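
A minimal sketch of that bookkeeping follows. The struct and function names are hypothetical and are not taken from the sm BTL source. It also assumes the counters are used to gate the extra work: polling one's own FIFO only when fragments are outstanding, and walking the pending queues only when something is pending.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process counters; names are illustrative only. */
typedef struct {
    size_t outstanding_frags; /* sent, not yet returned to the free list */
    size_t pending_sends;     /* deferred sends, summed over all queues */
} flow_counters_t;

/* A fragment was written to a peer's FIFO. */
static inline void frag_sent(flow_counters_t *c) {
    c->outstanding_frags++;
}

/* A peer handed a fragment back to our free list. */
static inline void frag_returned(flow_counters_t *c) {
    assert(c->outstanding_frags > 0);
    c->outstanding_frags--;
}

/* A send could not be posted and was queued for a later retry. */
static inline void send_deferred(flow_counters_t *c) {
    c->pending_sends++;
}

/* A previously deferred send was finally posted. */
static inline void deferred_send_posted(flow_counters_t *c) {
    assert(c->pending_sends > 0);
    c->pending_sends--;
}

/* Assumed gating: one integer compare on the fast path instead of
 * polling the FIFO or scanning every pending queue unconditionally. */
static inline int should_poll_own_fifo(const flow_counters_t *c) {
    return c->outstanding_frags > 0;
}

static inline int should_retry_pending(const flow_counters_t *c) {
    return c->pending_sends > 0;
}
```

With counters like these, a send path that has nothing outstanding and nothing pending pays only for two comparisons, which is consistent with the unchanged 0-byte pingpong latency reported below.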

My various test codes (repeated Bcasts, half-duplex point-to-point
sends, etc.) all pass now. There is no perceptible degradation in
0-byte pingpong latency that I can tell. George's fixed-free-list
proposal may be better, but I'm making these bits available for some
soak and feedback.

Life is still not perfect. If you look in
mca_btl_sm_component_progress, when a process receives a message
fragment and returns it to the sender, it executes code like this:

     goto recheck_peer;
     break;

Okay, the reason I show you that code is because a static code checker
should easily identify the break statement as dead code. It'll never be
reached. Anyhow, in English, what's happening is if you receive a
message fragment, you keep polling your FIFO. So, consider the case of
half-duplex point-to-point traffic: one process only sends and the
other process only receives. Previously, this would eventually hang.
Now, it won't. But (I haven't confirmed this 100% yet) I don't think it
executes very pleasantly. E.g., consider

     for ( i = 0; i < N; i++ ) {
          if ( me == 0 ) MPI_Send(...);
          if ( me == 1 ) MPI_Recv(...);
     }

At some point, the receiver falls hopelessly behind. The sender keeps
pumping messages and the receiver keeps polling its FIFO, pulling in
messages and returning fragments to the sender so that the sender can
keep on going. Problem is, all that is happening within one MPI_Recv
call... which in a test code might be pulling in 100Ks of messages. The
MPI_Recv call won't return until the sender lets up. Then, the rest of
the MPI_Recv calls will execute, all pulling messages out of the local
unexpected-message queue.
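
The behavior described above can be illustrated with a toy model. This is not the sm BTL code. It is a single-threaded simulation with hypothetical names, and it assumes the worst case: the sender always manages to put another message into the FIFO before the receiver's next poll.

```c
#include <assert.h>
#include <stddef.h>

/* Toy state for one sender/receiver pair; all names are hypothetical. */
typedef struct {
    size_t sender_remaining; /* messages the sender has yet to inject */
    size_t fifo;             /* messages sitting in the receiver's FIFO */
    size_t unexpected;       /* drained but not yet matched by a receive */
} sim_t;

/* The sender injects up to `burst` messages between two polls. */
static inline void sender_step(sim_t *s, size_t burst) {
    size_t n = s->sender_remaining < burst ? s->sender_remaining : burst;
    s->sender_remaining -= n;
    s->fifo += n;
}

/* One simulated receive call. Returns how many fragments this call
 * pulled out of the FIFO before it could return. */
static inline size_t sim_recv(sim_t *s, size_t burst) {
    size_t polled = 0;

    /* A message is already queued locally: match it without polling. */
    if (s->unexpected > 0) {
        s->unexpected--;
        return 0;
    }

    /* Models "goto recheck_peer": keep polling while the FIFO yields
     * a fragment, so the loop ends only once the sender lets up. */
    for (;;) {
        sender_step(s, burst);
        if (s->fifo == 0)
            break;
        s->fifo--;
        s->unexpected++;
        polled++;
    }

    /* Finally satisfy this receive from what was drained. */
    if (s->unexpected > 0)
        s->unexpected--;
    return polled;
}
```

Starting from `sim_t s = {1000, 0, 0}`, the first `sim_recv(&s, 1)` drains all 1000 messages before returning, and every later call is satisfied from the unexpected queue without touching the FIFO.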

Not sure yet how I want to manage this. The bottom line might be that
if the MPI application has no flow control, the underlying MPI
implementation is going to have to do something that won't make everyone
happy. Oh well. At least the program makes progress and completes in
reasonable time.