
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: sm Latency
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-01-21 01:35:01


Richard Graham wrote:
On 1/20/09 2:08 PM, "Eugene Loh" <Eugene.Loh@sun.com> wrote:
Richard Graham wrote:
First, the performance improvements look really nice.
A few questions:
  - How much of an abstraction violation does this introduce?
It doesn't need to be much of an abstraction violation at all if, by that, we mean teaching the BTL about the match header.  We just need to make some choices, and I flagged that one for better visibility.

>> I really don't see how teaching the BTL about matching will help much (it will save a subroutine call).  As I understand
>> the proposal, you aim to selectively pull items out of the FIFOs; this will break the FIFOs, as they assume contiguous
>> entries.  Logic to manage holes will need to be added.
No.  It's still a FIFO.  You look at the tail of the FIFO.  If you can handle what you see there, you pop that item off and handle it.  If you can't, you punt and return control to the ULP, which handles things in the traditional (and heavier-weight) way.  If the item of interest isn't at the tail, you won't see it.
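A minimal sketch of the tail-peek fast path described above, in C. Everything here is hypothetical: `match_hdr_t`, `sm_fifo_t`, and `fifo_try_fast_match` are illustrative names, not the sm BTL's actual API; matching is reduced to an exact (source, tag) comparison with no wildcards; and the FIFO lives in ordinary process memory without the atomics or memory barriers a real cross-process queue would need. What it shows is that the reader only ever inspects and pops the item at the tail, so entries leave in order and no holes appear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical match header carried in each fragment (no wildcards). */
typedef struct {
    int src;  /* sending rank */
    int tag;  /* MPI tag */
} match_hdr_t;

#define FIFO_SLOTS 8  /* hypothetical fixed capacity */

/* Hypothetical single-reader FIFO; the reader consumes at `tail`.
 * Plain fields, no atomics: a single-process illustration only. */
typedef struct {
    match_hdr_t slot[FIFO_SLOTS];
    size_t head;  /* next slot the writer fills */
    size_t tail;  /* next slot the reader consumes */
} sm_fifo_t;

static bool fifo_push(sm_fifo_t *f, match_hdr_t h) {
    if (f->head - f->tail == FIFO_SLOTS) return false;  /* full */
    f->slot[f->head % FIFO_SLOTS] = h;
    f->head++;
    return true;
}

/* Fast path: look only at the tail item.  If it is the message the
 * caller wants, pop it and return true.  Otherwise leave the FIFO
 * untouched and return false so the upper layer takes the usual path.
 * Items are never removed out of order, so no holes are created. */
static bool fifo_try_fast_match(sm_fifo_t *f, int want_src, int want_tag) {
    assert(f != NULL);
    if (f->tail == f->head) return false;  /* empty */
    const match_hdr_t *h = &f->slot[f->tail % FIFO_SLOTS];
    if (h->src != want_src || h->tag != want_tag) return false;  /* punt */
    f->tail++;  /* pop strictly in order */
    return true;
}
```

When the tail item does not match, the function returns `false` and the FIFO is unchanged, which corresponds to punting to the ULP.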
This looks like the BTL needs to start “knowing” about MPI-level semantics.
That's one option.  There are other options.

>> Such as?
PML callback.  Jeff's question about how much performance (if any) one loses with a callback is a good one.  If I were less lazy (and had more infinite time), I would have tested that before sending out the RFC.  As it was, I wanted to see how much pushback there would be on the "abstraction violation" issue.  Enough, it turns out, to try the experiment.  I'll try to test it out and report back.
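The PML-callback option can be sketched the same way, again with hypothetical names (`frag_t`, `pml_fast_cb_t`, `btl_poll_fast`, `accept_if_tag`) rather than real Open MPI interfaces, and again as a single-process illustration without cross-process synchronization. The BTL treats each fragment as opaque bytes and asks a callback registered by the upper layer whether it wants the tail item, so all knowledge of the match header stays on the PML side. The cost Jeff asked about corresponds to the indirect call through `fast_cb` on every attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define QUEUE_SLOTS 8  /* hypothetical fixed capacity */

/* Opaque to the BTL: it knows only where the payload is and how long. */
typedef struct {
    const void *data;
    size_t len;
} frag_t;

/* Hypothetical callback registered by the PML.  Returns true if it
 * consumed the fragment, false to decline it. */
typedef bool (*pml_fast_cb_t)(const frag_t *frag, void *ctx);

typedef struct {
    frag_t items[QUEUE_SLOTS];
    size_t head;            /* next slot a sender fills */
    size_t tail;            /* next slot the reader consumes */
    pml_fast_cb_t fast_cb;  /* NULL means no fast path registered */
    void *cb_ctx;
} btl_queue_t;

static bool btl_push(btl_queue_t *q, frag_t f) {
    if (q->head - q->tail == QUEUE_SLOTS) return false;  /* full */
    q->items[q->head % QUEUE_SLOTS] = f;
    q->head++;
    return true;
}

/* BTL side: peek at the tail and let the PML decide.  No match-header
 * knowledge here; the price is one indirect call per attempt. */
static bool btl_poll_fast(btl_queue_t *q) {
    assert(q != NULL);
    if (q->tail == q->head || q->fast_cb == NULL) return false;
    const frag_t *f = &q->items[q->tail % QUEUE_SLOTS];
    if (!q->fast_cb(f, q->cb_ctx)) return false;  /* declined: slow path */
    q->tail++;
    return true;
}

/* PML side (hypothetical): treat the first int of the payload as a tag
 * and accept the fragment only if it equals the tag pointed to by ctx. */
static bool accept_if_tag(const frag_t *frag, void *ctx) {
    int tag;
    if (frag->len < sizeof tag) return false;
    memcpy(&tag, frag->data, sizeof tag);
    return tag == *(const int *)ctx;
}
```

Swapping the callback changes the matching policy without touching the BTL, which is the abstraction benefit; whether the extra call is measurable is exactly the experiment mentioned above.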
If you replace the FIFOs with a single linked list per process in shared memory, with senders to this process adding match envelopes atomically and each process reading its own linked list (multiple writers and a single reader in the non-threaded case), there will be only one place to poll, regardless of the number of processes involved in the run.
*) Doesn't strike me as a "simple" change.
Let me be clear that I can see many benefits to this approach and don't think it's prohibitively hard.  So, I'm not trying to shoot this approach down entirely.  I do have the proposed approach implemented, though; it seems like a smaller change in behavior from what we have today, and many of its optimizations are unrelated to polling (and hence to the "single queue" proposal).
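For comparison, the single-list alternative from Richard's proposal (many senders appending match envelopes atomically, one reader per process) might look like the following C11 sketch. It is an illustration under loud assumptions, not a proposed implementation: the names (`envelope_t`, `inbox_t`, `inbox_post`, `inbox_drain`) are hypothetical, the links are raw pointers where a real shared-memory segment would need offsets (segments can map at different addresses in different processes), and the technique shown, a lock-free compare-and-swap push plus a detach-and-reverse drain, is just one common way to build a multiple-writer, single-reader queue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical match envelope.  In a real shared-memory segment `next`
 * would be an offset rather than a pointer. */
typedef struct envelope {
    int src, tag;
    struct envelope *next;
} envelope_t;

/* One inbox per receiving process: many writers, one reader. */
typedef struct {
    _Atomic(envelope_t *) top;
} inbox_t;

/* Writer side: lock-free push.  Release ordering publishes e->next and
 * the envelope contents before e becomes visible to the reader. */
static void inbox_post(inbox_t *in, envelope_t *e) {
    envelope_t *old = atomic_load_explicit(&in->top, memory_order_relaxed);
    do {
        e->next = old;  /* `old` is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 &in->top, &old, e,
                 memory_order_release, memory_order_relaxed));
}

/* Reader side: the single place to poll.  Detach everything posted so
 * far in one atomic exchange, then reverse it to restore posting order. */
static envelope_t *inbox_drain(inbox_t *in) {
    assert(in != NULL);
    envelope_t *lifo = atomic_exchange_explicit(&in->top, NULL,
                                                memory_order_acquire);
    envelope_t *fifo = NULL;
    while (lifo != NULL) {
        envelope_t *next = lifo->next;
        lifo->next = fifo;
        fifo = lifo;
        lifo = next;
    }
    return fifo;
}
```

Because the reader takes the whole list in one exchange instead of popping nodes with CAS, the push loop does not run into the ABA problem; the trade-off is reversing each batch on the read side. Each sender's envelopes come out in the order that sender posted them, which is what MPI's non-overtaking rule requires.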