Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: sm Latency
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-01-20 14:08:24

Richard Graham wrote:
First, the performance improvements look really nice.
A few questions:
  - How much of an abstraction violation does this introduce?
Doesn't need to be much of an abstraction violation at all if, by that, we mean teaching the BTL about the match header.  Just need to make some choices and I flagged that one for better visibility.
This looks like the btl needs to start "knowing" about MPI level semantics.
That's one option.  There are other options.
Currently, the btl purposefully is ulp agnostic.
What's ULP?
I ask for 2 reasons
       - you mention having the btl look at the match header (if I understood correctly)
Right, both to know if there is a match when the user had MPI_ANY_TAG and to extract values to populate the MPI_Status variable.  There are other alternatives, like calling back the PML.
       - not clear to me what you mean by returning the header to the list if the irecv does not complete.  If it does not complete, why not just pass the header back for further processing, if all this is happening at the pml level ?
I was trying to read the FIFO to see what's on there.  If it's something we can handle, we take it and handle it.  If it's too complicated, then we just balk and tell the upper layer that we're declining any possible action.  That just seemed to me to be the cleanest approach.

Here's an analogy.  Let's say you have a house problem.  You don't know how bad it is.  You think you might have to hire an expensive contractor to do lots of work, but some local handyman thinks he can do it quickly and cheaply.  So, you have the handyman look at the job to decide how involved it is.  Let's say the handyman discovers that it is, indeed, a big job.  How would you like things left at this point?  Two options:

*) Handyman says this is a big job.  Bring in the expensive contractor and big equipment.  I left everything as I found it.  Or,

*) Handyman says, "I took apart the this-and-this and I bought a bunch of supplies.  I ripped out the south wall.  The water to the house is turned off.  Etc."  You (and whoever has to come in to actually do the work) would probably prefer that nothing had been started.

I thought it was cleaner to go the "do the whole job or don't do any of it" approach.
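
To make the "do the whole job or don't do any of it" idea concrete, here is a minimal sketch in C. It is not the actual sm BTL code: the types `match_hdr_t`, `fifo_t`, `posted_recv_t`, `status_t`, the `is_simple` flag, and the constant `ANY_TAG` (a stand-in for MPI_ANY_TAG) are all hypothetical names invented for illustration. The fast path only peeks at the head of the FIFO. It advances the read index only after deciding it can finish the receive, so a declined attempt leaves everything as it was found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ANY_TAG (-1)      /* hypothetical stand-in for MPI_ANY_TAG */
#define FIFO_SLOTS 8

/* Hypothetical match header as the fast path might see it. */
typedef struct {
    int    src;           /* sending rank */
    int    tag;           /* message tag */
    size_t len;           /* payload length in bytes */
    bool   is_simple;     /* true if the message is complete and small */
} match_hdr_t;

typedef struct {
    match_hdr_t slots[FIFO_SLOTS];
    size_t head;          /* next slot to read */
    size_t tail;          /* next slot to write */
} fifo_t;

typedef struct { int src; int tag; } posted_recv_t;
typedef struct { int source; int tag; size_t count; } status_t;

/* Append a header; returns false if the FIFO is full. */
static bool fifo_push(fifo_t *f, match_hdr_t h)
{
    if (f->tail - f->head == FIFO_SLOTS) return false;
    f->slots[f->tail % FIFO_SLOTS] = h;
    f->tail++;
    return true;
}

/* Returns true if the receive was completed here. On false, the FIFO
 * and the status are untouched and the upper layer takes over. */
static bool try_fast_recv(fifo_t *f, const posted_recv_t *r, status_t *st)
{
    assert(f != NULL && r != NULL && st != NULL);
    if (f->head == f->tail) return false;                  /* nothing there */

    const match_hdr_t *h = &f->slots[f->head % FIFO_SLOTS]; /* peek only */
    bool tag_ok = (r->tag == ANY_TAG) || (r->tag == h->tag);
    if (!h->is_simple || h->src != r->src || !tag_ok)
        return false;                                      /* balk, no side effects */

    /* Commit: fill in the status from the header, then consume the slot. */
    st->source = h->src;
    st->tag    = h->tag;
    st->count  = h->len;
    f->head++;
    return true;
}
```

With a wildcard tag, the status fields come from the header that was actually matched, which is the reason something at this level has to read the match header (or call back into the PML to do so).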
  - The measurements seem to be very dual process specific.  Have you looked at the impact of these changes on other applications at the same process count ?  "Real" apps would be interesting, but even hpl would be a good start.
Many measurements are for np=2.  There are also np>2 HPCC pingpong results though.  (HPCC pingpong measures pingpong between two processes while the other np-2 processes sit in wait loops.)  HPCC also measures "ring" results... these are point-to-point with all np processes working.

HPL is pretty insensitive to point-to-point performance.  It either shows basically DGEMM performance or something is broken.

I haven't looked at "real" apps.

Let me be blunt about one thing:  much of this is motivated by simplistic (HPCC) benchmarks.  This is for two reasons:

1) These benchmarks are highly visible.
2) It's hard to tune real apps when you know the primitives need work.

Looking at real apps is important and I'll try to get to that.
  The current sm implementation is aimed only at small smp node count, which was really the only relevant type of systems when this code was written 5 years ago.  For large core counts there is a rather simple change that could be put in that is simple to implement, and will give you flat scaling for the sort of tests you are running.  If you replace the fifos with a single linked list per process in shared memory, with senders to this process adding match envelopes atomically, with each process reading its own linked list (multiple writers and single reader in non-threaded situation) there will be only one place to poll, regardless of the number of procs involved in the run.  One still needs other optimizations to lower the absolute latency - perhaps what you have suggested.  If one really has all N procs trying to write to the same fifo at once, performance will stink because of contention, but most apps don't have that behaviour.
Okay.  Yes, I am a fan of that approach.  But:

*) Doesn't strike me as a "simple" change.
*) Not sure this addresses all-to-all well.  E.g., let's say you post a receive for a particular source.  Do you then wade through a long FIFO to look for your match?

What the RFC talks about is not the last SM development we'll ever need.  It's only supposed to be one step forward from where we are today.  The "single queue per receiver" approach has many advantages, but I think it's a different topic.
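
For reference, the "single queue per receiver" idea discussed above can be sketched with C11 atomics. This is only an illustration under stated assumptions, not code from Open MPI: the names `envelope_t`, `recv_queue_t`, `rq_push`, `rq_drain`, and `find_from` are hypothetical, and the sketch uses raw pointers, which only works inside a single address space. A real shared-memory segment mapped at different addresses in each process would presumably need offsets instead. Senders push with a compare-and-swap loop. The single reader detaches the whole list in one exchange and reverses it to recover arrival order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical match envelope linked into the receiver's list. */
typedef struct envelope {
    int src;                    /* sending rank */
    int tag;                    /* message tag */
    struct envelope *next;
} envelope_t;

/* One queue per receiving process: many writers, one reader. */
typedef struct {
    _Atomic(envelope_t *) head;
} recv_queue_t;

/* Sender side: atomically push an envelope onto the receiver's list. */
static void rq_push(recv_queue_t *q, envelope_t *e)
{
    assert(q != NULL && e != NULL);
    envelope_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        e->next = old;          /* old is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 &q->head, &old, e,
                 memory_order_release, memory_order_relaxed));
}

/* Receiver side: detach everything in one atomic exchange, then
 * reverse the list so the oldest envelope comes first. */
static envelope_t *rq_drain(recv_queue_t *q)
{
    envelope_t *list = atomic_exchange_explicit(
        &q->head, (envelope_t *)NULL, memory_order_acquire);
    envelope_t *fifo = NULL;
    while (list != NULL) {
        envelope_t *next = list->next;
        list->next = fifo;
        fifo = list;
        list = next;
    }
    return fifo;
}

/* Linear search for the first envelope from a given source. */
static envelope_t *find_from(envelope_t *list, int src)
{
    for (; list != NULL; list = list->next)
        if (list->src == src) return list;
    return NULL;
}
```

The receiver polls exactly one location no matter how many senders exist, which is the flat-scaling argument. The cost shows up in `find_from`: a receive posted for a particular source has to walk past envelopes from every other sender, which is the all-to-all concern raised above.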