On 1/20/09 2:08 PM, "Eugene Loh" <Eugene.Loh_at_[hidden]> wrote:
> Richard Graham wrote:
>> Re: [OMPI devel] RFC: sm Latency First, the performance improvements look
>> really nice.
>> A few questions:
>> - How much of an abstraction violation does this introduce?
> Doesn't need to be much of an abstraction violation at all if, by that, we
> mean teaching the BTL about the match header. Just need to make some choices
> and I flagged that one for better visibility.
>>> >> I really don't see how teaching the btl about matching will help much (it
>>> will save a subroutine call). As I understand
>>> >> the proposal, you aim to selectively pull items out of the fifos. This
>>> will break the fifos, as they assume contiguous
>>> >> entries. Logic to manage holes will need to be added.
>> This looks like the btl needs to start "knowing" about MPI level semantics.
> That's one option. There are other options.
>>> >> Such as ?
>> Currently, the btl purposefully is ulp agnostic.
> What's ULP?
>>> >> Upper Level Protocol
>> I ask for 2 reasons
>> - you mention having the btl look at the match header (if I understood
> Right, both to know if there is a match when the user had MPI_ANY_TAG and to
> extract values to populate the MPI_Status variable. There are other
> alternatives, like calling back the PML.
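As a rough illustration of what "looking at the match header" could involve, here is a minimal C sketch. The struct layouts, the SKETCH_ANY_* wildcard values, and the function names are all assumptions made up for this example; they are not Open MPI's actual match header or PML interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative wildcard values; real MPI implementations define their own. */
#define SKETCH_ANY_SOURCE (-1)
#define SKETCH_ANY_TAG    (-2)

typedef struct { int ctx; int src; int tag; } match_hdr_t;   /* hypothetical header */
typedef struct { int ctx; int src; int tag; } posted_recv_t; /* what the user posted */
typedef struct { int source; int tag; } status_t;            /* stand-in for MPI_Status */

/* True if the incoming header satisfies the posted receive, honoring wildcards. */
static bool hdr_matches(const match_hdr_t *h, const posted_recv_t *r)
{
    assert(h != NULL && r != NULL);
    return h->ctx == r->ctx
        && (r->src == SKETCH_ANY_SOURCE || r->src == h->src)
        && (r->tag == SKETCH_ANY_TAG || r->tag == h->tag);
}

/* With wildcards, the actual source and tag are only known from the header. */
static void fill_status(const match_hdr_t *h, status_t *st)
{
    assert(h != NULL && st != NULL);
    st->source = h->src;
    st->tag = h->tag;
}
```

The alternative mentioned above, calling back the PML, would keep both of these functions on the PML side and have the BTL invoke them through a function pointer instead of reading the header itself.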
>> - not clear to me what you mean by returning the header to the list if
>> the irecv does not complete. If it does not complete, why not just pass the
>> header back for further processing, if all this is happening at the pml level
> I was trying to read the FIFO to see what's on there. If it's something we
> can handle, we take it and handle it. If it's too complicated, then we just
> balk and tell the upper layer that we're declining any possible action. That
> just seemed to me to be the cleanest approach.
>>> >> see the note above. The fifo logic would have to be changed to manage
>>> non-contiguous entries.
> Here's an analogy. Let's say you have a house problem. You don't know how
> bad it is. You think you might have to hire an expensive contractor to do
> lots of work, but some local handyman thinks he can do it quickly and cheaply.
> So, you have the handyman look at the job to decide how involved it is. Let's
> say the handyman discovers that it is, indeed, a big job. How would you like
> things left at this point? Two options:
> *) Handyman says this is a big job. Bring in the expensive contractor and big
> equipment. I left everything as I found it. Or,
> *) Handyman says, "I took apart the this-and-this and I bought a bunch of
> supplies. I ripped out the south wall. The water to the house is turned off.
> Etc." You (and whoever has to come in to actually do the work) would probably
> prefer that nothing had been started.
> I thought it was cleaner to go the "do the whole job or don't do any of it"
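The "leave everything as found" behaviour can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the ring-buffer FIFO, the `simple` flag, and the names fifo_t, frag_t, and try_fast_recv are hypothetical and not taken from the sm BTL. The sketch also only ever inspects the head entry, which is one way to avoid leaving holes in the FIFO; the RFC may do something different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SLOTS 8

/* Hypothetical fragment: 'simple' marks something the fast path can handle. */
typedef struct { int src; int tag; bool simple; } frag_t;

typedef struct {
    frag_t *slot[FIFO_SLOTS];
    size_t head;  /* index of the next entry to read */
    size_t tail;  /* index of the next free slot */
} fifo_t;

static bool fifo_push(fifo_t *f, frag_t *x)
{
    assert(x != NULL);
    if (f->tail - f->head == FIFO_SLOTS) return false;  /* full */
    f->slot[f->tail % FIFO_SLOTS] = x;
    f->tail++;
    return true;
}

/* Look at the head entry without removing it. */
static frag_t *fifo_peek(const fifo_t *f)
{
    return f->head == f->tail ? NULL : f->slot[f->head % FIFO_SLOTS];
}

/* Either consume the head entry completely, or return NULL ("declined")
 * having modified nothing, so the upper layer sees the FIFO as it was. */
static frag_t *try_fast_recv(fifo_t *f, int want_src, int want_tag)
{
    frag_t *x = fifo_peek(f);
    if (x == NULL || !x->simple || x->src != want_src || x->tag != want_tag)
        return NULL;
    f->head++;  /* the only state change, made after all checks pass */
    return x;
}
```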
>> - The measurements seem to be very dual process specific. Have you looked
>> at the impact of these changes on other applications at the same process
>> count ? "Real" apps would be interesting, but even hpl would be a good
> Many measurements are for np=2. There are also np>2 HPCC pingpong results
> though. (HPCC pingpong measures pingpong between two processes while np-2
> processes sit in wait loops.) HPCC also measures "ring" results... these are
> point-to-point with all np processes working.
> HPL is pretty insensitive to point-to-point performance. It either shows
> basically DGEMM performance or something is broken.
> I haven't looked at "real" apps.
> Let me be blunt about one thing: much of this is motivated by simplistic
> (HPCC) benchmarks. This is for two reasons:
> 1) These benchmarks are highly visible.
> 2) It's hard to tune real apps when you know the primitives need work.
> Looking at real apps is important and I'll try to get to that.
>>> >> don't disagree here at all. Just want to make sure that aiming at these
>>> important benchmarks does not
>>> >> harm what is really more important, the day to day use.
>> The current sm implementation is aimed only at small smp node count, which
>> was really the only relevant type of systems when this code was written 5
>> years ago. For large core counts there is a rather simple change that could
>> be put in, and it will give you flat scaling for the
>> sort of tests you are running. If you replace the fifos with a single linked
>> list per process in shared memory, with senders to this process adding match
>> envelopes atomically, with each process reading its own linked list (multiple
>> writers and single reader in non-threaded situation) there will be only one
>> place to poll, regardless of the number of procs involved in the run. One
>> still needs other optimizations to lower the absolute latency, perhaps what
>> you have suggested. If one really has all N procs trying to write to the
>> same fifo at once, performance will stink because of contention, but most
>> apps don't have that behaviour.
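A minimal C11 sketch of the "one list per receiver, many writers, one reader" idea described above. Everything here is an assumption for illustration: the envelope fields and the names inbox_post and inbox_drain are hypothetical, and a real shared-memory version would need offsets instead of raw pointers, since each process may map the segment at a different address.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical match envelope; field names are illustrative only. */
typedef struct envelope {
    struct envelope *next;
    int src;
    int tag;
} envelope_t;

/* One inbox per receiving process: the single place that process polls. */
typedef struct {
    _Atomic(envelope_t *) head;
} inbox_t;

static void inbox_init(inbox_t *q) { atomic_init(&q->head, NULL); }

/* Any sender: link the envelope in at the head with a compare-and-swap loop. */
static void inbox_post(inbox_t *q, envelope_t *e)
{
    assert(e != NULL);
    envelope_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        e->next = old;  /* 'old' is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 &q->head, &old, e,
                 memory_order_release, memory_order_relaxed));
}

/* The single reader: detach the whole list with one exchange, then reverse
 * it so envelopes come back in the order they were posted. */
static envelope_t *inbox_drain(inbox_t *q)
{
    envelope_t *lifo = atomic_exchange_explicit(&q->head, NULL,
                                                memory_order_acquire);
    envelope_t *fifo = NULL;
    while (lifo != NULL) {
        envelope_t *next = lifo->next;
        lifo->next = fifo;
        fifo = lifo;
        lifo = next;
    }
    return fifo;
}
```

Because the reader takes the entire list in one exchange rather than popping single nodes, it never dereferences a node that a writer could still be racing on. Contention among writers on the single head word is the cost, which is the concern raised above for the all-N-senders case.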
> Okay. Yes, I am a fan of that approach. But:
> *) Doesn't strike me as a "simple" change.
>>> >> instead of a fifo_write (or whatever it is called), an entry is posted
>>> to the "head" of a linked list, and the read is
>>> >> removing an entry from the list. If one cares about memory locality, you
>>> need to return things to the appropriate
>>> >> list, which is implicit in the fifo. More objects need to be allocated
>>> in shared memory.
> *) Not sure this addresses all-to-all well. E.g., let's say you post a
> receive for a particular source. Do you then wade through a long FIFO to look
> for your match?
>>> >> to pull things off the free list, you do need to look through what is on
>>> the queue. If it is not the match you are
>>> >> looking for, just post it to the appropriate local list for later use,
>>> just like the matching logic does now. As
>>> >> I mentioned this am, if you want, you don't have to have a single list
>>> per destination, you could have several lists,
>>> >> if you are concerned about too much contention.
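The "post it to the appropriate local list for later use" step might look something like the C sketch below. The types and the function take_match are hypothetical, and the sketch matches on source only to stay short. A real receive path would first search the local list of earlier arrivals before looking at new ones.

```c
#include <assert.h>
#include <stddef.h>

typedef struct env { struct env *next; int src; int tag; } env_t;

/* Receiver-private list of envelopes that arrived before a matching receive. */
typedef struct { env_t *first; env_t *last; } local_list_t;

static void local_append(local_list_t *l, env_t *e)
{
    assert(e != NULL);
    e->next = NULL;
    if (l->last != NULL) l->last->next = e; else l->first = e;
    l->last = e;
}

/* Walk envelopes just taken off the shared queue (in arrival order).
 * Return the first one from want_src; move every other envelope onto the
 * local "unexpected" list, preserving arrival order. */
static env_t *take_match(env_t *arrived, int want_src, local_list_t *unexpected)
{
    env_t *match = NULL;
    while (arrived != NULL) {
        env_t *next = arrived->next;
        if (match == NULL && arrived->src == want_src) {
            match = arrived;
            match->next = NULL;
        } else {
            local_append(unexpected, arrived);
        }
        arrived = next;
    }
    return match;
}
```

Since the unexpected list is private to the receiver, it needs no atomics, so a long backlog costs a local list walk and does not add contention on the shared queue.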
> What the RFC talks about is not the last SM development we'll ever need. It's
> only supposed to be one step forward from where we are today. The "single
> queue per receiver" approach has many advantages, but I think it's a different
>>> >> This is a big enough proposed change that a call to describe this may be
>>> in order. I will state up front I am against
>>> >> introducing MPI semantics into the btl. Not against having that sort of
>>> option in the code base, but do want to
>>> >> preserve an option like the pml/btl abstraction.