Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] RFC: sm Latency
From: Richard Graham (rlgraham_at_[hidden])
Date: 2009-01-20 21:33:20

On 1/20/09 2:08 PM, "Eugene Loh" <Eugene.Loh_at_[hidden]> wrote:

> Richard Graham wrote:
>> First, the performance improvements look
>> really nice.
>> A few questions:
>> - How much of an abstraction violation does this introduce?
> Doesn't need to be much of an abstraction violation at all if, by that, we
> mean teaching the BTL about the match header. Just need to make some choices
> and I flagged that one for better visibility.
>>> >> I really don't see how teaching the btl about matching will help much (it
>>> will save a subroutine call). As I understand
>>> >> the proposal, you aim to selectively pull items out of the fifos - this
>>> will break the fifos, as they assume contiguous
>>> >> entries. Logic to manage holes will need to be added.
>> This looks like the btl needs to start "knowing" about MPI level semantics.
> That's one option. There are other options.
>>> >> Such as ?
>> Currently, the btl purposefully is ulp agnostic.
> What's ULP?
>>> >> Upper Level Protocol
>> I ask for 2 reasons
>> - you mention having the btl look at the match header (if I understood
>> correctly)
> Right, both to know if there is a match when the user had MPI_ANY_TAG and to
> extract values to populate the MPI_Status variable. There are other
> alternatives, like calling back the PML.
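To make the header check described above concrete, here is a minimal sketch in C. It is not Open MPI code: the `sk_` structures, field names, and wildcard constants are hypothetical stand-ins, and real MPI matching has more cases than shown. It only illustrates the two things the header would be read for: deciding whether a wildcard receive matches, and filling in the status fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real Open MPI types and constants differ. */
enum { SK_ANY_SOURCE = -1, SK_ANY_TAG = -1 };

typedef struct { int ctx; int src; int tag; int nbytes; } sk_match_hdr_t;
typedef struct { int ctx; int src; int tag; } sk_posted_recv_t;
typedef struct { int source; int tag; int count; } sk_status_t;

/* Returns true if the incoming header satisfies the posted receive.
 * On a match, the status is populated from the header, which is the
 * only place the actual source and tag are known for a wildcard receive. */
static bool sk_try_match(const sk_posted_recv_t *req,
                         const sk_match_hdr_t *hdr,
                         sk_status_t *status)
{
    assert(req != NULL && hdr != NULL && status != NULL);

    if (req->ctx != hdr->ctx)
        return false;                       /* different communicator */
    if (req->src != SK_ANY_SOURCE && req->src != hdr->src)
        return false;
    if (req->tag != SK_ANY_TAG && req->tag != hdr->tag)
        return false;

    status->source = hdr->src;
    status->tag    = hdr->tag;
    status->count  = hdr->nbytes;
    return true;
}
```

Whether a function like this lives in the BTL or is reached through a callback into the PML is exactly the design choice under discussion; the sketch takes no position on that.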
>> - not clear to me what you mean by returning the header to the list if
>> the irecv does not complete. If it does not complete, why not just pass the
>> header back for further processing, if all this is happening at the pml level
>> ?
> I was trying to read the FIFO to see what's on there. If it's something we
> can handle, we take it and handle it. If it's too complicated, then we just
> balk and tell the upper layer that we're declining any possible action. That
> just seemed to me to be the cleanest approach.
>>> >> see the note above. The fifo logic would have to be changed to manage
>>> non-contiguous entries.
> Here's an analogy. Let's say you have a house problem. You don't know how
> bad it is. You think you might have to hire an expensive contractor to do
> lots of work, but some local handyman thinks he can do it quickly and cheaply.
> So, you have the handyman look at the job to decide how involved it is. Let's
> say the handyman discovers that it is, indeed, a big job. How would you like
> things left at this point? Two options:
> *) Handyman says this is a big job. Bring in the expensive contractor and big
> equipment. I left everything as I found it. Or,
> *) Handyman says, "I took apart the this-and-this and I bought a bunch of
> supplies. I ripped out the south wall. The water to the house is turned off.
> Etc." You (and whoever has to come in to actually do the work) would probably
> prefer that nothing had been started.
> I thought it was cleaner to go the "do the whole job or don't do any of it"
> approach.
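One way to read the "do the whole job or don't do any of it" approach is as a peek at the head of the FIFO followed by either a full consume or no change at all. The sketch below is a single-process illustration in C under that assumption; the `sk_` names, the `kind` field, and the ring layout are hypothetical and do not reflect the actual sm BTL FIFO, which also needs shared-memory synchronization that is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_FIFO_SLOTS 8u

enum { SK_SIMPLE = 0, SK_COMPLEX = 1 };

typedef struct { int kind; int payload; } sk_frag_t;

typedef struct {
    sk_frag_t *slot[SK_FIFO_SLOTS];
    unsigned head;   /* index of next entry to read  */
    unsigned tail;   /* index of next entry to write */
} sk_fifo_t;

static bool sk_fifo_write(sk_fifo_t *f, sk_frag_t *frag)
{
    assert(f != NULL && frag != NULL);
    if (f->tail - f->head == SK_FIFO_SLOTS)
        return false;                          /* full */
    f->slot[f->tail % SK_FIFO_SLOTS] = frag;
    f->tail++;
    return true;
}

/* Look at the head entry without removing it. */
static sk_frag_t *sk_fifo_peek(const sk_fifo_t *f)
{
    return (f->head == f->tail) ? NULL : f->slot[f->head % SK_FIFO_SLOTS];
}

/* Fast path: consume the head entry only if it can be handled completely.
 * Otherwise return false and leave the FIFO exactly as it was found, so
 * the general path sees an untouched queue. */
static bool sk_fast_recv(sk_fifo_t *f, int *out)
{
    sk_frag_t *frag = sk_fifo_peek(f);
    if (frag == NULL || frag->kind != SK_SIMPLE)
        return false;                          /* decline, nothing changed */
    *out = frag->payload;
    f->head++;                                 /* commit: entry consumed */
    return true;
}
```

Because only the head entry is ever inspected or removed, this variant never leaves a hole in the FIFO. Pulling entries from the middle would be a different design and would need the hole-management logic raised in the replies.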
>> - The measurements seem to be very dual process specific. Have you looked
>> at the impact of these changes on other applications at the same process
>> count ? "Real" apps would be interesting, but even hpl would be a good
>> start.
> Many measurements are for np=2. There are also np>2 HPCC pingpong results
> though. (HPCC pingpong measures pingpong between two processes while np-2
> processes sit in wait loops.) HPCC also measures "ring" results... these are
> point-to-point with all np processes working.
> HPL is pretty insensitive to point-to-point performance. It either shows
> basically DGEMM performance or something is broken.
> I haven't looked at "real" apps.
> Let me be blunt about one thing: much of this is motivated by simplistic
> (HPCC) benchmarks. This is for two reasons:
> 1) These benchmarks are highly visible.
> 2) It's hard to tune real apps when you know the primitives need work.
> Looking at real apps is important and I'll try to get to that.
>>> >> don't disagree here at all. Just want to make sure that aiming at these
>>> important benchmarks does not
>>> >> harm what is really more important - the day to day use.
>> The current sm implementation is aimed only at small smp node count, which
>> was really the only relevant type of systems when this code was written 5
>> years ago. For large core counts there is a rather simple change that could
>> be put in, and it will give you flat scaling for the
>> sort of tests you are running. If you replace the fifos with a single linked
>> list per process in shared memory, with senders to this process adding match
>> envelopes atomically, with each process reading its own linked list (multiple
>> writers and single reader in non-threaded situation) there will be only one
>> place to poll, regardless of the number of procs involved in the run. One
>> still needs other optimizations to lower the absolute latency - perhaps what
>> you have suggested. If one really has all N procs trying to write to the
>> same fifo at once, performance will stink because of contention, but most
>> apps don't have that behaviour.
> Okay. Yes, I am a fan of that approach. But:
> *) Doesn't strike me as a "simple" change.
>>> >> instead of a fifo_write (or whatever it is called), an entry is posted
>>> to the "head" of a linked list, and the read is
>>> >> removing an entry from the list. If one cares about memory locality, you
>>> need to return things to the appropriate
>>> >> list, which is implicit in the fifo. More objects need to be allocated
>>> in shared memory.
> *) Not sure this addresses all-to-all well. E.g., let's say you post a
> receive for a particular source. Do you then wade through a long FIFO to look
> for your match?
>>> >> to pull things off the free list, you do need to look through what is on
>>> the queue. If it is not the match you are
>>> >> looking for, just post it to the appropriate local list for later use,
>>> just like the matching logic does now. As
>>> >> I mentioned this morning, if you want, you don't have to have a single list
>>> per destination, you could have several lists,
>>> >> if you are concerned about too much contention.
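The single-list-per-receiver idea discussed above can be sketched with C11 atomics as a multiple-writer, single-reader list: senders push envelopes onto one shared head with a compare-and-swap, and the receiver detaches the whole list in one exchange, then parks anything it is not looking for on a private local list. This is only an illustration under several assumptions: the `sk_` names are hypothetical, plain pointers stand in for the offsets a real shared-memory segment would need, and matching is reduced to exact source and tag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct sk_env {
    struct sk_env *next;
    int src;
    int tag;
} sk_env_t;

typedef struct {
    _Atomic(sk_env_t *) head;    /* shared: every sender pushes here      */
    sk_env_t *unexpected_head;   /* receiver-private, oldest entry first  */
    sk_env_t *unexpected_tail;
} sk_inbox_t;

static void sk_inbox_init(sk_inbox_t *in)
{
    atomic_init(&in->head, NULL);
    in->unexpected_head = in->unexpected_tail = NULL;
}

/* Sender side: lock-free push onto the receiver's single list. */
static void sk_post(sk_inbox_t *in, sk_env_t *e)
{
    assert(in != NULL && e != NULL);
    sk_env_t *old = atomic_load_explicit(&in->head, memory_order_relaxed);
    do {
        e->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &in->head, &old, e,
                 memory_order_release, memory_order_relaxed));
}

/* Receiver side: detach everything posted so far, restore arrival order,
 * and append it to the private list of not-yet-matched envelopes. */
static void sk_drain(sk_inbox_t *in)
{
    sk_env_t *n = atomic_exchange_explicit(&in->head, NULL,
                                           memory_order_acquire);
    sk_env_t *ordered = NULL;
    while (n != NULL) {              /* pushes are LIFO, so reverse */
        sk_env_t *next = n->next;
        n->next = ordered;
        ordered = n;
        n = next;
    }
    if (ordered == NULL)
        return;
    if (in->unexpected_tail != NULL)
        in->unexpected_tail->next = ordered;
    else
        in->unexpected_head = ordered;
    while (ordered->next != NULL)
        ordered = ordered->next;
    in->unexpected_tail = ordered;
}

/* Receiver side: one place to poll, then search the local list. */
static sk_env_t *sk_match(sk_inbox_t *in, int src, int tag)
{
    sk_drain(in);
    sk_env_t *prev = NULL;
    for (sk_env_t *e = in->unexpected_head; e != NULL; prev = e, e = e->next) {
        if (e->src == src && e->tag == tag) {
            if (prev != NULL) prev->next = e->next;
            else              in->unexpected_head = e->next;
            if (in->unexpected_tail == e)
                in->unexpected_tail = prev;
            e->next = NULL;
            return e;
        }
    }
    return NULL;                     /* non-matching entries stay parked */
}
```

The receiver polls one location no matter how many senders exist, which is the flat-scaling property described above. The cost is that every sender contends on the same head pointer, so splitting into several lists per receiver, as suggested, trades some polling for less contention.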
> What the RFC talks about is not the last SM development we'll ever need. It's
> only supposed to be one step forward from where we are today. The "single
> queue per receiver" approach has many advantages, but I think it's a different
> topic.
>>> >> This is a big enough proposed change that a call to describe this may be
>>> in order. I will state up front I am against
>>> >> introducing MPI semantics into the btl. Not against having that sort of
>>> option in the code base, but do want to
>>> >> preserve an option like the pml/btl abstraction.
> Rich