On 1/20/09 2:08 PM, "Eugene Loh" <Eugene.Loh@sun.com> wrote:
Richard Graham wrote:
Re: [OMPI devel] RFC: sm Latency

First, the performance improvements look really nice.

Doesn't need to be much of an abstraction violation at all if, by that, we mean teaching the BTL about the match header. Just need to make some choices, and I flagged that one for better visibility.
A few questions:
- How much of an abstraction violation does this introduce?
>> I really don’t see how teaching the btl about matching will help much (it will save a subroutine call). As I understand
>> the proposal, you aim to selectively pull items out of the fifos – this will break the fifos, as they assume contiguous
>> entries. Logic to manage holes will need to be added.
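Richard's point about contiguous entries can be seen in a minimal sketch of such a circular FIFO (all names below are invented for illustration; the real sm BTL code differs): the head and tail indices advance strictly in order, so there is simply no way to represent a consumed slot in the middle of the live region.

```c
#include <stddef.h>

/* Minimal contiguous circular FIFO sketch (invented names).  Every slot
 * between tail and head is assumed live; selectively removing a middle
 * entry would leave a "hole" this index arithmetic cannot express. */
#define FIFO_SIZE 8u                  /* power of two for cheap wrap-around */

typedef struct {
    void    *slot[FIFO_SIZE];
    unsigned head;                    /* next slot to write */
    unsigned tail;                    /* next slot to read  */
} fifo_t;

static int fifo_write(fifo_t *f, void *item) {
    if (f->head - f->tail == FIFO_SIZE) return -1;   /* full  */
    f->slot[f->head++ % FIFO_SIZE] = item;
    return 0;
}

static void *fifo_read(fifo_t *f) {
    if (f->head == f->tail) return NULL;             /* empty */
    return f->slot[f->tail++ % FIFO_SIZE];           /* strictly oldest-first */
}
```

Supporting out-of-order removal would mean either per-slot "occupied" flags plus compaction, or abandoning the index scheme entirely, which is the extra hole-management logic referred to above.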
This looks like the btl needs to start “knowing” about MPI level semantics.

That's one option. There are other options.
>> Such as?
Currently, the btl purposefully is ulp agnostic.

What's ULP?
>> Upper Level Protocol
I ask for 2 reasons:

Right, both to know if there is a match when the user had MPI_ANY_TAG and to extract values to populate the MPI_Status variable. There are other alternatives, like calling back the PML.
- you mention having the btl look at the match header (if I understood correctly)
- not clear to me what you mean by returning the header to the list if the irecv does not complete. If it does not complete, why not just pass the header back for further processing, if all this is happening at the pml level?

I was trying to read the FIFO to see what's on there. If it's something we can handle, we take it and handle it. If it's too complicated, then we just balk and tell the upper layer that we're declining any possible action. That just seemed to me to be the cleanest approach.
>> see the note above. The fifo logic would have to be changed to manage non-contiguous entries.
Here's an analogy. Let's say you have a house problem. You don't know how bad it is. You think you might have to hire an expensive contractor to do lots of work, but some local handyman thinks he can do it quickly and cheaply. So, you have the handyman look at the job to decide how involved it is. Let's say the handyman discovers that it is, indeed, a big job. How would you like things left at this point? Two options:
*) Handyman says this is a big job. Bring in the expensive contractor and big equipment. I left everything as I found it. Or,
*) Handyman says, "I took apart the this-and-this and I bought a bunch of supplies. I ripped out the south wall. The water to the house is turned off. Etc."

You (and whoever has to come in to actually do the work) would probably prefer that nothing had been started.
I thought it was cleaner to go the "do the whole job or don't do any of it" approach.
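That "whole job or none" fast path could be sketched roughly as follows (every type and function name here is invented for illustration; this is not the RFC's actual code). The receiver peeks at the head fragment without dequeuing it, and commits only once it knows the simple path can finish; anything complicated gets declined with zero side effects, leaving a clean state for the upper layer.

```c
#include <string.h>

/* Hypothetical all-or-nothing receive fast path (invented names). */
enum { SM_DONE, SM_DECLINED };

typedef struct {
    int         tag, src;     /* stand-ins for the match header fields */
    size_t      len;
    const char *payload;
} frag_t;

/* "Simple" means: exact tag and source match, and the data fits. */
static int simple_match(const frag_t *f, int want_tag, int want_src,
                        size_t bufsz) {
    return f->tag == want_tag && f->src == want_src && f->len <= bufsz;
}

static int try_fast_recv(frag_t *queue, int *count,
                         int want_tag, int want_src,
                         char *buf, size_t bufsz) {
    if (*count == 0) return SM_DECLINED;         /* nothing queued      */
    frag_t *head = &queue[0];                    /* peek, don't dequeue */
    if (!simple_match(head, want_tag, want_src, bufsz))
        return SM_DECLINED;                      /* leave queue intact  */
    memcpy(buf, head->payload, head->len);       /* commit: do it all   */
    memmove(queue, queue + 1, (size_t)(--*count) * sizeof *queue);
    return SM_DONE;
}
```

The key property, matching the handyman analogy, is that a DECLINED return guarantees the queue looks exactly as it did before the call.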
- The measurements seem to be very dual process specific. Have you looked at the impact of these changes on other applications at the same process count? “Real” apps would be interesting, but even hpl would be a good start.

Many measurements are for np=2. There are also np>2 HPCC pingpong results, though. (HPCC pingpong measures pingpong between two processes while np-2 processes sit in wait loops.) HPCC also measures "ring" results; these are point-to-point with all np processes working.
HPL is pretty insensitive to point-to-point performance. It either shows basically DGEMM performance or something is broken.
I haven't looked at "real" apps.
Let me be blunt about one thing: much of this is motivated by simplistic (HPCC) benchmarks. This is for two reasons:
1) These benchmarks are highly visible.
2) It's hard to tune real apps when you know the primitives need work.
Looking at real apps is important and I'll try to get to that.
>> don’t disagree here at all. Just want to make sure that aiming at these important benchmarks does not
>> harm what is really more important – the day to day use.
The current sm implementation is aimed only at small smp node counts, which were really the only relevant type of system when this code was written 5 years ago. For large core counts there is a rather simple change that could be put in that is easy to implement and will give you flat scaling for the sort of tests you are running. If you replace the fifos with a single linked list per process in shared memory – senders to this process adding match envelopes atomically, each process reading its own linked list (multiple writers and a single reader in the non-threaded situation) – there will be only one place to poll, regardless of the number of procs involved in the run. One still needs other optimizations to lower the absolute latency – perhaps what you have suggested. If one really has all N procs trying to write to the same fifo at once, performance will stink because of contention, but most apps don’t have that behaviour.

Okay. Yes, I am a fan of that approach. But:
*) Doesn't strike me as a "simple" change.
>> instead of a fifo_write (or whatever it is called), an entry is posted to the “head” of a linked list, and the read is
>> removing an entry from the list. If one cares about memory locality, you need to return things to the appropriate
>> list, which is implicit in the fifo. More objects need to be allocated in shared memory.
*) Not sure this addresses all-to-all well. E.g., let's say you post a receive for a particular source. Do you then wade through a long FIFO to look for your match?
>> to pull things off the free list, you do need to look through what is on the queue. If it is not the match you are
>> looking for, just post it to the appropriate local list for later use, just like the matching logic does now. As
>> I mentioned this morning, if you want, you don’t have to have a single list per destination – you could have several lists
>> if you are concerned about too much contention.
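The single-queue-per-receiver idea discussed above could look roughly like this in C11 (all names invented; the actual Open MPI data structures differ): many senders push envelopes with a compare-and-swap, and the sole reader detaches the whole list with one atomic exchange, so there is exactly one place to poll no matter how many peers exist. Note two caveats glossed over here: a CAS-based push yields the list newest-first (LIFO), so the matching logic must reverse or tolerate that order, and returning envelopes to per-sender free lists is omitted.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Sketch of one shared receive queue per process (invented names):
 * multiple writers, single reader. */
typedef struct envelope {
    struct envelope *next;
    int tag, src;                 /* stand-ins for the match envelope */
} envelope_t;

typedef struct {
    _Atomic(envelope_t *) head;   /* shared between all senders and the owner */
} recv_queue_t;

/* Any sender: lock-free push onto the head. */
static void queue_push(recv_queue_t *q, envelope_t *e) {
    envelope_t *old = atomic_load(&q->head);
    do {
        e->next = old;            /* on CAS failure, old is reloaded */
    } while (!atomic_compare_exchange_weak(&q->head, &old, e));
}

/* Owner only: detach everything in one shot and walk it privately. */
static envelope_t *queue_detach_all(recv_queue_t *q) {
    return atomic_exchange(&q->head, NULL);
}
```

In a real shared-memory implementation the `next` pointers would have to be offsets (or the segment mapped at a common base address), and envelope reclamation would need care to avoid ABA problems; this sketch only shows the polling structure.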
What the RFC talks about is not the last SM development we'll ever need. It's only supposed to be one step forward from where we are today. The "single queue per receiver" approach has many advantages, but I think it's a different topic.
>> This is a big enough proposed change that a call to describe it may be in order. I will state up front I am against
>> introducing MPI semantics into the btl. I am not against having that sort of option in the code base, but I do want to
>> preserve an option like the pml/btl abstraction.