Richard Graham wrote:
Re: [OMPI devel] RFC: sm Latency
First, the performance improvements look
Doesn't need to be much of an abstraction violation at all if, by that,
we mean teaching the BTL about the match header. Just need to make
some choices and I flagged that one for better visibility.
A few questions:
- How much of an abstraction violation does this introduce?
This looks like the btl needs to start
"knowing" about MPI level semantics.
That's one option. There are other options.
Currently, the btl purposefully is ulp
I ask for 2 reasons
Right, both to know if there is a match when the user had MPI_ANY_TAG
and to extract values to populate the MPI_Status variable. There are
other alternatives, like calling back the PML.
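To make the point concrete, here is a minimal sketch of what "looking at the match header" could involve: comparing the posted receive against the header, honoring wildcards, and copying the values that would populate the status. All names here (`sketch_match_hdr_t`, `sketch_status_t`, `sketch_try_match`, the `SKETCH_ANY_*` constants) are hypothetical stand-ins, not Open MPI's actual types, and a real match header would carry more than this (for example, a communicator context), which the sketch leaves out.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for MPI_ANY_SOURCE / MPI_ANY_TAG. */
#define SKETCH_ANY_SOURCE (-1)
#define SKETCH_ANY_TAG    (-1)

/* Hypothetical, simplified match header as written by the sender. */
typedef struct {
    int src;   /* rank of the sender */
    int tag;   /* tag chosen by the sender */
} sketch_match_hdr_t;

/* Hypothetical, simplified stand-in for the fields of MPI_Status. */
typedef struct {
    int source;
    int tag;
} sketch_status_t;

/*
 * Returns true if the header satisfies the posted receive. On a match,
 * the actual source and tag are copied out, which is what a wildcard
 * receive needs in order to report them to the user.
 * On a mismatch, *status is left untouched.
 */
static bool sketch_try_match(const sketch_match_hdr_t *hdr,
                             int want_src, int want_tag,
                             sketch_status_t *status)
{
    if (want_src != SKETCH_ANY_SOURCE && want_src != hdr->src) {
        return false;
    }
    if (want_tag != SKETCH_ANY_TAG && want_tag != hdr->tag) {
        return false;
    }
    status->source = hdr->src;
    status->tag = hdr->tag;
    return true;
}
```

Whether logic like this lives in the BTL or is reached through a callback into the PML is exactly the choice under discussion; the sketch only shows how little header knowledge the check itself requires.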
- you mention having the btl look at the match header (if I
- not clear to me what you mean by
returning the header to the list if the irecv does not complete. If it
does not complete, why not just pass the header back for further
processing, if all this is happening at the pml level?
I was trying to read the FIFO to see what's on there. If it's
something we can handle, we take it and handle it. If it's too
complicated, then we just balk and tell the upper layer that we're
declining any possible action. That just seemed to me to be the
Here's an analogy. Let's say you have a house problem. You don't know
how bad it is. You think you might have to hire an expensive
contractor to do lots of work, but some local handyman thinks he can do
it quickly and cheaply. So, you have the handyman look at the job to
decide how involved it is. Let's say the handyman discovers that it
is, indeed, a big job. How would you like things left at this point?
*) Handyman says this is a big job. Bring in the expensive contractor
and big equipment. I left everything as I found it. Or,
*) Handyman says, "I took apart the this-and-this and I bought a bunch
of supplies. I ripped out the south wall. The water to the house is
turned off. Etc." You (and whoever has to come in to actually do the
work) would probably prefer that nothing had been started.
I thought it was cleaner to go the "do the whole job or don't do any of
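The "handle it completely or leave everything as found" idea can be sketched as a peek-then-commit pattern: inspect the front of the FIFO without advancing it, and only advance once the fast path is sure it can finish the job. This is a single-process illustration under stated assumptions; the ring buffer, the fragment kinds, and every `sketch_*` name are hypothetical, and the real sm FIFO (shared memory, memory ordering) is not modeled.

```c
#include <stddef.h>

#define SKETCH_FIFO_SLOTS 8

/* Hypothetical fragment kinds: one the fast path understands, one it does not. */
typedef enum { SKETCH_FRAG_MATCH, SKETCH_FRAG_OTHER } sketch_kind_t;

typedef struct {
    sketch_kind_t kind;
    int tag;
} sketch_frag_t;

/* Hypothetical ring buffer; head and tail only ever grow. */
typedef struct {
    sketch_frag_t slots[SKETCH_FIFO_SLOTS];
    size_t head;   /* next slot to read */
    size_t tail;   /* next slot to write */
} sketch_fifo_t;

typedef enum { SKETCH_HANDLED, SKETCH_DECLINED } sketch_result_t;

/* Append a fragment; returns -1 if the FIFO is full, 0 otherwise. */
static int sketch_fifo_push(sketch_fifo_t *f, sketch_frag_t frag)
{
    if (f->tail - f->head == SKETCH_FIFO_SLOTS) {
        return -1;
    }
    f->slots[f->tail % SKETCH_FIFO_SLOTS] = frag;
    f->tail++;
    return 0;
}

/*
 * Fast-path receive: peek at the front fragment. If it is something
 * simple that matches, consume it and report SKETCH_HANDLED. In every
 * other case return SKETCH_DECLINED with the FIFO and *out_tag exactly
 * as they were, so the upper layer can take over from a clean state.
 */
static sketch_result_t sketch_fast_recv(sketch_fifo_t *f, int want_tag,
                                        int *out_tag)
{
    if (f->head == f->tail) {
        return SKETCH_DECLINED;              /* nothing to look at */
    }
    const sketch_frag_t *front = &f->slots[f->head % SKETCH_FIFO_SLOTS];
    if (front->kind != SKETCH_FRAG_MATCH || front->tag != want_tag) {
        return SKETCH_DECLINED;              /* too complicated: balk */
    }
    *out_tag = front->tag;                   /* commit only after all checks */
    f->head++;
    return SKETCH_HANDLED;
}
```

The design choice is that all checks happen before any state changes, so a decline never leaves partially consumed work behind.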
- The measurements seem to be very dual
process specific. Have you looked at the impact of these changes on
other applications at the same process count? "Real" apps would be
interesting, but even hpl would be a good start.
Many measurements are for np=2. There are also np>2 HPCC pingpong
results though. (HPCC pingpong measures pingpong between two processes
while np-2 processes sit in wait loops.) HPCC also measures "ring"
results... these are point-to-point with all np processes working.
HPL is pretty insensitive to point-to-point performance. It either
shows basically DGEMM performance or something is broken.
I haven't looked at "real" apps.
Let me be blunt about one thing: much of this is motivated by
simplistic (HPCC) benchmarks. This is for two reasons:
1) These benchmarks are highly visible.
2) It's hard to tune real apps when you know the primitives need work.
Looking at real apps is important and I'll try to get to that.
The current sm implementation is aimed only
at small smp node count, which was really the only relevant type of
systems when this code was written 5 years ago. For large core counts
there is a rather simple change that is easy to
implement and will give you flat scaling for the sort of tests you are
running. If you replace the FIFOs with a single linked list per process
in shared memory, with senders to this process adding match envelopes
atomically and each process reading its own linked list (multiple
writers and a single reader in the non-threaded situation), there will be only
one place to poll, regardless of the number of procs involved in the
run. One still needs other optimizations to lower the absolute latency,
perhaps what you have suggested. If one really has all N procs
trying to write to the same FIFO at once, performance will stink
because of contention, but most apps don't have that behaviour.
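One way the "single list per receiver, many atomic writers, one reader" idea might look is sketched below with C11 atomics. This is an assumption-laden illustration, not the proposed Open MPI change: the `sketch_*` names are hypothetical, the envelope carries only a source and tag, and raw pointers are used for brevity, whereas a real shared-memory segment mapped at different addresses in each process would need offsets instead. Senders push with a compare-and-swap; the single reader detaches the whole list in one exchange and reverses it to recover arrival order.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical match envelope, linked intrusively. */
typedef struct sketch_env {
    struct sketch_env *next;
    int src;
    int tag;
} sketch_env_t;

/* Hypothetical per-receiver inbox: the one place the receiver polls. */
typedef struct {
    _Atomic(sketch_env_t *) head;
} sketch_inbox_t;

static void sketch_inbox_init(sketch_inbox_t *q)
{
    atomic_init(&q->head, NULL);
}

/* Any sender may call this concurrently: lock-free push at the head. */
static void sketch_inbox_post(sketch_inbox_t *q, sketch_env_t *e)
{
    sketch_env_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        e->next = old;   /* on CAS failure, old is refreshed and we retry */
    } while (!atomic_compare_exchange_weak_explicit(
                 &q->head, &old, e,
                 memory_order_release, memory_order_relaxed));
}

/*
 * Only the owning receiver calls this. It detaches everything posted so
 * far with one atomic exchange, then reverses the newest-first chain so
 * the caller walks envelopes oldest first. Returns NULL if nothing was
 * posted.
 */
static sketch_env_t *sketch_inbox_drain(sketch_inbox_t *q)
{
    sketch_env_t *lifo =
        atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
    sketch_env_t *fifo = NULL;
    while (lifo != NULL) {
        sketch_env_t *next = lifo->next;
        lifo->next = fifo;
        fifo = lifo;
        lifo = next;
    }
    return fifo;
}
```

Because the reader only ever removes the entire list at once, the push loop is not exposed to the ABA problem that a CAS-based pop would have. Contention on `head` when many senders post at the same moment remains, as the text notes.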
Okay. Yes, I am a fan of that approach. But:
*) Doesn't strike me as a "simple" change.
*) Not sure this addresses all-to-all well. E.g., let's say you post a
receive for a particular source. Do you then wade through a long FIFO
to look for your match?
What the RFC talks about is not the last SM development we'll ever
need. It's only supposed to be one step forward from where we are
today. The "single queue per receiver" approach has many advantages,
but I think it's a different topic.