This all sounds really great to me. I agree with most of what has
been said -- e.g., benchmarks *are* important. Improving them can
even sometimes have the side effect of improving real applications. ;-)
My one big concern is moving an architectural boundary by making the
BTL understand MPI match headers. But even there, I'm torn:
1. I understand why it is better -- performance-wise -- to do this.
And the performance improvement results are hard to argue with. We
took a similar approach with ORTE; ORTE is now OMPI-specific, and
many, many things have become better (from the OMPI perspective, at
2. We all have the knee-jerk reaction that we don't want to have the
BTLs know anything about MPI semantics because they've always been
that way and it has been a useful abstraction barrier. Now there's
even a project afoot to move the BTLs out into a separate layer that
cannot know about MPI (so that other things can be built upon it).
But are we sacrificing potential MPI performance here? I think that's
one important question.
Eugene: you mentioned that there are alternatives to having the
BTL understand match headers, such as a callback into the PML. Have
you tried this approach to see what the performance cost would be,
I'd like to see George's reaction to this RFC, and Brian's (if he has
On Jan 20, 2009, at 8:04 PM, Eugene Loh wrote:
> Patrick Geoffray wrote:
>> Eugene Loh wrote:
>>>> replace the fifos with a single linked list per process in shared
>>>> memory, with senders to this process adding match envelopes
>>>> atomically, with each process reading its own linked list (multiple
>>> *) Doesn't strike me as a "simple" change.
>> Actually, it's much simpler than trying to optimize/scale the N^2
>> implementation, IMHO.
> 1) The version I talk about is already done. Check my putbacks.
> done" is easier! :^)
> 2) The two ideas are largely orthogonal. The RFC talks about a variety
> of things: cleaning up the sendi function, moving the sendi call up
> higher in the PML, bypassing the PML receive-request structure
> to sendi), and streamlining the data convertors in common cases. Only
> one part of the RFC (directed polling) overlaps with having a single
> FIFO per receiver.
>>> *) Not sure this addresses all-to-all well. E.g., let's say you post a
>>> receive for a particular source. Do you then wade through a long
>>> to look for your match?
>> The tradeoff is between demultiplexing by the sender, which costs in
>> and in space, or by the receiver, which costs an atomic inc. ANY_TAG
>> forces you to demultiplex on the receive side anyway. Regarding
>> all-to-all, it won't be more expensive if the receives are pre-
>> and they should be.
> Not sure I understand this paragraph. I do, however, think there are
> great benefits to the single-receiver-queue model. It implies
> on the receiver side in the many-to-one case, but if a single receiver
> is reading all those messages anyhow, message-processing is already
> going to throttle the message rate. The extra "bottleneck" at the FIFO
> might never be seen.
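
To make the single-queue-per-receiver idea concrete, here is a minimal
sketch, not what is in the sm BTL today and not Patrick's actual design.
It assumes C11 atomics and plain pointers inside one address space (a
real shared-memory version would need offsets instead of pointers), and
it uses a compare-and-swap push rather than the atomic increment
mentioned above. All names (envelope_t, recv_queue_t, queue_push,
queue_drain, match_source) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical match envelope; the fields are illustrative only. */
typedef struct envelope {
    struct envelope *next;
    int src;   /* sending rank */
    int tag;   /* MPI tag */
} envelope_t;

/* One queue per receiving process; every sender pushes onto it. */
typedef struct {
    _Atomic(envelope_t *) head;
} recv_queue_t;

static void queue_init(recv_queue_t *q)
{
    atomic_init(&q->head, (envelope_t *)NULL);
}

/* Sender side: lock-free push using a compare-and-swap loop. */
static void queue_push(recv_queue_t *q, envelope_t *e)
{
    assert(q != NULL && e != NULL);
    envelope_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        e->next = old;  /* on CAS failure, old is refreshed and we retry */
    } while (!atomic_compare_exchange_weak_explicit(
                 &q->head, &old, e,
                 memory_order_release, memory_order_relaxed));
}

/* Receiver side: detach everything in one atomic exchange, then
 * reverse the chain so envelopes come out in arrival order. */
static envelope_t *queue_drain(recv_queue_t *q)
{
    envelope_t *list = atomic_exchange_explicit(
        &q->head, (envelope_t *)NULL, memory_order_acquire);
    envelope_t *fifo = NULL;
    while (list != NULL) {
        envelope_t *next = list->next;
        list->next = fifo;
        fifo = list;
        list = next;
    }
    return fifo;
}

/* Receive-side demultiplexing: unlink and return the first drained
 * envelope from the given source, or NULL if there is none. */
static envelope_t *match_source(envelope_t **list, int src)
{
    for (envelope_t **pp = list; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->src == src) {
            envelope_t *hit = *pp;
            *pp = hit->next;
            hit->next = NULL;
            return hit;
        }
    }
    return NULL;
}
```

The linear walk in match_source is exactly the "wade through the list"
cost raised above for a receive posted on a particular source; the
sender-side alternative avoids that walk at the price of one FIFO per
sender/receiver pair.
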
>>> What the RFC talks about is not the last SM development we'll ever
>>> need. It's only supposed to be one step forward from where we are
>>> today. The "single queue per receiver" approach has many
>>> but I think it's a different topic.
>> But is this intermediate step worth it, or should we (well, you :-) )
>> go directly for the single-queue model?
> To recap:
> 1) The work is already done.
> 2) The single-queue model addresses only one of the RFC's issues.
> 3) I'm a fan of the single-queue model, but it's just a separate