On 1/20/09 8:53 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
> This all sounds really great to me. I agree with most of what has
> been said -- e.g., benchmarks *are* important. Improving them can
> even sometimes have the side effect of improving real applications. ;-)
> My one big concern is moving the architectural boundaries by making
> the BTL understand MPI match headers. But even there, I'm torn:
> 1. I understand why it is better -- performance-wise -- to do this.
> And the performance improvement results are hard to argue with. We
> took a similar approach with ORTE; ORTE is now OMPI-specific, and
> many, many things have become better (from the OMPI perspective, at
> 2. We all have the knee-jerk reaction that we don't want to have the
> BTLs know anything about MPI semantics because they've always been
> that way and it has been a useful abstraction barrier. Now there's
> even a project afoot to move the BTLs out into a separate layer that
> cannot know about MPI (so that other things can be built upon it).
> But are we sacrificing potential MPI performance here? I think that's
> one important question.
> Eugene: you mentioned that there are other possibilities to having the
> BTL understand match headers, such as a callback into the PML. Have
> you tried this approach to see what the performance cost would be,
How is this different from the way matching is done today?
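
For readers unfamiliar with the "callback into the PML" alternative mentioned above, here is a minimal sketch of the general pattern: the transport layer keeps a table of opaque callbacks and never inspects the payload, while the upper layer registers a function that parses the match header. Every name here (transport_t, transport_register, transport_deliver, match_hdr_t, upper_match_cb, MAX_CHANNELS) is hypothetical and chosen for illustration; this is not the Open MPI BTL or PML interface.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CHANNELS 8  /* hypothetical number of callback slots */

/* Callback type: the transport hands over raw bytes and an opaque context. */
typedef void (*recv_cb_t)(const void *frag, size_t len, void *ctx);

typedef struct {
    recv_cb_t cb;
    void *ctx;
} cb_entry_t;

/* Hypothetical transport layer: it knows channels, not MPI semantics. */
typedef struct {
    cb_entry_t table[MAX_CHANNELS];
} transport_t;

/* Upper layer registers interest in a channel. Returns 0 on success. */
static int transport_register(transport_t *t, unsigned chan,
                              recv_cb_t cb, void *ctx)
{
    if (chan >= MAX_CHANNELS) return -1;
    t->table[chan].cb = cb;
    t->table[chan].ctx = ctx;
    return 0;
}

/* Transport delivers a fragment by calling back up; one indirect call
 * per fragment is the cost this abstraction adds. */
static int transport_deliver(const transport_t *t, unsigned chan,
                             const void *frag, size_t len)
{
    if (chan >= MAX_CHANNELS || t->table[chan].cb == NULL) return -1;
    t->table[chan].cb(frag, len, t->table[chan].ctx);
    return 0;
}

/* Hypothetical match header, understood only by the upper layer. */
typedef struct {
    int src;
    int tag;
} match_hdr_t;

typedef struct {
    int fragments_seen;
    int last_src;
    int last_tag;
} upper_state_t;

/* Upper-layer callback: parse the header and record what arrived. */
static void upper_match_cb(const void *frag, size_t len, void *ctx)
{
    upper_state_t *st = ctx;
    match_hdr_t hdr;
    if (len < sizeof hdr) return;      /* too short to hold a header */
    memcpy(&hdr, frag, sizeof hdr);    /* avoid alignment assumptions */
    st->fragments_seen++;
    st->last_src = hdr.src;
    st->last_tag = hdr.tag;
}
```

The trade-off being debated is whether that indirect call, plus the request bookkeeping behind it, is cheap enough to keep the layering intact.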
> I'd like to see George's reaction to this RFC, and Brian's (if he has
> On Jan 20, 2009, at 8:04 PM, Eugene Loh wrote:
>> Patrick Geoffray wrote:
>>> Eugene Loh wrote:
>>>>> replace the fifo's with a single link list per process in shared
>>>>> memory, with senders to this process adding match envelopes
>>>>> atomically, with each process reading its own link list (multiple
>>>> *) Doesn't strike me as a "simple" change.
>>> Actually, it's much simpler than trying to optimize/scale the N^2
>>> implementation, IMHO.
>> 1) The version I talk about is already done. Check my putbacks.
>> done" is easier! :^)
>> 2) The two ideas are largely orthogonal. The RFC talks about a variety
>> of things: cleaning up the sendi function, moving the sendi call up
>> higher in the PML, bypassing the PML receive-request structure
>> to sendi), and streamlining the data convertors in common cases. Only
>> one part of the RFC (directed polling) overlaps with having a single
>> FIFO per receiver.
>>>> *) Not sure this addresses all-to-all well. E.g., let's say you
>>>> post a
>>>> receive for a particular source. Do you then wade through a long
>>>> to look for your match?
>>> The tradeoff is between demultiplexing by the sender, which cost in
>>> and in space, or by the receiver, which cost an atomic inc. ANY_TAG
>>> forces you to demultiplex on the receive side anyway. Regarding
>>> all-to-all, it won't be more expensive if the receives are pre-
>>> and they should be.
>> Not sure I understand this paragraph. I do, however, think there are
>> great benefits to the single-receiver-queue model. It implies
>> on the receiver side in the many-to-one case, but if a single receiver
>> is reading all those messages anyhow, message-processing is already
>> going to throttle the message rate. The extra "bottleneck" at the FIFO
>> might never be seen.
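
To make the single-receiver-queue idea discussed above concrete, here is a rough sketch using C11 atomics. Senders push match envelopes onto one list per receiver, and the receiver detaches the whole list and demultiplexes it by source and tag. Note the assumptions: the push uses a compare-and-swap loop rather than the atomic increment mentioned in the thread; all names (envelope_t, recv_queue_t, recv_queue_push, recv_queue_drain, find_match, ANY_SRC, ANY_TAG_WILD) are hypothetical; and a real cross-process shared-memory version would likely need offsets instead of raw pointers, since segments can be mapped at different addresses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define ANY_SRC      (-1)  /* hypothetical wildcard, stands in for any source */
#define ANY_TAG_WILD (-1)  /* hypothetical wildcard, stands in for any tag */

/* Hypothetical match envelope; fields are illustrative only. */
typedef struct envelope {
    struct envelope *next;
    int src;  /* sending rank */
    int tag;  /* message tag */
    int seq;  /* per-sender sequence number */
} envelope_t;

/* One queue per receiving process. */
typedef struct {
    _Atomic(envelope_t *) head;
} recv_queue_t;

static void recv_queue_init(recv_queue_t *q)
{
    atomic_init(&q->head, NULL);
}

/* Sender side: any number of senders may push concurrently. */
static void recv_queue_push(recv_queue_t *q, envelope_t *e)
{
    assert(q != NULL && e != NULL);
    envelope_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        e->next = old;  /* on CAS failure, old is refreshed and we retry */
    } while (!atomic_compare_exchange_weak_explicit(
                 &q->head, &old, e,
                 memory_order_release, memory_order_relaxed));
}

/* Receiver side: detach everything in one atomic exchange, then reverse
 * the newest-first list so envelopes come out in arrival order. */
static envelope_t *recv_queue_drain(recv_queue_t *q)
{
    envelope_t *lifo =
        atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
    envelope_t *fifo = NULL;
    while (lifo != NULL) {
        envelope_t *next = lifo->next;
        lifo->next = fifo;
        fifo = lifo;
        lifo = next;
    }
    return fifo;
}

/* Receiver-side demultiplexing: first envelope matching (src, tag),
 * honoring the wildcards. This linear walk is the "wade through a long
 * list" cost raised in the thread. */
static envelope_t *find_match(envelope_t *list, int src, int tag)
{
    for (envelope_t *e = list; e != NULL; e = e->next) {
        int src_ok = (src == ANY_SRC) || (e->src == src);
        int tag_ok = (tag == ANY_TAG_WILD) || (e->tag == tag);
        if (src_ok && tag_ok) return e;
    }
    return NULL;
}
```

The receiver pays one atomic exchange per drain no matter how many senders there are, which is the scaling argument for this model; the search cost moves into find_match instead.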
>>>> What the RFC talks about is not the last SM development we'll ever
>>>> need. It's only supposed to be one step forward from where we are
>>>> today. The "single queue per receiver" approach has many
>>>> but I think it's a different topic.
>>> But is this intermediate step worth it or should we (well,
>>> you :-) ) go
>>> directly for the single queue model ?
>> To recap:
>> 1) The work is already done.
>> 2) The single-queue model addresses only one of the RFC's issues.
>> 3) I'm a fan of the single-queue model, but it's just a separate