Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] calling sendi earlier in the PML
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-03-03 15:53:53


Terry Dontje wrote:

> Eugene Loh wrote:
>
>> I'm on the verge of giving up moving the sendi call in the PML. I
>> will try one or two last things, including this e-mail asking for
>> feedback.
>>
>> The idea is that when a BTL goes over a very low-latency interconnect
>> (like sm), we really want to shave off whatever we can from the
>> software stack. One way of doing so is to use a "send-immediate"
>> function, which a few BTLs (like sm) provide. The problem is
>> avoiding a bunch of overhead introduced by the PML before checking
>> for a "sendi()" call.
>>
>> Currently, the PML does something like this:
>>
>> for ( btl = ... ) {
>> if ( SUCCESS == btl->sendi() ) return SUCCESS;
>> if ( SUCCESS == btl->send() ) return SUCCESS;
>> }
>> return ERROR;
>>
>> That is, it round-robins over all available BTLs, for each one trying
>> sendi() and then send(). If ever a sendi or send completes
>> successfully, we exit the loop successfully.
>>
>> The problem is that this loop is buried several function calls deep in
>> the PML. Before it reaches this far, the PML has initialized a large
>> "send request" data structure while traversing some (to me)
>> complicated call graph of functions. This introduces a lot of
>> overhead that mitigates much of the speedup we might hope to see with
>> the sendi function. That overhead is unnecessary for a sendi call,
>> but necessary for a send call. I've tried reorganizing the code to
>> defer as much of that work as possible -- performing that overhead
>> only if it's needed for a send call -- but I've gotten a
>> brain cramp every time I've tried this reorganization.
>>
>> I think these are the options:
>>
>> Option A) Punt!
>>
>> Option B) Have someone more familiar with the PML make these changes.
>>
>> Option C) Have Eugene keep working at this because he'll learn more
>> about the PML and it's good for his character.
>>
>> Option D) Go to a strategy in which all BTLs are tried for sendi
>> before any of them is tried for a send. The code would look like this:
>>
>> for ( btl = ... ) if ( SUCCESS == btl->sendi() ) return SUCCESS;
>> for ( btl = ... ) if ( SUCCESS == btl->send() ) return SUCCESS;
>> return ERROR;
>>
>> The reason this is so much easier to achieve is that we can put that
>> first loop way up high in the PML (as soon as a send enters the PML,
>> avoiding all that expensive overhead) and leave the second loop
>> several layers down, where it is today. George is against this new
>> loop structure because he thinks round robin selection of BTLs is
>> most fair and distributes the load over BTLs as evenly as possible.
>> (In contrast, the proposed loop would favor BTLs with sendi
>> functions.) It seems to me, however, that favoring BTLs that have
>> sendi functions is exactly the right thing to do! I'm not even
>> convinced that the conditions he's worried about are that common:
>> multiple eager BTLs to poll, one has a sendi, and that sendi is not
>> very good or that BTL is getting overloaded.
>>
> I guess I agree with Eugene's points above. Since we are dealing
> mainly with latency-bound messages and not bandwidth, spreading the
> messages among BTLs really shouldn't provide much/any advantage.

I think that's right, but to be fair to George, I think his point is
that even short messages can congest a BTL.

> Maybe there is a range of sizes that could provide more bandwidth with
> striped IB or RNIC connections. But with the OpenIB multi-frags, is
> there a way to section out that message-size range so that it wouldn't
> be considered for sendi?

I'm not sure I understand the question. A message longer than the eager
size automatically does not qualify for sendi.

Also, the existence of a sendi path has to do with the BTL component,
not with a particular NIC or something. Not sure if that's relevant or not.
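
Concretely, the eligibility test I have in mind is something like the
sketch below. The type and field names (mca_btl_base_module_t,
btl_sendi, btl_eager_limit) are my approximation of the BTL interface,
so take them as illustrative rather than exact:

  #include <stdbool.h>
  #include <stddef.h>
  #include "ompi/mca/btl/btl.h"   /* BTL module struct (path approximate) */

  /* Sketch only: a message is a sendi candidate iff the module
   * provides a sendi function and the whole message fits within
   * the module's eager limit. */
  static inline bool can_try_sendi(mca_btl_base_module_t *btl,
                                   size_t bytes)
  {
      return (NULL != btl->btl_sendi) && (bytes <= btl->btl_eager_limit);
  }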

> So let's say we are still inclined to spread fastpath messages over
> BTLs evenly. Maybe one modification to the above is to check whether
> the connection we are writing to has only one BTL, and to try the
> btl_sendi for that case higher in the stack. This would help with the
> SM BTL, but striped OpenIB connections certainly would not gain. I
> don't believe it would matter either way for other BTLs like TCP.

One can special-case sm. E.g., if there is only one BTL, try sendi
early. Or, try sendi (early) only for the next BTL in line... if it has
no sendi, then dive down into the rest of the code.
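
Roughly, that special case might look like the sketch below at the top
of the PML send path, before any send-request setup. The names are
approximate: try_sendi() is a made-up stand-in for invoking the BTL's
sendi with the usual send arguments, and the bml array accessors may
not match the real API exactly.

  /* Sketch only: if the endpoint has exactly one eager BTL and that
   * BTL provides a sendi function, attempt the fast path before
   * building a send request. */
  if ( 1 == mca_bml_base_btl_array_get_size(&endpoint->btl_eager) ) {
      mca_bml_base_btl_t *bml_btl =
          mca_bml_base_btl_array_get_next(&endpoint->btl_eager);
      if ( NULL != bml_btl->btl->btl_sendi &&
           OMPI_SUCCESS == try_sendi(bml_btl, buf, count, dst, tag) ) {
          return OMPI_SUCCESS;   /* short message went out immediately */
      }
  }
  /* otherwise fall through to the existing request-based send path */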

I'm still not sure about the striped-openib point you're making. The
following may or may not make sense depending on how ridiculously off
base my understanding or nomenclature is. Let's start with an example
Jeff brought up recently:

Jeff Squyres wrote:

> Example: if I have a dual-port IB HCA, Open MPI will make 2 different
> openib BTL modules. In this case, the openib BTL will need to know
> exactly which module the PML is trying to sendi on.

So, here there are two modules the PML could send on. They're both
openib modules. So, we either define an openib sendi function (in which
case short messages will be distributed equally over both connections)
or we don't (in which case short messages will still be distributed
equally over both connections!).

The "problem" is only if you have two different BTL components. E.g.,
one connection is mx and two connections are tcp. If mx defines a
sendi, then all short messages go over mx and none over tcp. Sounds good
in this case, but presumably that's due to my loaded example.