Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] calling sendi earlier in the PML
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-03-04 06:25:38


I didn't see the exchange between you and Jeff at the end of this email. It
basically nullifies my half-baked concern.

thanks,

--td

Eugene Loh wrote:
> Terry Dontje wrote:
>
>> Eugene Loh wrote:
>>
>>> I'm on the verge of giving up on moving the sendi call in the PML. I
>>> will try one or two last things, including this e-mail asking for
>>> feedback.
>>>
>>> The idea is that when a BTL goes over a very low-latency
>>> interconnect (like sm), we really want to shave off whatever we can
>>> from the software stack. One way of doing so is to use a
>>> "send-immediate" function, which a few BTLs (like sm) provide. The
>>> problem is avoiding a bunch of overhead introduced by the PML before
>>> checking for a "sendi()" call.
>>>
>>> Currently, the PML does something like this:
>>>
>>> for ( btl = ... ) {
>>>     if ( SUCCESS == btl->sendi() ) return SUCCESS;
>>>     if ( SUCCESS == btl->send() ) return SUCCESS;
>>> }
>>> return ERROR;
>>>
>>> That is, it round-robins over all available BTLs, for each one trying
>>> sendi() and then send(). If ever a sendi or send completes
>>> successfully, we exit the loop successfully.
>>>
>>> The problem is that this loop is buried several function calls deep
>>> in the PML. Before it reaches this far, the PML has initialized a
>>> large "send request" data structure while traversing some (to me)
>>> complicated call graph of functions. This introduces a lot of
>>> overhead that eats up much of the speedup we might hope to see
>>> with the sendi function. That overhead is unnecessary for a sendi
>>> call, but necessary for a send call. I've tried reorganizing the
>>> code to defer as much of that work as possible -- performing that
>>> overhead only if it's needed for a send call -- but I've gotten
>>> brain cramp every time I've tried this reorganization.
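>>>
>>> (Purely as illustration, a minimal sketch of the reorganization I
>>> keep attempting -- the helper names below are hypothetical, not the
>>> real PML entry points or request structures:)
>>>
>>> /* Cheap path: no send request has been built yet. */
>>> for ( btl = ... ) {
>>>     if ( NULL != btl->sendi && SUCCESS == btl->sendi() )
>>>         return SUCCESS;
>>> }
>>> /* Expensive path: only now build the big send-request structure
>>>  * and walk the usual call graph down to btl->send(). */
>>> req = alloc_and_init_send_request(...);   /* hypothetical helper */
>>> return start_standard_send(req);          /* hypothetical helper */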
>>>
>>> I think these are the options:
>>>
>>> Option A) Punt!
>>>
>>> Option B) Have someone more familiar with the PML make these changes.
>>>
>>> Option C) Have Eugene keep working at this because he'll learn more
>>> about the PML and it's good for his character.
>>>
>>> Option D) Go to a strategy in which all BTLs are tried for sendi
>>> before any of them is tried for a send. The code would look like this:
>>>
>>> for ( BTL = ... ) if ( SUCCESS == btl_sendi() ) return SUCCESS;
>>> for ( BTL = ... ) if ( SUCCESS == btl_send() ) return SUCCESS;
>>> return ERROR;
>>>
>>> The reason this is so much easier to achieve is that we can put that
>>> first loop way up high in the PML (as soon as a send enters the PML,
>>> avoiding all that expensive overhead) and leave the second loop
>>> several layers down, where it is today. George is against this new
>>> loop structure because he thinks round-robin selection of BTLs is
>>> most fair and distributes the load over BTLs as evenly as possible.
>>> (In contrast, the proposed loop would favor BTLs with sendi
>>> functions.) It seems to me, however, that favoring BTLs that have
>>> sendi functions is exactly the right thing to do! I'm not even
>>> convinced that the conditions he's worried about are that common:
>>> multiple eager BTLs to poll, one has a sendi, and that sendi is not
>>> very good or that BTL is getting overloaded.
>>>
>> I guess I agree with Eugene's points above. Since we are dealing
>> mainly with latency-bound messages and not bandwidth, spreading the
>> messages among BTLs really shouldn't provide much/any advantage.
>
> I think that's right, but to be fair to George, I think his point is
> that even short messages can congest a BTL.
>
>> Maybe there is a range of sizes that could provide more bandwidth
>> with striped IB or RNIC connections. But with OpenIB multi-frags,
>> is there a way to carve out that message-size range so that it
>> wouldn't be considered for sendi?
>
> I'm not sure I understand the question. A message longer than the
> eager size automatically does not qualify for sendi.
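>
> (Roughly, the qualifying test amounts to something like the sketch
> below. I'm writing the BTL field names from memory, so treat them as
> approximate rather than verbatim from the source:)
>
> /* A message is a sendi candidate only if this BTL provides a sendi
>  * function and the whole message fits within its eager limit. */
> if ( NULL != btl->btl_sendi && msg_size <= btl->btl_eager_limit ) {
>     /* eligible for the send-immediate fast path */
> } else {
>     /* too long (or no sendi): fall through to the normal send path */
> }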
>
> Also, the existence of a sendi path has to do with the BTL component,
> not with a particular NIC or something. Not sure if that's relevant
> or not.
>
>> So let's say we are still inclined to spread fast-path messages across
>> BTLs evenly. Maybe one modification to the above is to check whether
>> the connection we are writing to has only one BTL and, for that case,
>> try btl_sendi higher in the stack. This would help the SM BTL, but
>> striped OpenIB connections certainly would not gain. I don't believe
>> it would matter either way for other BTLs like TCP.
>
> One can special-case sm. E.g., if there is only one BTL, try sendi
> early. Or, try sendi (early) only for the next BTL... if none, then
> dive down into the rest of the code.
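>
> (Something like this near the top of the PML send path -- again just a
> sketch, with hypothetical helper names rather than actual PML code:)
>
> /* If the endpoint has exactly one eager BTL, try its sendi right
>  * away; otherwise drop into the existing code path unchanged. */
> if ( 1 == num_eager_btls(endpoint) ) {            /* hypothetical */
>     btl = eager_btl(endpoint, 0);                 /* hypothetical */
>     if ( NULL != btl->btl_sendi &&
>          SUCCESS == btl->btl_sendi(...) )
>         return SUCCESS;
> }
> /* fall through: round-robin over the BTLs deeper in the PML */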
>
> I'm still not sure about the striped-openib point you're making. The
> following may or may not make sense depending on how ridiculously off
> base my understanding or nomenclature is. Let's start with an example
> Jeff brought up recently:
>
> Jeff Squyres wrote:
>
>> Example: if I have a dual-port IB HCA, Open MPI will make 2
>> different openib BTL modules. In this case, the openib BTL will
>> need to know exactly which module the PML is trying to sendi on.
>
> So, here there are two modules the PML could send on. They're both
> openib modules. So, we either define an openib sendi function (in
> which case short messages will be distributed equally over both
> connections) or we don't (in which case short messages will still be
> distributed equally over both connections!).
>
> The "problem" is only if you have two different BTL components. E.g.,
> one connection is mx and two connections are tcp. If mx defines a
> sendi, then all short messages go over mx and none over tcp. Sounds good
> in this case, but presumably that's due to my loaded example.
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel