Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] calling sendi earlier in the PML
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2009-03-03 06:26:08


Eugene Loh wrote:
> I'm on the verge of giving up moving the sendi call in the PML. I
> will try one or two last things, including this e-mail asking for
> feedback.
>
> The idea is that when a BTL goes over a very low-latency interconnect
> (like sm), we really want to shave off whatever we can from the
> software stack. One way of doing so is to use a "send-immediate"
> function, which a few BTLs (like sm) provide. The problem is avoiding
> a bunch of overhead introduced by the PML before checking for a
> "sendi()" call.
>
> Currently, the PML does something like this:
>
> for ( btl = ... ) {
> if ( SUCCESS == btl->sendi() ) return SUCCESS;
> if ( SUCCESS == btl->send() ) return SUCCESS;
> }
> return ERROR;
>
> That is, it round-robins over all available BTLs, trying sendi() and
> then send() on each one. As soon as a sendi or send completes
> successfully, we exit the loop successfully.
>
> The problem is that this loop is buried several function calls deep in
> the PML. Before it reaches this far, the PML has initialized a large
> "send request" data structure while traversing some (to me)
> complicated call graph of functions. This introduces a lot of
> overhead that cancels out much of the speedup we might hope to see
> with the sendi function. That overhead is unnecessary for a sendi
> call, but necessary for a send call. I've tried reorganizing the code
> to defer as much of that work as possible -- performing it only if
> it's needed for a send call -- but I've gotten a brain cramp every
> time I've tried this reorganization.
>
> I think these are the options:
>
> Option A) Punt!
>
> Option B) Have someone more familiar with the PML make these changes.
>
> Option C) Have Eugene keep working at this because he'll learn more
> about the PML and it's good for his character.
>
> Option D) Go to a strategy in which all BTLs are tried for sendi
> before any of them is tried for a send. The code would look like this:
>
> for ( btl = ... ) if ( SUCCESS == btl_sendi() ) return SUCCESS;
> for ( btl = ... ) if ( SUCCESS == btl_send() ) return SUCCESS;
> return ERROR;
>
> The reason this is so much easier to achieve is that we can put that
> first loop way up high in the PML (as soon as a send enters the PML,
> avoiding all that expensive overhead) and leave the second loop
> several layers down, where it is today. George is against this new
> loop structure because he thinks round-robin selection of BTLs is the
> fairest and distributes the load over BTLs as evenly as possible. (In
> contrast, the proposed loop would favor BTLs with sendi functions.)
> It seems to me, however, that favoring BTLs that have sendi functions
> is exactly the right thing to do! I'm not even convinced that the
> conditions he's worried about are that common: multiple eager BTLs to
> poll, one has a sendi, and that sendi is not very good or that BTL is
> getting overloaded.
>
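For reference, the two loop orders quoted above can be sketched in plain C as follows. This is only an illustration and not the PML code: the `demo_` type, stubs, and functions are hypothetical stand-ins, each function returns the index of the BTL that accepted the message, and `start` models the rotating round-robin position.

```c
#include <stddef.h>

enum { DEMO_SUCCESS = 0, DEMO_ERROR = -1 };

/* Hypothetical stand-in for a BTL; sendi may be NULL if not provided. */
typedef struct demo_btl {
    int (*sendi)(void);
    int (*send)(void);
} demo_btl_t;

/* Stub transfer function used only for illustration. */
int demo_ok(void) { return DEMO_SUCCESS; }

/* Current structure: for each BTL in turn, try sendi and then send.
 * Returns the index of the BTL that succeeded, or -1. */
int demo_round_robin(const demo_btl_t *btls, size_t n, size_t start)
{
    for (size_t k = 0; k < n; k++) {
        size_t i = (start + k) % n;
        if (btls[i].sendi && DEMO_SUCCESS == btls[i].sendi()) return (int)i;
        if (btls[i].send  && DEMO_SUCCESS == btls[i].send())  return (int)i;
    }
    return -1;
}

/* Option D: try sendi on every BTL before trying send on any of them.
 * Returns the index of the BTL that succeeded, or -1. */
int demo_sendi_first(const demo_btl_t *btls, size_t n, size_t start)
{
    for (size_t k = 0; k < n; k++) {
        size_t i = (start + k) % n;
        if (btls[i].sendi && DEMO_SUCCESS == btls[i].sendi()) return (int)i;
    }
    for (size_t k = 0; k < n; k++) {
        size_t i = (start + k) % n;
        if (btls[i].send && DEMO_SUCCESS == btls[i].send()) return (int)i;
    }
    return -1;
}
```

With two BTLs where only the second provides sendi, and the round-robin position on the first, `demo_round_robin` picks the first BTL (through its send) while `demo_sendi_first` picks the second. That difference is the bias toward sendi-capable BTLs that George objects to.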
I guess I agree with Eugene's points above. Since we are dealing mainly
with latency-bound messages rather than bandwidth-bound ones, spreading
the messages among BTLs really shouldn't provide much, if any, advantage.
Maybe there is a range of sizes that could get more bandwidth from
striped IB or RNIC connections. But with the OpenIB multi-frags, is there
a way to section out that message size so that it wouldn't be considered
for sendi?

So let's say we are still inclined to write fastpath messages to BTLs
evenly. Maybe one modification to the above is to check whether the
connection we are writing to has only one BTL and, in that case, try the
btl_sendi higher in the stack. This would help with the SM BTL, but
striped OpenIB connections certainly would not gain. I don't believe it
would matter either way for other BTLs like TCP.
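
A minimal sketch of that single-BTL check, assuming hypothetical stand-in types (the `sketch_` names below are not the real PML or BTL structures, and the heavyweight path is reduced to the quoted loop):

```c
#include <stddef.h>

enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* Hypothetical stand-in for a BTL module. */
typedef struct sketch_btl {
    int (*sendi)(const void *buf, size_t len);  /* may be NULL */
    int (*send)(const void *buf, size_t len);
} sketch_btl_t;

/* Hypothetical stand-in for the set of eager BTLs that reach one peer. */
typedef struct sketch_endpoint {
    sketch_btl_t *btls;
    size_t        num_btls;
} sketch_endpoint_t;

/* Counts entries into the heavyweight path, for illustration only. */
int sketch_heavy_calls = 0;

/* Stub transfer function that always succeeds. */
int sketch_stub_ok(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return SKETCH_SUCCESS;
}

/* Placeholder for the existing path: the send request would be built
 * here before looping over the BTLs as in the quoted pseudocode. */
int sketch_heavy_send(sketch_endpoint_t *ep, const void *buf, size_t len)
{
    sketch_heavy_calls++;
    for (size_t i = 0; i < ep->num_btls; i++) {
        sketch_btl_t *btl = &ep->btls[i];
        if (NULL != btl->sendi && SKETCH_SUCCESS == btl->sendi(buf, len))
            return SKETCH_SUCCESS;
        if (NULL != btl->send && SKETCH_SUCCESS == btl->send(buf, len))
            return SKETCH_SUCCESS;
    }
    return SKETCH_ERROR;
}

/* Entry point: with exactly one BTL there is nothing to balance, so
 * sendi is tried before any send-request setup. */
int sketch_pml_send(sketch_endpoint_t *ep, const void *buf, size_t len)
{
    if (1 == ep->num_btls && NULL != ep->btls[0].sendi) {
        if (SKETCH_SUCCESS == ep->btls[0].sendi(buf, len))
            return SKETCH_SUCCESS;
    }
    return sketch_heavy_send(ep, buf, len);
}
```

With more than one BTL the early check is skipped, so the round-robin order is left exactly as it is today.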

--td
> Anyhow, I like Option D, but George does not.
>
> Option E) Go to a strategy in which the next BTL is tested for a sendi
> function. If there is one, use it. If not, just continue with the
> usual heavyweight PML procedure. This feels a little hackish to me,
> but it'll mean that most of the time that sendi can be called, the
> heavyweight PML overhead will be avoided, while at the same time
> "fair" round-robin polling over the BTLs is maintained.
>
> I'll proceed with Option C for the time being. If I don't announce
> success or surrender in the next few days, please write to me at the
> insane asylum.