
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] calling sendi earlier in the PML
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-03-03 13:09:16

How about a compromise...

Keep a separate list somewhere of the sendi-enabled BTLs (this avoids
looping over all the BTLs and testing -- you can just loop over the
BTLs that you *know* have a sendi). Put that at the top of the PML
and avoid the costly overhead, yadda yadda yadda.

But instead of having a static list of sendi-enabled BTLs, rotate them
if there's >1. For example, say I have 3 sendi-enabled BTL modules:
A, B, C.

In the first send, A->sendi() is used and it succeeds, so we shuffle
the list and return.
In the next send, B->sendi() is used and it succeeds, so we shuffle
the list and return.
In the next send, C->sendi() is used but it fails, so we shuffle the
list and fall through to normal ->send() processing.

"shuffle the list" can be as simple as opal_list_remove_first() and
opal_list_append() -- both of which should be O(1).

This should distribute the load across sendi-enabled BTLs, and if
those ever get "overloaded" (such that sendi fails), we fall through
to normal load-balanced PML sending.
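
A minimal sketch of the rotation described above, in plain C. Everything
here is a hypothetical stand-in: sketch_btl_t, sendi_list_t, the helper
functions, and the two stub sendi callbacks are not Open MPI types or
APIs, and the real sendi entry point takes more arguments than shown.
The two list helpers only mimic what opal_list_remove_first() and
opal_list_append() would do on a real opal_list_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a sendi-enabled BTL module. */
typedef struct sketch_btl {
    const char *name;
    bool (*sendi)(struct sketch_btl *self, const void *buf, size_t len);
    struct sketch_btl *next;
} sketch_btl_t;

/* Singly linked list with head and tail pointers, so that both
 * remove-first and append are O(1). */
typedef struct {
    sketch_btl_t *head;
    sketch_btl_t *tail;
} sendi_list_t;

static void sendi_list_append(sendi_list_t *l, sketch_btl_t *b)
{
    b->next = NULL;
    if (l->tail != NULL) {
        l->tail->next = b;
    } else {
        l->head = b;
    }
    l->tail = b;
}

static sketch_btl_t *sendi_list_remove_first(sendi_list_t *l)
{
    sketch_btl_t *b = l->head;
    if (b == NULL) {
        return NULL;
    }
    l->head = b->next;
    if (l->head == NULL) {
        l->tail = NULL;
    }
    b->next = NULL;
    return b;
}

/* Try sendi on the BTL at the front of the list only, and rotate the
 * list whether or not it succeeds. A false return means the caller
 * should fall through to the normal ->send() path. */
static bool try_sendi_rotating(sendi_list_t *l, const void *buf, size_t len)
{
    sketch_btl_t *b = sendi_list_remove_first(l);
    if (b == NULL) {
        return false;               /* no sendi-enabled BTLs at all */
    }
    sendi_list_append(l, b);        /* rotate: front goes to the back */
    assert(b->sendi != NULL);       /* only sendi-enabled BTLs are listed */
    return b->sendi(b, buf, len);
}

/* Stub callbacks for experimenting: one always succeeds, one always fails. */
static bool sendi_ok(sketch_btl_t *self, const void *buf, size_t len)
{
    (void)self; (void)buf; (void)len;
    return true;
}

static bool sendi_fail(sketch_btl_t *self, const void *buf, size_t len)
{
    (void)self; (void)buf; (void)len;
    return false;
}
```

With A and B using sendi_ok and C using sendi_fail, three consecutive
calls to try_sendi_rotating() reproduce the A, B, C example: two
successes, then one failure that hands off to the regular send path,
with the list back in its original order afterwards.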


On Mar 2, 2009, at 1:37 PM, Eugene Loh wrote:

> I'm on the verge of giving up moving the sendi call in the PML. I
> will try one or two last things, including this e-mail asking for
> feedback.
> The idea is that when a BTL goes over a very low-latency
> interconnect (like sm), we really want to shave off whatever we can
> from the software stack. One way of doing so is to use a "send-
> immediate" function, which a few BTLs (like sm) provide. The
> problem is avoiding a bunch of overhead introduced by the PML before
> checking for a "sendi()" call.
> Currently, the PML does something like this:
> for ( btl = ... ) {
>     if ( SUCCESS == btl->sendi() ) return SUCCESS;
>     if ( SUCCESS == btl->send() ) return SUCCESS;
> }
> return ERROR;
> That is, it round-robins over all available BTLs, for each one trying
> sendi() and then send(). If ever a sendi or send completes
> successfully, we exit the loop successfully.
> The problem is that this loop is buried several function calls deep
> in the PML. Before it reaches this far, the PML has initialized a
> large "send request" data structure while traversing some (to me)
> complicated call graph of functions. This introduces a lot of
> overhead that mitigates much of the speedup we might hope to see
> with the sendi function. That overhead is unnecessary for a sendi
> call, but necessary for a send call. I've tried reorganizing the
> code to defer as much of that work as possible -- performing that
> overhead only if it's needed to perform a send call -- but I've gotten
> braincramp every time I've tried this reorganization.
> I think these are the options:
> Option A) Punt!
> Option B) Have someone more familiar with the PML make these changes.
> Option C) Have Eugene keep working at this because he'll learn more
> about the PML and it's good for his character.
> Option D) Go to a strategy in which all BTLs are tried for sendi
> before any of them is tried for a send. The code would look like
> this:
> for ( btl = ... ) if ( SUCCESS == btl->sendi() ) return SUCCESS;
> for ( btl = ... ) if ( SUCCESS == btl->send() ) return SUCCESS;
> return ERROR;
> The reason this is so much easier to achieve is that we can put that
> first loop way up high in the PML (as soon as a send enters the PML,
> avoiding all that expensive overhead) and leave the second loop
> several layers down, where it is today. George is against this new
> loop structure because he thinks round robin selection of BTLs is
> most fair and distributes the load over BTLs as evenly as possible.
> (In contrast, the proposed loop would favor BTLs with sendi
> functions.) It seems to me, however, that favoring BTLs that have
> sendi functions is exactly the right thing to do! I'm not even
> convinced that the conditions he's worried about are that common:
> multiple eager BTLs to poll, one has a sendi, and that sendi is not
> very good or that BTL is getting overloaded.
> Anyhow, I like Option D, but George does not.
> Option E) Go to a strategy in which the next BTL is tested for a
> sendi function. If there is one, use it. If not, just continue
> with the usual heavyweight PML procedure. This feels a little
> hackish to me, but it'll mean that most of the time that sendi can
> be called, the heavyweight PML overhead will be avoided, while at
> the same time "fair" round-robin polling over the BTLs is maintained.
> I'll proceed with Option C for the time being. If I don't announce
> success or surrender in the next few days, please write to me at the
> insane asylum.

Jeff Squyres
Cisco Systems