Open MPI Development Mailing List Archives

Subject: [OMPI devel] calling sendi earlier in the PML
From: Eugene Loh (Eugene.Loh_at_[hidden])
Date: 2009-03-02 13:37:08

I'm on the verge of giving up moving the sendi call in the PML. I will
try one or two last things, including this e-mail asking for feedback.

The idea is that when a BTL goes over a very low-latency interconnect
(like sm), we really want to shave off whatever we can from the software
stack. One way of doing so is to use a "send-immediate" function, which
a few BTLs (like sm) provide. The problem is avoiding a bunch of
overhead introduced by the PML before checking for a "sendi()" call.

Currently, the PML does something like this:

    for ( btl = ... ) {
        if ( SUCCESS == btl->sendi() ) return SUCCESS;
        if ( SUCCESS == btl->send() ) return SUCCESS;
    }
    return ERROR;

That is, it round-robins over all available BTLs, trying sendi() and
then send() on each one. As soon as a sendi or send succeeds, we exit
the loop and return success.
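
To make the control flow concrete, here is a small self-contained toy
model of that loop. It is only a sketch: the toy_* names, the stub
transports, the NULL check for a missing sendi, and the rotating start
index are all assumptions made for illustration, not the real PML or
BTL interfaces.

```c
#include <stddef.h>

/* Toy stand-ins, NOT the real Open MPI BTL interface. */
enum { TOY_SUCCESS = 0, TOY_ERROR = -1 };

typedef int (*toy_send_fn)(const void *buf, size_t len);

typedef struct {
    toy_send_fn sendi;  /* assumed NULL when the BTL has no send-immediate */
    toy_send_fn send;   /* regular send, assumed always present */
} toy_btl_t;

/* Stub transports used only to exercise the loop. */
int toy_ok(const void *buf, size_t len)   { (void)buf; (void)len; return TOY_SUCCESS; }
int toy_fail(const void *buf, size_t len) { (void)buf; (void)len; return TOY_ERROR; }

/* Visit the BTLs in round-robin order starting at *next. On each one,
 * try sendi (if present) and then send. On success, store the index of
 * the BTL that was used in *used and advance *next past it. */
int toy_send_interleaved(const toy_btl_t *btls, size_t n, size_t *next,
                         const void *buf, size_t len, size_t *used)
{
    for (size_t i = 0; i < n; i++) {
        size_t idx = (*next + i) % n;
        const toy_btl_t *btl = &btls[idx];
        if ((btl->sendi != NULL && TOY_SUCCESS == btl->sendi(buf, len)) ||
            TOY_SUCCESS == btl->send(buf, len)) {
            *used = idx;
            *next = (idx + 1) % n;
            return TOY_SUCCESS;
        }
    }
    return TOY_ERROR;
}
```

In this model, a BTL without sendi whose send fails is simply skipped,
and the next BTL in the rotation gets both chances.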

The problem is that this loop is buried several function calls deep in
the PML. Before it reaches this far, the PML has initialized a large
"send request" data structure while traversing some (to me) complicated
call graph of functions. This introduces a lot of overhead that
negates much of the speedup we might hope to see with the sendi
function. That overhead is unnecessary for a sendi call, but necessary
for a send call. I've tried reorganizing the code to defer as much of
that work as possible -- performing it only if it's needed for a send
call -- but I've gotten a brain cramp every time I've tried this
reorganization.

I think these are the options:

Option A) Punt!

Option B) Have someone more familiar with the PML make these changes.

Option C) Have Eugene keep working at this because he'll learn more
about the PML and it's good for his character.

Option D) Go to a strategy in which all BTLs are tried for sendi before
any of them is tried for a send. The code would look like this:

    for ( btl = ... ) if ( SUCCESS == btl->sendi() ) return SUCCESS;
    for ( btl = ... ) if ( SUCCESS == btl->send() ) return SUCCESS;
    return ERROR;

The reason this is so much easier to achieve is that we can put that
first loop way up high in the PML (as soon as a send enters the PML,
avoiding all that expensive overhead) and leave the second loop several
layers down, where it is today. George is against this new loop
structure because he thinks round-robin selection of BTLs is fairest
and distributes the load over BTLs as evenly as possible. (In contrast,
the proposed loop would favor BTLs with sendi functions.) It seems to
me, however, that favoring BTLs that have sendi functions is exactly the
right thing to do! I'm not even convinced that the conditions he's
worried about are that common: multiple eager BTLs to poll, one has a
sendi, and that sendi is not very good or that BTL is getting overloaded.

Anyhow, I like Option D, but George does not.
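
A toy sketch of Option D may help show both the appeal and George's
objection. Everything here is hypothetical: the toy_* names, the stub
transports, and toy_setup_request(), which merely counts how often the
expensive "send request" initialization would have run. For brevity the
sketch always starts at BTL 0 rather than rotating the start index.

```c
#include <stddef.h>

/* Toy stand-ins, NOT the real Open MPI BTL interface. */
enum { TOY_SUCCESS = 0, TOY_ERROR = -1 };

typedef int (*toy_send_fn)(const void *buf, size_t len);

typedef struct {
    toy_send_fn sendi;  /* assumed NULL when the BTL has no send-immediate */
    toy_send_fn send;   /* regular send, assumed always present */
} toy_btl_t;

/* Stub transports used only to exercise the loops. */
int toy_ok(const void *buf, size_t len)   { (void)buf; (void)len; return TOY_SUCCESS; }
int toy_fail(const void *buf, size_t len) { (void)buf; (void)len; return TOY_ERROR; }

/* Counts how often the (hypothetical) heavyweight request setup ran. */
int toy_setup_count = 0;
void toy_setup_request(void) { toy_setup_count++; }

/* Option D: try sendi on every BTL first, then send on every BTL.
 * On success, *used receives the index of the BTL that sent. */
int toy_send_option_d(const toy_btl_t *btls, size_t n,
                      const void *buf, size_t len, size_t *used)
{
    /* Pass 1: cheap path, no send request has been built yet. */
    for (size_t i = 0; i < n; i++) {
        if (btls[i].sendi != NULL && TOY_SUCCESS == btls[i].sendi(buf, len)) {
            *used = i;
            return TOY_SUCCESS;
        }
    }
    /* Pass 2: pay for the send request only now. */
    toy_setup_request();
    for (size_t i = 0; i < n; i++) {
        if (TOY_SUCCESS == btls[i].send(buf, len)) {
            *used = i;
            return TOY_SUCCESS;
        }
    }
    return TOY_ERROR;
}
```

With two BTLs where only the second has a sendi, this sketch always
picks the second one and never touches the setup counter, which is the
bias toward sendi-capable BTLs described above.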

Option E) Go to a strategy in which the next BTL is tested for a sendi
function. If there is one, use it. If not, just continue with the
usual heavyweight PML procedure. This feels a little hackish to me, but
it'll mean that most of the time that sendi can be called, the
heavyweight PML overhead will be avoided, while at the same time "fair"
round-robin polling over the BTLs is maintained.
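
Option E could look roughly like the toy sketch below. The toy_* names,
the round-robin cursor, and the setup counter are hypothetical, and the
heavyweight path is reduced to a plain loop over send. The sketch also
assumes that a failed sendi falls through to the heavyweight path,
which the description above leaves open.

```c
#include <stddef.h>

/* Toy stand-ins, NOT the real Open MPI PML/BTL interfaces. */
enum { TOY_SUCCESS = 0, TOY_ERROR = -1 };

typedef int (*toy_send_fn)(const void *buf, size_t len);

typedef struct {
    toy_send_fn sendi;  /* assumed NULL when the BTL has no send-immediate */
    toy_send_fn send;   /* regular send, assumed always present */
} toy_btl_t;

typedef struct {
    const toy_btl_t *btls;
    size_t n;
    size_t next;        /* round-robin cursor: the BTL to try first */
} toy_pml_t;

/* Stub transport used only to exercise the code. */
int toy_ok(const void *buf, size_t len) { (void)buf; (void)len; return TOY_SUCCESS; }

/* Counts how often the (hypothetical) heavyweight request setup ran. */
int toy_setup_count = 0;

/* Simplified stand-in for the existing heavyweight path: build the
 * send request, then round-robin over the BTLs calling send. */
int toy_heavy_send(toy_pml_t *pml, const void *buf, size_t len)
{
    toy_setup_count++;
    for (size_t i = 0; i < pml->n; i++) {
        size_t idx = (pml->next + i) % pml->n;
        if (TOY_SUCCESS == pml->btls[idx].send(buf, len)) {
            pml->next = (idx + 1) % pml->n;
            return TOY_SUCCESS;
        }
    }
    return TOY_ERROR;
}

/* Option E: look only at the next BTL in the rotation. If it has a
 * sendi and that call succeeds, we are done without any request setup.
 * Otherwise continue with the heavyweight path. */
int toy_send_option_e(toy_pml_t *pml, const void *buf, size_t len)
{
    if (pml->n > 0) {
        const toy_btl_t *btl = &pml->btls[pml->next];
        if (btl->sendi != NULL && TOY_SUCCESS == btl->sendi(buf, len)) {
            pml->next = (pml->next + 1) % pml->n;
            return TOY_SUCCESS;
        }
    }
    return toy_heavy_send(pml, buf, len);
}
```

The cursor advances the same way on both paths, so the rotation over
BTLs is unchanged; only the request setup is skipped when the BTL whose
turn it is happens to offer a working sendi.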

I'll proceed with Option C for the time being. If I don't announce
success or surrender in the next few days, please write to me at the
insane asylum.