Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] MPI Message Communication over TCP/IP
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-04-17 08:30:35


On Apr 16, 2009, at 8:58 PM, pranav jadhav wrote:

> Thanks for providing the details. I was going through the code of
> MPI_Send and I found a function pointer being invoked mca_pml.send
> of struct mca_pml_base_module_t. I am trying to figure out when
> these PML function pointers get initialized to call internal BTL
> functions.

There's a somewhat-complicated setup dance during MPI_INIT when all
those function pointers get initialized. See below.

> I am trying to know how MPI programs communicate over TCP/IP for
> message passing in a distributed setup and would appreciate if you
> can provide more details or any report that you would like to share.

The BTL (Byte Transfer Layer) is OMPI's lowest layer for
point-to-point communications. The layering looks like this:

     MPI API
     PML (point-to-point messaging layer)
     BTL (byte transfer layer)

The PML also uses the BML (BTL multiplexing layer) to handle multiple
BTLs simultaneously. I don't really list it in the layering above
because it's just accounting functionality (arrays of BTL function
pointers); it's not really a "layer" in the traditional sense.

BTW, know that the BTLs are only used by the OB1 and CSUM PMLs.
There's a "dr" PML which is fairly dead at this point, and a CM PML,
which, for lack of a longer description, is used with different kinds
of networks (not TCP). So let's focus on OB1 / the TCP BTL.

At the bottom of MPI_SEND (and other point-to-point MPI API
functions), you'll see a call to mca_pml.<foo>. This calls a function
in the selected PML -- in your case, OB1. OB1 handles all the MPI
logic for point-to-point message passing: all the rules, matching,
ordering, and progression for MPI point-to-point message passing. The
BTLs are "simple" bit-pushers. They know nothing about MPI. They
take fragments from the PML and send them to peers. They receive
fragments from peers and give them to the upper-level PML.

That's the 50,000-foot-level description.
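
To make the layering concrete, here is a minimal sketch in C of an API-level send dispatching through a table of function pointers into a "PML", which in turn hands fragments to a "BTL". Every name here (toy_pml, toy_btl_t, toy_mpi_send, TOY_MAX_FRAG, and so on) is hypothetical and invented for illustration; the real structures are far richer and are declared in ompi/mca/pml/pml.h and ompi/mca/btl/btl.h. The fragment size of 4 bytes is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FRAG 4   /* arbitrary fragment size, for this sketch only */

/* Hypothetical stand-in for a BTL: it only moves bytes, knows no MPI. */
typedef struct toy_btl {
    const char *name;
    size_t (*send_frag)(struct toy_btl *btl, int peer,
                        const void *buf, size_t len);
    size_t bytes_sent;   /* bookkeeping so the sketch can be checked */
    size_t frags_sent;
} toy_btl_t;

/* Hypothetical stand-in for the PML function table. */
typedef struct toy_pml {
    int (*send)(const void *buf, size_t len, int dest, int tag);
} toy_pml_t;

static toy_pml_t toy_pml;            /* filled in at init time */
static toy_btl_t *toy_active_btl;    /* the BTL the toy PML will use */

/* Toy "TCP BTL": a real one would write the fragment to a socket. */
static size_t toy_tcp_send_frag(toy_btl_t *btl, int peer,
                                const void *buf, size_t len)
{
    (void)peer;
    (void)buf;
    btl->bytes_sent += len;
    btl->frags_sent += 1;
    return len;
}

static toy_btl_t toy_tcp_btl = { "tcp", toy_tcp_send_frag, 0, 0 };

/* Toy "PML" send: owns the message-level logic (here, just splitting
 * the message into fragments) and pushes each fragment to the BTL. */
static int toy_ob1_send(const void *buf, size_t len, int dest, int tag)
{
    const char *p = buf;
    (void)tag;           /* a real PML would use the tag for matching */
    assert(toy_active_btl != NULL);
    while (len > 0) {
        size_t n = len < TOY_MAX_FRAG ? len : TOY_MAX_FRAG;
        toy_active_btl->send_frag(toy_active_btl, dest, p, n);
        p += n;
        len -= n;
    }
    return 0;
}

/* Stand-in for the init-time wiring: load the function pointers. */
void toy_init(void)
{
    toy_active_btl = &toy_tcp_btl;
    toy_pml.send = toy_ob1_send;
}

/* API-level entry point: just calls through the function pointer. */
int toy_mpi_send(const void *buf, size_t len, int dest, int tag)
{
    return toy_pml.send(buf, len, dest, tag);
}
```

After toy_init(), a call such as toy_mpi_send("hello world", 11, 1, 0) goes API -> PML -> BTL, and the toy BTL ends up having seen three fragments (4 + 4 + 3 bytes).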

Most of the function pointers you care about are set up during
MPI_INIT. There's a PML "selection" process that occurs -- Open MPI
queries every PML that it can find (e.g., those that were built as
plugins) and says "do you want to run?" If they answer yes, OMPI asks
them "what's your priority?" OMPI then selects the 1 PML that says
"yes, I want to run" with the highest priority. In your case, OB1 is
getting selected. All other PMLs are closed and OB1's function
pointers are loaded into the mca_pml struct. We then allow OB1 to
initialize itself (since it "won" the selection process).
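
The selection process just described can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual selection code: the sel_* names are hypothetical, the priorities (10 and 20) are made up, the toy "cm" simply declines to run, and breaking ties in favor of the first component found is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical component descriptor for this sketch. */
typedef struct sel_component {
    const char *name;
    /* Returns 1 and sets *priority if willing to run, else returns 0. */
    int (*query)(int *priority);
    void (*close)(void);
} sel_component_t;

static int sel_closed_count;   /* counts how many components were closed */

static void sel_note_close(void) { sel_closed_count++; }

/* Made-up answers: dr and ob1 want to run, cm declines. */
static int sel_dr_query(int *priority)  { *priority = 10; return 1; }
static int sel_ob1_query(int *priority) { *priority = 20; return 1; }
static int sel_cm_query(int *priority)  { (void)priority; return 0; }

static const sel_component_t sel_pmls[] = {
    { "dr",  sel_dr_query,  sel_note_close },
    { "ob1", sel_ob1_query, sel_note_close },
    { "cm",  sel_cm_query,  sel_note_close },
};

/* Ask every component "do you want to run?" and "what's your
 * priority?", keep the single highest-priority volunteer, and close
 * all the others. Returns NULL if nobody volunteers. */
const sel_component_t *sel_select_one(const sel_component_t *comps,
                                      size_t n)
{
    const sel_component_t *best = NULL;
    int best_priority = 0;
    size_t i;

    assert(comps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        int priority = 0;
        if (!comps[i].query(&priority)) {
            continue;                     /* answered "no" */
        }
        if (best == NULL || priority > best_priority) {
            best = &comps[i];
            best_priority = priority;
        }
    }
    for (i = 0; i < n; i++) {
        if (&comps[i] != best && comps[i].close != NULL) {
            comps[i].close();             /* the losers are closed */
        }
    }
    return best;
}
```

With the made-up answers above, sel_select_one(sel_pmls, 3) returns the "ob1" entry and closes the other two; in the real code, the winner's function pointers would then be loaded into the mca_pml struct.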

Keep in mind that OB1 is an engine/state machine: it doesn't know how
to connect to or communicate with peers. It uses the BTLs for that.
So part of OB1's initialization is selecting which BTLs to use (unlike
the PML, where we only choose *1* PML to use at run-time, OB1 chooses
as many BTLs as say "yes, I want to run"). OB1 uses the BML to manage
the arrays of pointers to BTLs, but as I mentioned above, this is
simple accounting/bookkeeping code -- if you look in the R2 BML
module, it's just array manipulation stuff. Pretty straightforward.
So OB1 (R2) opens up all the BTLs that it can find and queries them
"do you want to run?" If the BTL answers "yes", then its function
pointers get added to R2's internal store of pointers. We then let
each BTL initialize itself (e.g., in TCP's case, open up a listening
socket).
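
By contrast with the PML case, the BTL side keeps every volunteer. A minimal sketch of that bookkeeping, assuming a fixed-size array and hypothetical bml_* names (the real R2 code and its data structures differ), might look like this. The "yes"/"no" answers of the toy components are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define BML_MAX_BTLS 8   /* arbitrary capacity for this sketch */

/* Hypothetical BTL component descriptor. */
typedef struct bml_btl_component {
    const char *name;
    int (*want_to_run)(void);   /* 1 = "yes, I want to run" */
    void (*init)(void);         /* e.g., TCP would open a listening socket */
} bml_btl_component_t;

/* The "accounting" part: just an array of pointers plus a count. */
typedef struct bml_store {
    const bml_btl_component_t *btls[BML_MAX_BTLS];
    size_t count;
} bml_store_t;

static int bml_init_count;      /* counts init calls, for checking */

static int bml_yes(void) { return 1; }
static int bml_no(void)  { return 0; }
static void bml_note_init(void) { bml_init_count++; }

/* Made-up answers: tcp and self volunteer, openib declines. */
static const bml_btl_component_t bml_btls[] = {
    { "tcp",    bml_yes, bml_note_init },
    { "openib", bml_no,  bml_note_init },
    { "self",   bml_yes, bml_note_init },
};

/* Query every BTL, store all that answer "yes", then let each stored
 * BTL initialize itself. Returns the number of BTLs kept. */
size_t bml_select_all(bml_store_t *store,
                      const bml_btl_component_t *comps, size_t n)
{
    size_t i;

    assert(store != NULL);
    store->count = 0;
    for (i = 0; i < n && store->count < BML_MAX_BTLS; i++) {
        if (comps[i].want_to_run()) {
            store->btls[store->count++] = &comps[i];
        }
    }
    for (i = 0; i < store->count; i++) {
        store->btls[i]->init();
    }
    return store->count;
}
```

Here bml_select_all() on the three toy components keeps two of them ("tcp" and "self") and initializes both, which is the "as many BTLs as say yes" behavior, as opposed to the single winner in the PML selection.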

Open MPI's code tree is organized as follows:

ompi/ -- top-level directory for all MPI-related code
   mca/ -- top-level directory for all frameworks
     pml/ -- top-level directory for all pml components (plugins)
       base/ -- top-level directory for pml "glue" code (i.e., shared
                between all pml plugins)
       ob1/ -- directory for all code in the ob1 component (plugin)
       cm/ -- directory for all code in the cm component
     bml/ -- top-level directory for all bml components
       base/ -- top-level directory for bml "glue" code (i.e., shared
                between all bml plugins)
       r2/ -- top-level directory for the r2 component
     btl/ -- top-level directory for all btl components
       base/ -- top-level directory for btl "glue" code (i.e., shared
                between all btl plugins)
       tcp/ -- top-level directory for the tcp component

...I think you can see the pattern here:

   ompi/mca/<framework name>/<component name>

where the <component name> of "base" is special: it's the "glue" code
for that framework itself; it's not a component.

The interface for all plugins is always in a file of this form:

   ompi/mca/<framework>/<framework>.h

So look at ompi/mca/pml/pml.h and ompi/mca/btl/btl.h. We usually have
a decent overview of the component interface in those files.

That's the short answer of how OB1 and the BTLs start up and set up all
their function pointers. :-)

As for collectives, that's a different framework (e.g., as opposed to
PML, BML, BTL): the coll framework. We have a bunch of different
collective plugins available; which one(s) is(are) used depends on
several factors.

The coll selection process is significantly different than that of the
PML (e.g., OB1 and the BTLs), meaning that it's a bit more complex...
Have a look in ompi/mca/coll/coll.h for a description of how that
works. Hopefully, with the background that I've listed above, you can
read the comments in that file and have it make some semblance of
sense...

-- 
Jeff Squyres
Cisco Systems