Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI users] How can I tell (open-)mpi about the HW topology of my system?
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-10-27 19:58:09


On Oct 27, 2009, at 6:12 AM, Luigi Scorzato wrote:

> >However, we do have the good foresight (if I do say so myself ;-) )
> >to make the MPI topology system be a plugin in Open MPI. The only
> >plugin for this system is currently the "do nothing" plugin, but it
> >would *not* be difficult to write one that actually did something
> >meaningful in your torus.
>
> >If you're interested, I'd be happy to explain how to do it (and we
> >should probably move to the devel list). OMPI doesn't require too
> >much framework code; I would guess that the majority of the code
> >would actually be implementing whatever algorithms you wanted for
> >your torus. Heck, you could even write a blind-and-dumb algorithm
> >that just looks up tables in files based on hostnames in your torus.
>
> I am very much interested. Could you please suggest me where I should
> look into?
>

(moved to devel from users list)

Open MPI has two entities that you need to know about: frameworks and
components (components are also referred to as "plugins"). Frameworks
are the glue for a specific kind of component (plugin). For example,
we have a framework for MPI point-to-point messages. We have another
framework for MPI collective operations. We have another framework
(the one you care about) for MPI topology operations. And so on. In
each framework, there are one or more components (plugins) that are
loaded and used at run-time to effect the functionality in that
framework.

Example: one of the MPI point-to-point messaging frameworks is called
the "BTL" (byte transfer layer). We have a bunch of BTL components:
one for TCP, one for shared memory, one for process loopback, one for
MX, one for OpenFabrics verbs, ...etc. These plugins are effectively
(eventually) called when you call MPI_SEND, MPI_RECV, ...etc.

Example: another MPI framework is "coll" -- MPI collective
operations. We have several components that effect different
algorithms and transports underneath. These plugins are called when
you call MPI_BARRIER, MPI_BCAST, MPI_SCATTER, ...etc.

Example: the "topo" MPI framework is for MPI topology operations. We
currently only have one component in this framework, named
"unity" (because it makes no transformation of ranks). The functions
in these components are called when you call MPI_CART_CREATE,
MPI_GRAPH_CREATE, ...etc.

Frameworks can be found in the OMPI source code in
ompi/mca/<framework>. There's always a header file named
ompi/mca/<framework>/<framework>.h. Components are always specific to
a single framework, and can be found in the OMPI source code in
ompi/mca/<framework>/<component>.

So you want to make a new topo component that can remap ranks based on
your network topology, perhaps in ompi/mca/topo/luigi/ or
ompi/mca/topo/torus/ or whatever.

See these wiki pages:

   https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateFramework
   --> will give you an appreciation of what frameworks are
   https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateComponent
   --> step-by-step instructions on how to make a new luigi or torus
       or whatever component

I would suggest getting an SVN checkout of the OMPI trunk (see
http://www.open-mpi.org/svn/) and working on your new component there.

The ompi/mca/topo/topo.h file has a decent description of the
topo component interface (i.e., the functions that your new component
will need to provide). Note that the MPI cartesian and graph
communicator interfaces were cleverly designed such that all the cart
functions can be implemented in terms of MPI_CART_MAP and all the
graph functions can be implemented in terms of MPI_GRAPH_MAP. So
aside from OMPI "glue" code, your plugins may only need to provide
those two functions to be fully functional.

I'd advise using the unity component as an example to create a new
component, and then fill in whatever algorithms you want.
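
To make the "look up tables based on hostnames" idea concrete, here is
a minimal sketch of the kind of mapping logic a torus component could
put behind its cart map function. Everything in it is an assumption
for illustration: the torus_entry_t type, the torus_lookup_rank name
and signature, and the table contents are hypothetical and are not
taken from topo.h. A real component would also read the table from a
file rather than hard-coding it.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: where a host sits in a 2-D torus. */
typedef struct {
    const char *hostname;
    int x;
    int y;
} torus_entry_t;

/* Made-up example data; a real component might load this from a file. */
static const torus_entry_t torus_table[] = {
    { "node00", 0, 0 },
    { "node01", 0, 1 },
    { "node10", 1, 0 },
    { "node11", 1, 1 },
};

/*
 * Map a hostname to a rank in a dims[0] x dims[1] cartesian grid using
 * row-major ordering (the last dimension varies fastest).
 * Returns 0 and sets *newrank on success; returns -1 if the host is
 * unknown or does not fit in the requested grid.
 */
int torus_lookup_rank(const char *hostname, const int dims[2], int *newrank)
{
    size_t n = sizeof(torus_table) / sizeof(torus_table[0]);

    for (size_t i = 0; i < n; ++i) {
        const torus_entry_t *e = &torus_table[i];
        if (strcmp(e->hostname, hostname) != 0) {
            continue;
        }
        if (e->x >= dims[0] || e->y >= dims[1]) {
            return -1;
        }
        *newrank = e->x * dims[1] + e->y;
        return 0;
    }
    return -1;
}
```

The actual cart map entry point would wrap something like this with
whatever arguments topo.h requires.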

Some more OMPI terminology: a "module" is an "instance" of a
component. Think of a "component" as a C++ class; think of a "module"
as a C++ object. The "base" is the glue of a framework that makes it
run (e.g., the functions for opening the framework, traversing found
components, closing the framework, etc.).

The basic startup sequence is that OMPI will call the init_query
function on your component the first time MPI_CART_CREATE or
MPI_GRAPH_CREATE is invoked and see if it wants to run. If it does,
the component is added to a list of "available" components.

Every time a graph or cart communicator is created, the list of
available topo components is traversed and the component comm_query
function is invoked. The comm_query function indicates whether it can
be used or not by returning a module or a NULL. The base maintains a
list of modules that were returned and selects the one with the
highest priority. comm_unquery is called on all the losers;
module_init is invoked on the winner.
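
The selection pass described above can be sketched roughly as follows.
This is not the code in the topo base: all of the sketch_* types and
functions are hypothetical stand-ins, the query functions take no
arguments for brevity, and the sketch picks the winner in a single
pass instead of building a list first. Setting the "initialized" flag
stands in for calling module_init on the winner.

```c
#include <stddef.h>

/* Hypothetical stand-ins; these are not the real topo.h types. */
typedef struct sketch_module {
    int priority;
    int initialized;
} sketch_module_t;

typedef struct sketch_component {
    const char *name;
    /* Returns a module (with its priority set) or NULL to decline. */
    sketch_module_t *(*comm_query)(void);
    /* Tells the component that its module lost the selection. */
    void (*comm_unquery)(sketch_module_t *module);
} sketch_component_t;

/* Query every available component and keep the highest priority. */
sketch_module_t *sketch_select(sketch_component_t **available, size_t n)
{
    sketch_module_t *best = NULL;
    sketch_component_t *best_comp = NULL;

    for (size_t i = 0; i < n; ++i) {
        sketch_module_t *m = available[i]->comm_query();
        if (NULL == m) {
            continue;                          /* component declined */
        }
        if (NULL == best || m->priority > best->priority) {
            if (NULL != best) {
                best_comp->comm_unquery(best); /* old leader loses */
            }
            best = m;
            best_comp = available[i];
        } else {
            available[i]->comm_unquery(m);     /* newcomer loses */
        }
    }
    if (NULL != best) {
        best->initialized = 1;                 /* stands in for module_init */
    }
    return best;
}

/* Made-up example components used to exercise the selection. */
sketch_module_t low_module  = { 10, 0 };
sketch_module_t high_module = { 50, 0 };
int unquery_calls = 0;

sketch_module_t *low_query(void)     { return &low_module; }
sketch_module_t *high_query(void)    { return &high_module; }
sketch_module_t *decline_query(void) { return NULL; }
void count_unquery(sketch_module_t *m) { (void)m; ++unquery_calls; }
```

With the three example components in the list, the one returning
high_module wins, the one returning low_module gets a single unquery
call, and the declining component is simply skipped.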

Check out the code in ompi/mca/topo/base/topo_base_comm_select.c --
there's a good amount of comments in there about how per-communicator
selection occurs.

--> Hmm. I'm looking at the prototype for comm_query in topo.h and it
doesn't take a list of processes. This seems like a bad idea; a
component may only be able to run on a subset of processes in the
overall MPI job (e.g., if you have a shared-memory topology component,
it would only allow itself to be used at run-time if all processes in
the communicator are physically located on the same node). Hmm. We
might want to update this prototype to include a list of processes
that you can check to see if your component is eligible.
Additionally, it seems weird that the comm_unquery function is on the
component -- it really should be on the module (editor's note: this
framework was created way back during the beginning of OMPI and likely
hasn't been touched since... I think it's showing its age :-\ ).

Once a module is selected, its function pointers effectively become
the back-ends to functions like MPI_CART_CREATE, MPI_GRAPH_CREATE,
etc. Note that you can implement all the topology functions in terms
of MPI_CART_MAP and MPI_GRAPH_MAP (this is what unity does). If you
provide NULL for all the other function pointers, the base will
automatically insert functions that implement themselves by calling
your module's cart_map and graph_map functions.
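
As a rough illustration of that "NULL means the base fills it in"
pattern, here is a small sketch. The struct layout, the function names
and the signatures are all assumptions made up for this example; the
real ones are in topo.h and the topo base. The point is only the shape
of the mechanism: the module supplies its map function, leaves the
other pointer NULL, and the base plugs in a default that defers to the
module's map function.

```c
#include <stddef.h>

/* Hypothetical module; not the real topo.h structure. */
typedef struct sketch_topo_module sketch_topo_module_t;
struct sketch_topo_module {
    /* Required: pick this process's rank in an ndims-dimensional grid. */
    int (*cart_map)(int oldrank, int ndims, const int *dims, int *newrank);
    /* Optional: NULL means "let the base provide it". */
    int (*cart_create)(const sketch_topo_module_t *m, int oldrank,
                       int ndims, const int *dims, int *newrank);
};

/* Base default: validate the grid, then defer to the module's cart_map. */
int sketch_base_cart_create(const sketch_topo_module_t *m, int oldrank,
                            int ndims, const int *dims, int *newrank)
{
    if (ndims <= 0) {
        return -1;
    }
    for (int i = 0; i < ndims; ++i) {
        if (dims[i] <= 0) {
            return -1;
        }
    }
    return m->cart_map(oldrank, ndims, dims, newrank);
}

/* Fill any entry point the module left NULL with the base default. */
void sketch_base_fill_defaults(sketch_topo_module_t *m)
{
    if (NULL == m->cart_create) {
        m->cart_create = sketch_base_cart_create;
    }
}

/* Example module function: leave ranks unchanged. */
int identity_cart_map(int oldrank, int ndims, const int *dims, int *newrank)
{
    (void)ndims;
    (void)dims;
    *newrank = oldrank;
    return 0;
}
```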

Note that in order to save some space, we overlap the meanings of some
fields (graph dimensions or list of indexes). In hindsight, I'm not
sure why we didn't use a union. :-\
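
For what it's worth, a union-based layout could look something like
the sketch below. The type and field names are hypothetical and do not
match the actual structure in the code; this only shows how a tag plus
a union would make the "dimensions or indexes" overlap explicit.

```c
#include <stddef.h>

/* Hypothetical tag saying which union member is valid. */
typedef enum { SKETCH_TOPO_CART, SKETCH_TOPO_GRAPH } sketch_topo_kind_t;

typedef struct {
    sketch_topo_kind_t kind;
    int count;              /* number of dims (cart) or nodes (graph) */
    union {
        const int *dims;    /* valid when kind == SKETCH_TOPO_CART */
        const int *index;   /* valid when kind == SKETCH_TOPO_GRAPH */
    } u;
} sketch_topo_info_t;

/* Number of processes in a cartesian grid, or -1 for a non-cart topology. */
int sketch_cart_size(const sketch_topo_info_t *t)
{
    if (SKETCH_TOPO_CART != t->kind) {
        return -1;
    }
    int size = 1;
    for (int i = 0; i < t->count; ++i) {
        size *= t->u.dims[i];
    }
    return size;
}
```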

Finally, when the communicator is destroyed, the module_finalize
function is invoked.

=====

Based on my "Hmm..." comment above, I think I want to revamp the
selection logic a little before you dive too deeply into this -- to
modernize it and make it a bit more like the rest of the OMPI code
base; you can tell that this code was created a long time ago and
hasn't been touched since (you're the first person to express interest in
creating a real topo component! :-) ). I've created a Mercurial
branch of the OMPI trunk for this work and published it here:

     http://bitbucket.org/jsquyres/ompi-topo-fixes/

Give me a few days to get this branch into shape (and potentially to
get it back to the SVN trunk). I might even get inspired to make a
template 2nd component for you (i.e., I might need a 2nd component
just to ensure that the selection logic is working :-) ).

-- 
Jeff Squyres
jsquyres_at_[hidden]