
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] RFC: revamp topo framework
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-10-30 15:28:15

What George is describing is the Right answer, but it may take you a
little time.

FWIW: the complexity of a topo component is actually pretty low. It's
essentially a bunch of glue code (that I can probably mostly provide)
and your mapping algorithms about how to reorder the communicator ranks.

To be clear: topo components are *ONLY* about re-ordering ranks in a
communicator -- the back-end of MPI_CART_CREATE and friends.
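
To make the rank re-ordering concrete, here is a minimal sketch of the kind of mapping a torus-aware topo component might compute. It assumes the row-major numbering that the MPI standard describes for Cartesian topologies. All function names and the phys_coords array are hypothetical; none of this is the Open MPI topo interface.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major rank of a Cartesian coordinate: the last dimension varies
 * fastest. Hypothetical helper, not an Open MPI function. */
int cart_coords_to_rank(int ndims, const int dims[], const int coords[])
{
    int rank = 0;
    for (int i = 0; i < ndims; ++i) {
        assert(coords[i] >= 0 && coords[i] < dims[i]);
        rank = rank * dims[i] + coords[i];
    }
    return rank;
}

/* Inverse of the function above: recover coordinates from a rank. */
void cart_rank_to_coords(int ndims, const int dims[], int rank, int coords[])
{
    for (int i = ndims - 1; i >= 0; --i) {
        coords[i] = rank % dims[i];
        rank /= dims[i];
    }
}

/* Hypothetical reorder step. phys_coords holds nprocs * ndims ints: the
 * physical torus coordinates of each process, indexed by its old rank.
 * new_rank[old] is chosen so that the Cartesian coordinates of the new
 * rank equal the physical coordinates of the process. */
void torus_reorder(int ndims, const int dims[], int nprocs,
                   const int phys_coords[], int new_rank[])
{
    for (int old = 0; old < nprocs; ++old) {
        new_rank[old] =
            cart_coords_to_rank(ndims, dims, &phys_coords[old * ndims]);
    }
}
```

With a mapping like this, nearest neighbors in the Cartesian communicator are also nearest neighbors on the physical torus, provided the requested dims match the torus dimensions.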

The BTL components that George is talking about are Byte Transfer
Layer components; essentially the brains behind MPI_SEND and friends.
Open MPI has a per-device list of BTLs that can service each peer MPI
process. Hence, if you're sending to another MPI process on the same
host, the first BTL in the list will be the shared memory BTL. If
you're sending to an MPI process on a different server that you're
connected to via ethernet, the TCP BTL may be at the top of the list.
And so on.

It sounds like you actually want to make *two* components:

- topo: for reordering ranks during MPI_CART_CREATE and friends
- btl: use the underlying network primitives for sending when possible

As George indicated, the BTL module in each MPI process can determine
during startup which MPI process peers it can talk to. It can then
tell the upper-layer routing algorithm "I can talk to peer processes
X, Y, and Z -- I cannot talk to peer processes A, B, and C". The
upper-layer router (the PML module) will then put your BTL at the top
of the list for peer processes X, Y, and Z, and will not put your BTL
on the list for peer processes A, B, and C. For A, B, and C, other
BTLs will be used (e.g., TCP).
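
As a rough illustration of that selection logic, the sketch below builds a per-peer list from whichever transports report that they can reach the peer, with the most preferred one first. The toy_btl struct, its priority field, and the reachability functions are all made up for this example; the real BTL interface in Open MPI is different and considerably richer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BTLS 8
#define RING_SIZE 8

/* Hypothetical stand-in for a BTL module (not the real Open MPI struct). */
typedef struct {
    const char *name;
    int priority;                          /* higher means preferred */
    bool (*can_reach)(int self, int peer); /* decided at startup */
} toy_btl;

/* Fill out[] with the BTLs that can reach `peer`, sorted by descending
 * priority. Returns how many entries were written. */
size_t build_peer_list(const toy_btl *btls, size_t nbtls, int self, int peer,
                       const toy_btl *out[MAX_BTLS])
{
    size_t n = 0;
    assert(nbtls <= MAX_BTLS);
    for (size_t i = 0; i < nbtls; ++i) {
        if (!btls[i].can_reach(self, peer)) {
            continue; /* this BTL stays off the list for this peer */
        }
        /* Insertion sort step: shift lower-priority entries down. */
        size_t pos = n;
        while (pos > 0 && out[pos - 1]->priority < btls[i].priority) {
            out[pos] = out[pos - 1];
            --pos;
        }
        out[pos] = &btls[i];
        ++n;
    }
    return n;
}

/* Toy rule: TCP reaches every peer. */
bool tcp_reach(int self, int peer)
{
    (void)self;
    (void)peer;
    return true;
}

/* Toy rule: the torus BTL only reaches nearest neighbors on a 1-D ring. */
bool torus_reach(int self, int peer)
{
    int d = ((peer - self) % RING_SIZE + RING_SIZE) % RING_SIZE;
    return d == 1 || d == RING_SIZE - 1;
}
```

In this toy setup, a neighbor on the ring gets the torus BTL first with TCP behind it, while a distant peer gets a list containing TCP only.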

Does that make sense?

To answer your question from a prior mail: the unity topo component is
used for the remapping of ranks in MPI_CART_CREATE. Look in ompi/mca/

On Oct 30, 2009, at 11:53 AM, George Bosilca wrote:

> Luigi,
> The way Open MPI currently selects the network to be used between
> processes matches the first approach you proposed very well. As we
> support multiple networks simultaneously, a BTL (the low level network
> driver) can service only a subset of peers. All other communications
> will automatically be redirected through another BTL (which has to be
> available). In the past there were some attempts to route messages but
> this code is not in the trunk.
> george.
> On Oct 30, 2009, at 04:47 , Luigi Scorzato wrote:
> >
> >
> > I am very interested in this, but let me explain my present
> > situation and goals in more detail.
> >
> > I am working in a group that is testing a system under development
> > which is connected with both:
> > - an ordinary all-to-all standard interface (where open-mpi is
> > already available) but with limited performance and scalability.
> > - a custom 3D torus network, with no mpi available, custom low-level
> > communication primitives (under development), from which we expect
> > higher performance and scalability.
> >
> >
> > I have two approaches in mind:
> >
> > 1st approach.
> > Use the standard network interface to setup MPI. However, through a
> > precompilation step, redefine a few MPI_ functions (MPI_Send()
> > MPI_Recv() and others) such that they call the torus primitives, if
> > the communication is between nearest neighbors, and fall back into
> > standard MPI through the standard interface if not. This can only
> > work if I can choose the mpi-ranks of my system in a way that
> > MPI_Cart_create() will generate coordinates consistent with the
> > physical topology.
> > ***There must be a place - somewhere in the open-mpi code - where
> > the cartesian coordinates are chosen, presumably as a deterministic
> > function of the mpi-ranks and the dimensions (as given by
> > MPI_Dims_create). I expected it to be in MPI_Cart_create(). But I
> > could not find it. Can anyone help?***
> > This approach has obvious limitations of portability, besides
> > requiring the availability of a fallback network, but it gives me
> > full control of what I need to do, which is essential since my
> > primary goal is to get a few important codes working in the new
> > system asap.
> >
> >
> > 2nd approach.
> > Develop a new "torus" topo component, as explained by Jeff. This is
> > certainly the *right* solution, but there are two problems:
> > - because of my poor familiarity with the open-mpi source code, I am
> > not able to estimate how long it will take me.
> > - in a first phase, the torus primitives will not support all-to-all
> > communications but only nearest-neighbor ones. Hence, full
> > portability is excluded anyway and/or a fallback network is still
> > needed. In other words, the topo component should be able to deal
> > with two networks, and I have no idea of how much this will
> > complicate things.
> >
> >
> > I necessarily have to push the 1st approach, for the moment, but I
> > am very much interested in studying the 2nd and if I see that it is
> > realistic (given the limitations above) and safe, I may turn to it
> > completely.
> >
> > thanks for your feedback and best regards, Luigi
> >
> > _______________________________________________
> > devel mailing list
> > devel_at_[hidden]
> >

Jeff Squyres