Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: revamp topo framework
From: Luigi Scorzato (luigi.scorzato_at_[hidden])
Date: 2009-11-03 03:40:10

On 30 Oct 2009, at 20:28, Jeff Squyres wrote:

> What George is describing is the Right answer, but it may take you
> a little time.
> FWIW: the complexity of a topo component is actually pretty low.
> It's essentially a bunch of glue code (that I can probably mostly
> provide) and your mapping algorithms about how to reorder the
> communicator ranks.
> To be clear: topo components are *ONLY* about re-ordering ranks in
> a communicator -- the back-end of MPI_CART_CREATE and friends.
> The BTL components that George is talking about are Byte Transfer
> Layer components; essentially the brains behind MPI_SEND and
> friends. Open MPI has a per-device list of BTLs that can service
> each peer MPI process. Hence, if you're sending to another MPI
> process on the same host, the first BTL in the list will be the
> shared memory BTL. If you're sending to an MPI process on a
> different server that you're connected to via ethernet, the TCP BTL
> may be at the top of the list. And so on.
> It sounds like you actually want to make *two* components:
> - topo: for reordering ranks during MPI_CART_CREATE and friends
> - btl: use the underlying network primitives for sending when possible
> As George indicated, the BTL module in each MPI process can
> determine during startup which MPI process peers it can talk to.
> It can then tell the upper-layer routing algorithm "I can talk to
> peer processes X, Y, and Z -- I cannot talk to peer processes A, B,
> and C". The upper-layer router (the PML module) will then put your
> BTL at the top of the list for peer processes X, Y, and Z, and will
> not put your BTL on the list for peer processes A, B, and C. For
> A, B, and C, other BTLs will be used (e.g., TCP).
> Does that make sense?
> To answer your question from a prior mail: the unity topo component
> is used for the remapping of ranks in MPI_CART_CREATE. Look in
> ompi/mca/topo/unity/.

Thanks to everybody for the clarifications. The function I was
looking for is mca_topo_base_cart_create() in
ompi/mca/topo/base/topo_base_cart_create.c. More precisely, I needed
the loop:

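    p = topo_data->mtc_dims_or_index;
    coords = topo_data->mtc_coords;
    dummy_rank = *new_rank;
    for (i = 0;
         (i < topo_data->mtc_ndims_or_nnodes && i < ndims);
         ++i, ++p) {
        dim = *p;
        nprocs /= dim;                    /* ranks spanned by one step in dimension i */
        *coords++ = dummy_rank / nprocs;  /* coordinate along dimension i */
        dummy_rank %= nprocs;             /* remainder goes to the later dimensions */
    }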

This defines the precise relation between ranks and coordinates. Once
I know this, I do not even need to write a topo component, because I
can assign the ranks of my computing nodes in a rankfile so that each
node gets the coordinates that match its physical position.
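
To make the relation explicit, here is a minimal standalone sketch of
the same row-major mapping and its inverse. The function names
rank_to_coords() and coords_to_rank() are hypothetical and are not
part of Open MPI; the sketch assumes all dimension sizes are positive
and that ranks are not reordered.

```c
#include <assert.h>

/* Hypothetical helper (not Open MPI API): split a rank into Cartesian
 * coordinates using the same row-major arithmetic as the quoted loop. */
void rank_to_coords(int rank, int ndims, const int *dims, int *coords)
{
    int nprocs = 1;
    for (int i = 0; i < ndims; ++i) {
        assert(dims[i] > 0);
        nprocs *= dims[i];               /* total number of grid points */
    }
    assert(rank >= 0 && rank < nprocs);

    for (int i = 0; i < ndims; ++i) {
        nprocs /= dims[i];               /* ranks spanned by one step in dimension i */
        coords[i] = rank / nprocs;       /* coordinate along dimension i */
        rank %= nprocs;                  /* remainder goes to the later dimensions */
    }
}

/* Hypothetical inverse: the rank a node needs in order to end up at
 * the given coordinates, e.g. when preparing a rankfile. */
int coords_to_rank(int ndims, const int *dims, const int *coords)
{
    int rank = 0;
    for (int i = 0; i < ndims; ++i) {
        assert(coords[i] >= 0 && coords[i] < dims[i]);
        rank = rank * dims[i] + coords[i];
    }
    return rank;
}
```

For example, on a 2 x 3 x 4 grid, rank 17 maps to coordinates
(1, 1, 1), and coords_to_rank() maps (1, 1, 1) back to 17. The last
dimension varies fastest.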

A different issue is the BTL component. This is actually where my
approaches 1 and 2 differ (my previous distinction was confusing, due
to my lack of understanding of the difference between topo and btl).
In the 1st approach I would redefine some MPI functions that are
crucial for my code so that they call the low-level torus primitives
when the communication occurs between nearest neighbors, and fall
back to the Open MPI functions otherwise.
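
The dispatch decision in this 1st approach comes down to a
nearest-neighbor test on the torus. A minimal sketch of such a test
follows, assuming every dimension is periodic; the name
torus_nearest_neighbors() is hypothetical, and the torus primitives
themselves are not shown.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: return 1 if coordinates a and b are exactly one
 * hop apart on a torus with the given dimension sizes, 0 otherwise.
 * Assumes every dimension wraps around (fully periodic torus). */
int torus_nearest_neighbors(int ndims, const int *dims,
                            const int *a, const int *b)
{
    int hops = 0;
    for (int i = 0; i < ndims; ++i) {
        assert(dims[i] > 0);
        int d = abs(a[i] - b[i]);
        if (d > dims[i] - d) {
            d = dims[i] - d;             /* shorter way around the ring */
        }
        hops += d;
    }
    return hops == 1;
}
```

A redefined send would call the torus primitive when this test
returns 1 and fall back to the regular MPI call otherwise.
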
The 2nd approach would be to develop our own torus BTL. The fact that
one can choose a "priority list of networks" is definitely great and
dispels my worries about the feasibility of the 2nd approach in my
case. The only remaining question is whether I can get familiar with
the BTL internals fast enough. What would you suggest I read in order
to learn quickly how to create a BTL component?

Many thanks and best regards, Luigi