Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RFC: revamp topo framework
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2009-11-03 06:17:32


On Nov 3, 2009, at 3:40 AM, Luigi Scorzato wrote:

> This defines the precise relation between ranks and coordinates. Once
> I know this, I do not even need to write a topo component, because I
> can define the ranks of my computing nodes in a rankfile in order
> that they get the coordinates that they need physically.
>

Fair enough. A topo component would make it unnecessary to lay out
your processes in a specific order because it could (hypothetically)
understand your physical topology and re-order the ranks accordingly.
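For reference, here is a minimal sketch of the rank <-> coordinate
relation when ranks are *not* re-ordered. It assumes the row-major
numbering that the MPI standard specifies for Cartesian topologies
(last dimension varies fastest) and a fully periodic torus. The
function names are made up for illustration; they are not Open MPI
or MPI functions.

```c
#include <assert.h>

/* Hypothetical helper: convert a rank to Cartesian coordinates using
 * row-major order (the last dimension varies fastest). */
void rank_to_coords(int rank, int ndims, const int dims[], int coords[])
{
    assert(rank >= 0 && ndims > 0);
    for (int d = ndims - 1; d >= 0; --d) {
        assert(dims[d] > 0);
        coords[d] = rank % dims[d];
        rank /= dims[d];
    }
}

/* Hypothetical helper: convert coordinates back to a rank.
 * Assumes every dimension is periodic (a torus), so out-of-range
 * coordinates wrap around instead of being rejected. */
int coords_to_rank(int ndims, const int dims[], const int coords[])
{
    int rank = 0;
    for (int d = 0; d < ndims; ++d) {
        assert(dims[d] > 0);
        int c = ((coords[d] % dims[d]) + dims[d]) % dims[d];
        rank = rank * dims[d] + c;
    }
    return rank;
}
```

With a layout like this you can work out which line of a rankfile
must name which physical node so that each rank lands on the
coordinates you want.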

> A different issue is the BTL component. This is actually where my
> approach 1 and 2 differ (my previous distinction was confusing, due
> to my lack of understanding of the distinction between topo and btl
> components).
>
> In the 1st approach I would redefine some crucial (for my code) MPI
> functions in a way that they call the low level torus primitives,
> when the communication occurs between nearest neighbors, and fall
> back to open-mpi functions otherwise.
> The 2nd approach would be to develop our torus-btl. The fact that one
> can choose a "priority list of networks" is definitely great and
> dissipates my worries about the feasibility of the 2nd approach in my
> case. The only remaining question is whether I can get familiar with
> btl stuff fast enough. What do you suggest I read in order to
> learn quickly how to create a BTL component?
>

The BTL is a bit more complicated than topo -- topo is actually pretty
straightforward. A BTL (Byte Transfer Layer) is a dumb byte-pusher
that is controlled by an upper-level framework: the Point-to-point
Messaging Layer (PML). The PML implements the semantics of MPI
point-to-point communication; PML components are the back-ends to
MPI_SEND and friends. The PML initializes the BTLs during MPI_INIT and
builds up the priority lists of networks, etc. Then during MPI_SEND
(etc.), the PML uses this information to decide what to do with each
message -- e.g., fragment it across multiple BTLs. It then calls the
BTL modules in question to actually do the send. On receive, the BTLs
make upcalls to the PML saying "here's a fragment; you handle it".

In this way, the BTLs are dumb byte pushers -- they simply send to and
receive from individual peers (without any MPI semantics at all) and
hand every fragment they receive to the PML, which then implements all
the MPI semantics.
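
To make the division of labor concrete, here is a toy model in plain
C. It is NOT the real interface (that is in btl.h and pml.h); every
type and function name below is invented for illustration. The "BTLs"
are loopback devices that deliver straight to the receiver's upcall,
and the "PML" simply stripes a message round-robin over them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Upcall type: how a toy BTL hands a received fragment upward. */
typedef void (*toy_recv_cb_t)(void *pml_ctx, size_t offset,
                              const void *data, size_t len);

/* A toy "BTL": it only knows how to push bytes to one peer. */
typedef struct {
    const char   *name;
    size_t        max_frag;    /* largest fragment it will send */
    size_t        bytes_sent;  /* bookkeeping for the example   */
    toy_recv_cb_t recv_cb;     /* upcall into the toy PML       */
    void         *recv_ctx;
} toy_btl_t;

/* Loopback "send": the fragment arrives immediately at the peer,
 * whose BTL makes the upcall ("here's a fragment; you handle it"). */
void toy_btl_send(toy_btl_t *btl, size_t offset,
                  const void *data, size_t len)
{
    btl->bytes_sent += len;
    btl->recv_cb(btl->recv_ctx, offset, data, len);
}

/* Toy PML receive state: reassembles fragments into one message. */
typedef struct {
    char  *buf;
    size_t capacity;
    size_t received;
} toy_pml_recv_t;

void toy_pml_recv_frag(void *ctx, size_t offset,
                       const void *data, size_t len)
{
    toy_pml_recv_t *r = ctx;
    assert(offset + len <= r->capacity);
    memcpy(r->buf + offset, data, len);
    r->received += len;
}

/* Toy PML send: split the message into fragments and stripe them
 * round-robin over the available BTLs. */
void toy_pml_send(toy_btl_t *btls, int nbtls,
                  const void *msg, size_t len)
{
    const char *p = msg;
    size_t offset = 0;
    int i = 0;

    assert(nbtls > 0);
    while (offset < len) {
        toy_btl_t *btl = &btls[i];
        size_t n = len - offset;
        assert(btl->max_frag > 0);
        if (n > btl->max_frag) {
            n = btl->max_frag;
        }
        toy_btl_send(btl, offset, (const void *)(p + offset), n);
        offset += n;
        i = (i + 1) % nbtls;
    }
}
```

Note that the toy BTL never looks inside the bytes: ordering,
reassembly, and (in the real thing) matching all live in the layer
above it. The real PML's scheduling is considerably more involved
than round-robin; this only shows the shape of the relationship.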

Read ompi/mca/btl/btl.h and ompi/mca/pml/pml.h for the details of the
interfaces.

Are your network's primitives like TCP (reads and writes can partially
complete), or are they like Myrinet / IB (messages are read and
written as discrete units, possibly asynchronously: you start a read
or write and later receive a completion call indicating that it
finished)?
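
To illustrate what I mean by "partially complete": with a stream
interface the caller has to loop until everything has gone out (or
come in). A minimal sketch, assuming POSIX blocking file descriptors;
write_all and read_full are made-up helper names, not part of any
library. With a message-based network this loop goes away, because
each message either completes as a whole or not at all.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes have
 * been written or a real error occurs. Returns len, or -1 on error. */
ssize_t write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR) {
                continue;   /* interrupted by a signal: just retry */
            }
            return -1;
        }
        done += (size_t)n;  /* partial write: advance and go again */
    }
    return (ssize_t)done;
}

/* Hypothetical helper: read until len bytes have arrived, EOF is
 * hit, or an error occurs. Returns bytes read, or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL || len == 0);
    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n == 0) {
            break;          /* EOF: peer closed the stream */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            return -1;
        }
        done += (size_t)n;  /* partial read: advance and go again */
    }
    return (ssize_t)done;
}
```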

-- 
Jeff Squyres
jsquyres_at_[hidden]