Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] Amateur Guidance
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-11-07 13:52:40


On Nov 7, 2008, at 11:41 AM, Timothy Hayes wrote:

> http://macneill.cs.tcd.ie/~hayesti/ompi.jpg

This is unfortunately not available to the outside world.

> N.B. The XEN component in the BTL layer represents what I'm trying
> to make.

So far so good, the BTL is what you need in order to move data between
MPI processes.
> When mpirun() is invoked, the following takes place
>
> 1. An out of band TCP channel is established between the
> process and every other process. This is located in the ORTE (Open
> Runtime Environment) -> MCA (Modular Component Architecture) -> OOB
> (Out of Band) -> TCP.

More or less. Usually each MPI process has an OOB channel to its
daemon, and the daemons are connected among themselves. Any OOB
message from one process to another goes through these daemons.

> 2. A PML (Point-to-Point Management Layer) is created,
> defaulting to 'ob1' which can handle multiple communication
> interfaces simultaneously. This is located in OMPI (Open MPI) -> MCA
> (Modular Component Architecture) -> PML (Point-to-Point Management
> Layer) -> ob1

In the MPI application, yes. There is a trick for matching-capable
hardware (PSM or MX), but let's consider only the simplest case for now.

>        3. 'ob1' attempts to set up one or more BTL (Byte Transfer
> Layer) components. These components establish a point of contact
> with another process for data transfer. Examples include loopback
> for a process talking to itself, shared memory for processes on the
> same machine, and TCP/IP for processes located on separate machines.
> There are also specialist components, such as InfiniBand, should the
> hardware and infrastructure become available.
> 4. Each component is cohesive and is responsible for finding
> the availability of resources specific to its own operation. Each
> component will return zero, one or many module instances depending
> on circumstance.
> 5. The out of band TCP channel is then used to communicate
> each process' instantiated modules to every other process.
>
>
> Questions with regard to the above
>
> Is the OOB channel permanent for the duration of mpirun()?
>
Yes, the OOB channel will exist as long as the job is running.
> I've read in places that the functions modex_send() & modex_recv()
> are used to communicate on the OOB channel, but I see
> mca_oob_tcp_send and mca_oob_tcp_recv declared in the header file.
> Is modex something else?
>
These are high-level functions that allow the components to send and
receive the information required for startup. What they do, well,
they gather some data from each process and propagate it globally. You
can see it as an allgather operation, a kind of global "business card"
exchange.
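
For example, a BTL typically publishes its own "business card" during
component initialization and looks up each peer's card later (e.g. in
add_procs). Below is a minimal sketch, assuming a hypothetical xen BTL;
the exact modex names and prototypes have moved around between
versions, so check ompi/runtime/ompi_module_exchange.h in the tree you
build against:

    #include <stdint.h>
    #include "opal/mca/mca.h"                       /* mca_base_component_t */
    #include "ompi/proc/proc.h"                     /* ompi_proc_t */
    #include "ompi/runtime/ompi_module_exchange.h"  /* ompi_modex_send/recv */

    /* Hypothetical "business card" published by the xen BTL. */
    struct mca_btl_xen_addr_t {
        uint32_t domain_id;
        uint32_t grant_ref;
    };

    /* Component init: publish our card; the modex propagates it to every
     * other process (the allgather described above).  "version" is
     * &<your component>.super.btl_version, i.e. the mca_base_component_t
     * that identifies this BTL in the modex. */
    static int mca_btl_xen_publish_address(mca_base_component_t *version,
                                           struct mca_btl_xen_addr_t *addr)
    {
        return ompi_modex_send(version, addr, sizeof(*addr));
    }

    /* Later (e.g. in add_procs): fetch the card a remote process published. */
    static int mca_btl_xen_lookup_address(mca_base_component_t *version,
                                          struct ompi_proc_t *proc,
                                          struct mca_btl_xen_addr_t **addr)
    {
        size_t size;
        return ompi_modex_recv(version, proc, (void **)addr, &size);
    }
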
> What exactly is queried and returned when a BTL component creates
> modules. For example, if I run 4 MPI processes on the same machine,
> will the sm component return 1 sm module to communicate with each
> other process or 3 sm modules to communicate with 1 distinct module?
>
sm will always return just one BTL module. Some devices (such as MX)
will return one BTL per rail (physical NIC). In your specific case I
would return only one BTL, and in add_procs I would only allow
connections to the processes on the same node (if what I understood
from your previous email is correct).
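
To make that concrete, here is a rough sketch of the shape of such a
component. Every mca_btl_xen_* name is a placeholder, and the exact
prototypes (the bitmap type, the "same node" flag) should be taken from
ompi/mca/btl/btl.h and ompi/proc/proc.h of whatever tree you build
against:

    #include <stdbool.h>
    #include "ompi/constants.h"          /* OMPI_SUCCESS */
    #include "ompi/proc/proc.h"          /* ompi_proc_t, proc_flags */
    #include "ompi/mca/btl/btl.h"        /* BTL module/component types */

    /* The component's single module instance (placeholder). */
    extern mca_btl_base_module_t mca_btl_xen_module;
    static mca_btl_base_module_t *mca_btl_xen_btls[1];

    /* Component init: hand back exactly one module, as sm does. */
    mca_btl_base_module_t **mca_btl_xen_component_init(int *num_btls,
                                                       bool enable_progress_threads,
                                                       bool enable_mpi_threads)
    {
        *num_btls = 1;
        mca_btl_xen_btls[0] = &mca_btl_xen_module;
        return mca_btl_xen_btls;
    }

    /* add_procs: mark only same-node peers as reachable, so ob1 never
     * routes off-node traffic through this BTL. */
    int mca_btl_xen_add_procs(struct mca_btl_base_module_t *btl,
                              size_t nprocs, struct ompi_proc_t **procs,
                              struct mca_btl_base_endpoint_t **endpoints,
                              opal_bitmap_t *reachable)
    {
        size_t i;
        for (i = 0; i < nprocs; i++) {
            /* The "is local" test and bitmap type differ slightly across
             * versions; adjust to the headers in your tree. */
            if (procs[i]->proc_flags & OMPI_PROC_FLAG_LOCAL) {
                opal_bitmap_set_bit(reachable, i);
                endpoints[i] = NULL;  /* create and cache a real endpoint here */
            }
        }
        return OMPI_SUCCESS;
    }

ob1 will then only hand this module traffic for the peers whose bit you
set in the reachability bitmap.
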
> Once again, those 5 points are really sparse and they're sparse
> because I don't know the detail myself. If anyone could shed some
> light on the process I would be really grateful.
>
> Kind regards
>
> Tim Hayes
>
>
> 2008/11/3 Jeff Squyres <jsquyres_at_[hidden]>
> On Nov 3, 2008, at 10:39 AM, Eugene Loh wrote:
>
> Main answer: no great docs to look at. I think I've asked some OMPI
> experts and that was basically the answer they gave me.
>
> This is unfortunately the current state of the art -- no one has had
> time to write up good docs.
>
> Galen pointed to the new papers -- our main PML these days is
> "ob1" (teg died a long time ago).
>
> PML = Point to point messaging layer; it's basically the layer that
> is right behind MPI_SEND and friends.
>
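
As a rough illustration of "right behind MPI_SEND": the C bindings
essentially just dispatch into the selected PML (the real code lives in
ompi/mpi/c/send.c; error checking and details are omitted here, so
treat this as an approximation):

    #include "mpi.h"
    #include "ompi/mca/pml/pml.h"   /* MCA_PML_CALL: dispatch into the selected PML */

    int MPI_Send(void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        /* ... argument checking omitted ... */

        /* With the default selection this lands in mca_pml_ob1_send(),
         * which fragments and schedules the message over the BTLs. */
        return MCA_PML_CALL(send(buf, count, type, dest, tag,
                                 MCA_PML_BASE_SEND_STANDARD, comm));
    }
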
> The ob1 PML uses BTL modules underneath. BTL = Byte transfer layer;
> individual modules that send bytes back and forth over individual
> transports (e.g., shared memory, TCP, openfabrics, etc.). There's a
> BTL for each of the major transports that we support. The protocols
> that ob1 uses are described nicely in the papers that Galen sent,
> but the specific function interfaces are best described only in
> ompi/mca/btl/btl.h.
>
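
For orientation, the core of that interface looks roughly as follows.
This is an abbreviated paraphrase only; ompi/mca/btl/btl.h remains the
authority and documents every callback:

    /* Abbreviated paraphrase of struct mca_btl_base_module_t from
     * ompi/mca/btl/btl.h -- not the full definition. */
    struct mca_btl_base_module_t {
        size_t   btl_eager_limit;      /* largest message sent eagerly */
        size_t   btl_max_send_size;    /* largest fragment for send/rendezvous */
        uint32_t btl_flags;            /* advertises send/put/get support */

        /* peer management, driven by ob1 (via the bml) */
        mca_btl_base_module_add_procs_fn_t btl_add_procs;
        mca_btl_base_module_del_procs_fn_t btl_del_procs;

        /* descriptor (fragment) management */
        mca_btl_base_module_alloc_fn_t     btl_alloc;
        mca_btl_base_module_free_fn_t      btl_free;
        mca_btl_base_module_prepare_fn_t   btl_prepare_src;
        mca_btl_base_module_prepare_fn_t   btl_prepare_dst;

        /* data movement */
        mca_btl_base_module_send_fn_t      btl_send;  /* active-message send */
        mca_btl_base_module_put_fn_t       btl_put;   /* RDMA write, optional */
        mca_btl_base_module_get_fn_t       btl_get;   /* RDMA read, optional */

        /* ... plus finalize, receive-callback registration, etc. ... */
    };
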
> Alternatively, we have a "cm" PML which uses MTL modules
> underneath. MTL = Matching transport layer; it's basically for
> transports that expose very MPI-like interfaces (e.g., elan, tports,
> PSM, portals, MX). This cm component is extremely thin; it
> basically provides a shim between Open MPI and the underlying
> transport.
>
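
The MTL interface is correspondingly small. Roughly (paraphrased from
memory; ompi/mca/mtl/mtl.h is the authority), it exposes matched
send/receive entry points, i.e. the transport itself performs the MPI
communicator/tag/source matching:

    /* Abbreviated paraphrase of struct mca_mtl_base_module_t from
     * ompi/mca/mtl/mtl.h -- not the full definition. */
    struct mca_mtl_base_module_t {
        mca_mtl_base_module_add_procs_fn_t mtl_add_procs;
        mca_mtl_base_module_del_procs_fn_t mtl_del_procs;
        mca_mtl_base_module_finalize_fn_t  mtl_finalize;

        mca_mtl_base_module_send_fn_t      mtl_send;    /* blocking matched send */
        mca_mtl_base_module_isend_fn_t     mtl_isend;   /* non-blocking send */
        mca_mtl_base_module_irecv_fn_t     mtl_irecv;   /* post a matched receive */
        mca_mtl_base_module_iprobe_fn_t    mtl_iprobe;  /* matched probe */
        mca_mtl_base_module_cancel_fn_t    mtl_cancel;

        /* ... plus request size, flags, etc. ... */
    };
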
> The big difference between cm and ob1 is that ob1 is a progress
> engine that tracks multiple transport interfaces (e.g., shared
> memory, tcp, openfabrics, ...etc. -- and therefore potentially
> multiple BTL module instances) and cm is a thin shim that simply
> translates between OMPI and the back-end interface -- cm will only
> use *ONE* MTL module instance. Specifically: it is expected that
> the one MTL module will do all the progression, striping, ...or
> whatever it wants to do to move bytes from A to B by itself (very
> little/no help at all from OMPI's infrastructure).
>
> Does that help some?
>
> --
> Jeff Squyres
> Cisco Systems
>
>
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


