Open MPI Development Mailing List Archives

Subject: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities
From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-12-04 10:47:35


IV. RTE/MPI relative modex responsibilities
The modex operation conducted during MPI_Init currently involves the
exchange of two critical pieces of information:

1. the location (i.e., node) of each process in my job so I can determine
who shares a node with me. This is subsequently used by the shared memory
subsystem for initialization and message routing; and

2. BTL contact info for each process in my job.
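As a rough illustration of item 1, the sketch below shows how a process
could work out which peers share its node once it holds a node name for
every process in the job. The function name find_local_peers and the
node_of array are hypothetical, chosen only for this example; they are not
part of Open MPI.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not an Open MPI function.
 * node_of[i] is the node name of rank i, as learned from the modex.
 * Writes the ranks that share a node with my_rank (excluding my_rank
 * itself) into local_peers and returns how many were found.
 * local_peers must have room for nprocs - 1 entries. */
static size_t find_local_peers(const char *const *node_of, size_t nprocs,
                               size_t my_rank, size_t *local_peers)
{
    size_t count = 0;

    assert(my_rank < nprocs);
    for (size_t i = 0; i < nprocs; i++) {
        if (i != my_rank && strcmp(node_of[i], node_of[my_rank]) == 0) {
            local_peers[count++] = i;
        }
    }
    return count;
}
```

The resulting list is the kind of input the shared memory subsystem would
need for its initialization.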

During our recent efforts to further abstract the RTE from the MPI layer, we
pushed responsibility for both pieces of information into the MPI layer.
This wasn't done capriciously - the modex has always included the exchange
of both pieces of information, and we chose not to disturb that situation.

However, the mixing of these two functional requirements does cause problems
when dealing with an environment such as the Cray where BTL information is
"exchanged" via an entirely different mechanism. In addition, it has been
noted that the RTE (and not the MPI layer) actually "knows" the node
location for each process.

Hence, questions have been raised as to whether:

(a) the modex should be built into a framework to allow multiple BTL
exchange mechanisms to be supported, or some alternative mechanism be used -
one suggestion made was to implement an MPICH-like attribute exchange; and

(b) the RTE should absorb responsibility for providing a "node map" of the
processes in a job (note: the modex may -use- that info, but would no longer
be required to exchange it). This has a number of implications that need to
be carefully considered - e.g., the memory required to store the node map in
every process is non-zero. On the other hand:

(i) every proc already -does- store the node for every proc - it is simply
stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
would want to avoid duplicating that storage, but there would be no change
in memory footprint if done carefully.

(ii) every daemon already knows the node map for the job, so communicating
that info to its local procs may not prove a major burden. However, the very
environments where this subject may be an issue (e.g., the Cray) do not use
our daemons, so some alternative mechanism for obtaining the info would be
required.
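To make the storage concern in (i) concrete, here is a minimal sketch of one
way an RTE-owned node map could be exposed so that the MPI layer reads it in
place instead of keeping its own copy. Everything here (node_map_t,
node_map_lookup, node_map_same_node) is a hypothetical name invented for
this example, and the layout is only one of several possibilities.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical RTE-owned node map; none of these names exist in Open MPI.
 * Each node name is stored once, and each rank holds only a small index. */
typedef struct {
    const char *const *node_names;  /* one entry per node in the job */
    size_t num_nodes;
    const uint32_t *node_index;     /* node_index[rank] -> slot in node_names */
    size_t num_procs;
} node_map_t;

/* Return the node name for a rank, or NULL if the rank is out of range.
 * The caller receives a pointer into the map's storage, not a copy. */
const char *node_map_lookup(const node_map_t *map, size_t rank)
{
    assert(map != NULL);
    if (rank >= map->num_procs) {
        return NULL;
    }
    uint32_t idx = map->node_index[rank];
    if (idx >= map->num_nodes) {
        return NULL;
    }
    return map->node_names[idx];
}

/* Return 1 if both ranks are valid and live on the same node, else 0. */
int node_map_same_node(const node_map_t *map, size_t a, size_t b)
{
    assert(map != NULL);
    if (a >= map->num_procs || b >= map->num_procs) {
        return 0;
    }
    return map->node_index[a] == map->node_index[b];
}
```

With an accessor of this kind, the per-process structures would not need to
hold a node field of their own, which is the "no change in memory footprint
if done carefully" case. How the map gets populated on systems without our
daemons is left open here, as it is in (ii).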

So the questions to be considered here are:

(a) do we leave the current modex "as-is", to include exchange of the node
map, perhaps including "#if" statements to support different exchange
mechanisms?

(b) do we separate the two functions currently in the modex and push the
requirement to obtain a node map into the RTE? If so, how do we want the MPI
layer to retrieve that info so we avoid increasing our memory footprint?

(c) do we create a separate modex framework for handling the different
exchange mechanisms for BTL info, incorporate it into an existing framework
(if so, which one), use the new publish-subscribe framework, implement an
alternative approach, or...?

(d) other suggestions?
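For discussion purposes, the difference between the "#if" approach in (a)
and the framework approach in (c) can be sketched as a table of exchange
modules selected at run time. This is only an illustration: the type
modex_module_t, the module names "daemon" and "platform", and the stub
exchange functions are all made up for this example and do not correspond
to existing Open MPI components.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical module interface: one entry per BTL-info exchange mechanism. */
typedef struct {
    const char *name;
    /* Publish this process's BTL info; returns 0 on success. */
    int (*exchange)(const void *local_info, size_t len);
} modex_module_t;

/* Stub: would exchange info through the RTE daemons. */
static int exchange_via_daemons(const void *local_info, size_t len)
{
    (void)local_info;
    (void)len;
    return 0;
}

/* Stub: would hand off to a platform-specific mechanism instead. */
static int exchange_via_platform(const void *local_info, size_t len)
{
    (void)local_info;
    (void)len;
    return 0;
}

static const modex_module_t modex_modules[] = {
    { "daemon",   exchange_via_daemons  },
    { "platform", exchange_via_platform },
};

/* Pick a module by name at run time; returns NULL if none matches. */
const modex_module_t *modex_select(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof(modex_modules) / sizeof(modex_modules[0]); i++) {
        if (strcmp(modex_modules[i].name, name) == 0) {
            return &modex_modules[i];
        }
    }
    return NULL;
}
```

The trade-off is the usual one: the table keeps platform-specific code out
of the common path and lets the choice be made at run time, at the cost of
defining and maintaining another interface.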

Ralph