Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities
From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-12-06 10:14:09

On 12/6/07 8:09 AM, "Shipman, Galen M." <gshipman_at_[hidden]> wrote:

> Sorry, to be clear that should have been:
>> One option is for the RTE to just pass in an enviro variable with a
>> comma-separated list of your local ranks, but that creates a problem down
>> the road when trying to integrate tighter with systems like SLURM where the
>> procs would get mass-launched (so the environment has to be the same for all
>> of them).
> Having an enviro variable with a comma-separated list of local ranks seems
> like a bit of a hack to me.

No argument - just trying to offer options for consideration. Not advocating
any of them yet. I'm still hoping for the "perfect solution" to show itself,
but I personally expect an acceptable compromise is the most likely outcome.
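
For concreteness, here is a rough sketch of what the enviro-variable option might look like on the receiving side. The variable name OMPI_LOCAL_RANKS and both function names are hypothetical (nothing like this exists in the tree), and it does nothing to address the mass-launch problem noted above, since each node would still need a different value.

```c
#include <limits.h>
#include <stdlib.h>

/* Parse a comma-separated list of ranks, e.g. "0,1,4", into out[].
 * Returns the number of ranks parsed, 0 for a NULL or empty list,
 * or -1 on malformed input or if the list holds more than max entries. */
int parse_local_ranks(const char *list, int *out, int max)
{
    int n = 0;
    const char *p = list;

    if (list == NULL || *list == '\0') {
        return 0;
    }

    for (;;) {
        char *end;
        long v = strtol(p, &end, 10);

        if (end == p || v < 0 || v > INT_MAX) {
            return -1;              /* no digits, negative, or too large */
        }
        if (n >= max) {
            return -1;              /* caller's buffer is too small */
        }
        out[n++] = (int)v;

        if (*end == '\0') {
            break;                  /* consumed the whole list */
        }
        if (*end != ',') {
            return -1;              /* unexpected separator */
        }
        p = end + 1;                /* step past the comma */
    }
    return n;
}

/* Hypothetical: read the list from an OMPI_LOCAL_RANKS variable. */
int local_ranks_from_env(int *out, int max)
{
    return parse_local_ranks(getenv("OMPI_LOCAL_RANKS"), out, max);
}
```

A caller would hand in a buffer sized for the largest expected number of procs per node and treat a negative return as "fall back to some other mechanism".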

>>> So my vote would be to leave the modex alone, but remove the node id,
>>> and add a function to get the list of local procs. It doesn't matter to
>>> me how the RTE implements that.
>> I think we would need to be careful here that we don't create a need for
>> more communication. We have two functions currently in the modex:
>> 1. how to exchange the info required to populate the ompi_proc_t structures;
>> and
>> 2. how to identify which of those procs are "local"
>> The problem with leaving the modex as it currently sits is that some
>> environments require a different mechanism for exchanging the ompi_proc_t
>> info. While most can use the RML, some can't. The same division of
>> capabilities applies to getting the "local" info, so it makes sense to me to
>> put the modex in a framework.
>> Otherwise, we wind up with a bunch of #if's in the code to support
>> environments like the Cray. I believe the mca system was put in place
>> precisely to avoid those kinds of practices, so it makes sense to me to take
>> advantage of it.
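
To illustrate the framework-versus-#if point, a minimal sketch follows. It is not the actual MCA component structure: the type modex_component_t, the component names "rml" and "cray", and modex_select() are all hypothetical, and the stubs stand in for real exchange code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical component interface: one instance per environment.
 * Each component supplies both modex functions in its own way. */
typedef struct {
    const char *name;
    int (*exchange_proc_info)(void);             /* fill in per-proc info   */
    int (*get_local_procs)(int *ranks, int max); /* which procs share a node */
} modex_component_t;

/* Stub: an environment that can exchange the info over the RML. */
static int rml_exchange(void) { return 0; }
static int rml_local(int *ranks, int max) { (void)ranks; (void)max; return 0; }

/* Stub: an environment (e.g. the Cray) that uses a different mechanism. */
static int cray_exchange(void) { return 0; }
static int cray_local(int *ranks, int max) { (void)ranks; (void)max; return 0; }

static const modex_component_t components[] = {
    { "rml",  rml_exchange,  rml_local  },
    { "cray", cray_exchange, cray_local },
};

/* Pick a component by name at run time; returns NULL if none matches. */
const modex_component_t *modex_select(const char *name)
{
    size_t i;

    if (name == NULL) {
        return NULL;
    }
    for (i = 0; i < sizeof(components) / sizeof(components[0]); i++) {
        if (strcmp(components[i].name, name) == 0) {
            return &components[i];
        }
    }
    return NULL;
}
```

The calling code then only ever sees the function pointers, so supporting a new environment means adding a table entry rather than another #if block.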
>>> Alternatively, if we did a process attribute system we could just use
>>> predefined attributes, and the runtime can get each process's node id
>>> however it wants.
>> Same problem as above, isn't it? Probably ignorance on my part, but it seems
>> to me that we simply exchange a modex framework for an attribute framework
>> (since each environment would have to get the attribute values in a
>> different manner) - don't we?
>> I have no problem with using attributes instead of the modex, but the issue
>> appears to be the same either way - you still need a framework to handle the
>> different methods.
>> Ralph
>>> Tim
>>> Ralph H Castain wrote:
>>>> IV. RTE/MPI relative modex responsibilities
>>>> The modex operation conducted during MPI_Init currently involves the
>>>> exchange of two critical pieces of information:
>>>> 1. the location (i.e., node) of each process in my job so I can determine
>>>> who shares a node with me. This is subsequently used by the shared memory
>>>> subsystem for initialization and message routing; and
>>>> 2. BTL contact info for each process in my job.
>>>> During our recent efforts to further abstract the RTE from the MPI layer,
>>>> we pushed responsibility for both pieces of information into the MPI layer.
>>>> This wasn't done capriciously - the modex has always included the exchange
>>>> of both pieces of information, and we chose not to disturb that situation.
>>>> However, the mixing of these two functional requirements does cause
>>>> problems when dealing with an environment such as the Cray where BTL
>>>> information is "exchanged" via an entirely different mechanism. In
>>>> addition, it has been
>>>> noted that the RTE (and not the MPI layer) actually "knows" the node
>>>> location for each process.
>>>> Hence, questions have been raised as to whether:
>>>> (a) the modex should be built into a framework to allow multiple BTL
>>>> exchange mechanisms to be supported, or some alternative mechanism be used -
>>>> one suggestion made was to implement an MPICH-like attribute exchange; and
>>>> (b) the RTE should absorb responsibility for providing a "node map" of the
>>>> processes in a job (note: the modex may -use- that info, but would no longer
>>>> be required to exchange it). This has a number of implications that need to
>>>> be carefully considered - e.g., the memory required to store the node map in
>>>> every process is non-zero. On the other hand:
>>>> (i) every proc already -does- store the node for every proc - it is simply
>>>> stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
>>>> would want to avoid duplicating that storage, but there would be no change
>>>> in memory footprint if done carefully.
>>>> (ii) every daemon already knows the node map for the job, so communicating
>>>> that info to its local procs may not prove a major burden. However, the very
>>>> environments where this subject may be an issue (e.g., the Cray) do not use
>>>> our daemons, so some alternative mechanism for obtaining the info would be
>>>> required.
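
As a rough illustration of how the MPI layer could derive "who shares a node with me" from a node map without keeping a second copy of it, here is a small sketch. It assumes the map is available as a plain array indexed by rank; the function name and signature are hypothetical, and the real data would presumably live in the ompi_proc_t structures or the RTE.

```c
#include <stddef.h>

/* node_map[i] holds the node id of rank i (an assumed layout).
 * Writes the ranks that share a node with my_rank, excluding my_rank
 * itself, into peers[]. Returns the number of peers found, or -1 if
 * my_rank is out of range or peers[] is too small. */
int find_local_peers(const int *node_map, int nprocs, int my_rank,
                     int *peers, int max)
{
    int i;
    int n = 0;

    if (node_map == NULL || my_rank < 0 || my_rank >= nprocs) {
        return -1;
    }

    for (i = 0; i < nprocs; i++) {
        if (i == my_rank || node_map[i] != node_map[my_rank]) {
            continue;               /* skip myself and procs on other nodes */
        }
        if (n >= max) {
            return -1;              /* caller's buffer is too small */
        }
        peers[n++] = i;
    }
    return n;
}
```

The scan reads the map in place, so the only additional storage is the short list of local peers, which is what the shared memory subsystem needs.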
>>>> So the questions to be considered here are:
>>>> (a) do we leave the current modex "as-is", to include exchange of the node
>>>> map, perhaps including "#if" statements to support different exchange
>>>> mechanisms?
>>>> (b) do we separate the two functions currently in the modex and push the
>>>> requirement to obtain a node map into the RTE? If so, how do we want the MPI
>>>> layer to retrieve that info so we avoid increasing our memory footprint?
>>>> (c) do we create a separate modex framework for handling the different
>>>> exchange mechanisms for BTL info, do we incorporate it into an existing one
>>>> (if so, which one), the new publish-subscribe framework, implement an
>>>> alternative approach, or...?
>>>> (d) other suggestions?
>>>> Ralph
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]