>> Do we really need a complete node map? As far as I can tell, it looks
>> like the MPI layer only needs a list of local processes. So maybe it
>> would be better to forget about the node ids at the mpi layer and just
>> return the local procs.
> I agree, though I don't think we want a parallel list of procs. We just need
> to set the "local" flag in the existing ompi_proc_t structures.
Having a parallel list of procs makes perfect sense. That way ORTE can store
ORTE information in the orte_proc_t and OMPI can store OMPI information in
the ompi_proc_t. The ompi_proc_t could either "inherit" the orte_proc_t or
have a pointer to it so that we have no duplication of data.
Having a global map also makes sense, particularly for a number of
communication scenarios. If I know all the processes are on the same node, I
may send a message to the lowest "vpid" on that node and he could then forward to
> One option is for the RTE to just pass in an enviro variable with a
> comma-separated list of your local ranks, but that creates a problem down
> the road when trying to integrate tighter with systems like SLURM where the
> procs would get mass-launched (so the environment has to be the same for all
> of them).
Having an enviro variable with a comma-separated list of local ranks seems
like a bit of a hack to me.
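For reference, the consumer side of that enviro variable would look roughly like the sketch below. The function name sketch_parse_local_ranks is hypothetical, and I pass the string in as an argument rather than guessing at a variable name. Every process would need to carry this kind of parsing and error handling just to learn who its neighbors are.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a comma-separated list of non-negative ranks, e.g. "0,1,5".
 * Stores at most max entries in ranks[].
 * Returns the number of ranks parsed, or -1 on malformed input
 * (empty field, trailing comma, junk characters, overflow, or too
 * many entries). An empty string yields 0. */
int sketch_parse_local_ranks(const char *list, int *ranks, int max)
{
    const char *p = list;
    int n = 0;

    assert(list != NULL && ranks != NULL);
    while (*p != '\0') {
        char *end;
        long v;

        errno = 0;
        v = strtol(p, &end, 10);
        if (end == p || errno != 0 || v < 0 || v > INT_MAX || n >= max) {
            return -1;
        }
        ranks[n++] = (int)v;

        if (*end == ',') {
            p = end + 1;
            if (*p == '\0') {
                return -1;      /* trailing comma */
            }
        } else if (*end == '\0') {
            p = end;            /* clean end of list */
        } else {
            return -1;          /* unexpected character */
        }
    }
    return n;
}
```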
>> So my vote would be to leave the modex alone, but remove the node id,
>> and add a function to get the list of local procs. It doesn't matter to
>> me how the RTE implements that.
> I think we would need to be careful here that we don't create a need for
> more communication. We have two functions currently in the modex:
> 1. how to exchange the info required to populate the ompi_proc_t structures;
> 2. how to identify which of those procs are "local"
> The problem with leaving the modex as it currently sits is that some
> environments require a different mechanism for exchanging the ompi_proc_t
> info. While most can use the RML, some can't. The same division of
> capabilities applies to getting the "local" info, so it makes sense to me to
> put the modex in a framework.
> Otherwise, we wind up with a bunch of #if's in the code to support
> environments like the Cray. I believe the mca system was put in place
> precisely to avoid those kinds of practices, so it makes sense to me to take
> advantage of it.
>> Alternatively, if we did a process attribute system we could just use
>> predefined attributes, and the runtime can get each process's node id
>> however it wants.
> Same problem as above, isn't it? Probably ignorance on my part, but it seems
> to me that we simply exchange a modex framework for an attribute framework
> (since each environment would have to get the attribute values in a
> different manner) - don't we?
> I have no problem with using attributes instead of the modex, but the issue
> appears to be the same either way - you still need a framework to handle the
> different methods.
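To illustrate the framework argument, here is a minimal sketch of selecting an exchange mechanism at run time through a table of function pointers instead of #if blocks. All names here (sketch_modex_module_t, sketch_modex_select, the "rml" and "cray_like" modules) are hypothetical and this is not the real MCA interface. The module bodies are toy stubs that return fixed values so the selection logic can be seen on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical module interface covering the two modex functions:
 * exchanging proc info and identifying local procs.
 * This is NOT the real MCA API. */
typedef struct {
    const char *name;
    int (*exchange_proc_info)(void);
    int (*get_local_procs)(int *ranks, int max);
} sketch_modex_module_t;

/* Toy stub standing in for an RML-based exchange. */
static int rml_exchange(void) { return 1; }   /* toy tag for this path */
static int rml_get_local(int *ranks, int max)
{
    if (max < 2) return -1;
    ranks[0] = 0;                             /* fixed toy data */
    ranks[1] = 1;
    return 2;
}

/* Toy stub standing in for an environment with its own mechanism. */
static int cray_like_exchange(void) { return 2; }
static int cray_like_get_local(int *ranks, int max)
{
    if (max < 1) return -1;
    ranks[0] = 0;                             /* fixed toy data */
    return 1;
}

/* Pick a module by name; returns NULL if no module matches. */
const sketch_modex_module_t *sketch_modex_select(const char *name)
{
    static const sketch_modex_module_t modules[] = {
        { "rml",       rml_exchange,       rml_get_local },
        { "cray_like", cray_like_exchange, cray_like_get_local },
    };

    assert(name != NULL);
    for (size_t i = 0; i < sizeof(modules) / sizeof(modules[0]); i++) {
        if (strcmp(modules[i].name, name) == 0) {
            return &modules[i];
        }
    }
    return NULL;
}
```

The caller only ever sees the two function pointers, so the same call site works whether the info comes over the RML or from some other mechanism. The same shape would apply if the interface handed back attributes rather than modex data.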
>> Ralph H Castain wrote:
>>> IV. RTE/MPI relative modex responsibilities
>>> The modex operation conducted during MPI_Init currently involves the
>>> exchange of two critical pieces of information:
>>> 1. the location (i.e., node) of each process in my job so I can determine
>>> who shares a node with me. This is subsequently used by the shared memory
>>> subsystem for initialization and message routing; and
>>> 2. BTL contact info for each process in my job.
>>> During our recent efforts to further abstract the RTE from the MPI layer, we
>>> pushed responsibility for both pieces of information into the MPI layer.
>>> This wasn't done capriciously - the modex has always included the exchange
>>> of both pieces of information, and we chose not to disturb that situation.
>>> However, the mixing of these two functional requirements does cause problems
>>> when dealing with an environment such as the Cray where BTL information is
>>> "exchanged" via an entirely different mechanism. In addition, it has been
>>> noted that the RTE (and not the MPI layer) actually "knows" the node
>>> location for each process.
>>> Hence, questions have been raised as to whether:
>>> (a) the modex should be built into a framework to allow multiple BTL
>>> exchange mechanisms to be supported, or some alternative mechanism be used -
>>> one suggestion made was to implement an MPICH-like attribute exchange; and
>>> (b) the RTE should absorb responsibility for providing a "node map" of the
>>> processes in a job (note: the modex may -use- that info, but would no longer
>>> be required to exchange it). This has a number of implications that need to
>>> be carefully considered - e.g., the memory required to store the node map in
>>> every process is non-zero. On the other hand:
>>> (i) every proc already -does- store the node for every proc - it is simply
>>> stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
>>> would want to avoid duplicating that storage, but there would be no change
>>> in memory footprint if done carefully.
>>> (ii) every daemon already knows the node map for the job, so communicating
>>> that info to its local procs may not prove a major burden. However, the very
>>> environments where this subject may be an issue (e.g., the Cray) do not use
>>> our daemons, so some alternative mechanism for obtaining the info would be
>>> So the questions to be considered here are:
>>> (a) do we leave the current modex "as-is", to include exchange of the node
>>> map, perhaps including "#if" statements to support different exchange
>>> (b) do we separate the two functions currently in the modex and push the
>>> requirement to obtain a node map into the RTE? If so, how do we want the MPI
>>> layer to retrieve that info so we avoid increasing our memory footprint?
>>> (c) do we create a separate modex framework for handling the different
>>> exchange mechanisms for BTL info, do we incorporate it into an existing one
>>> (if so, which one), the new publish-subscribe framework, implement an
>>> alternative approach, or...?
>>> (d) other suggestions?
>>> devel mailing list