Open MPI logo

Open MPI Development Mailing List Archives

  |   Home   |   Support   |   FAQ   |   all Development mailing list

Subject: Re: [OMPI devel] RTE Issue IV: RTE/MPI relative modex responsibilities
From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-12-06 10:14:09


On 12/6/07 8:09 AM, "Shipman, Galen M." <gshipman_at_[hidden]> wrote:

> Sorry, to be clear that should have been:
>
>> One option is for the RTE to just pass in an enviro variable with a
>> comma-separated list of your local ranks, but that creates a problem down
>> the road when trying to integrate tighter with systems like SLURM where the
>> procs would get mass-launched (so the environment has to be the same for all
>> of them).
>>
> Having an enviro variable with a comma-seperated list of local ranks seems
> like a bit of a hack to me.

No argument - just trying to offer options for consideration. Not advocating
any of them yet. I'm still hoping for the "perfect solution" to show itself,
but I personally expect an acceptable compromise is the most likely
scenario.

>
>>>
>>> So my vote would be to leave the modex alone, but remove the node id,
>>> and add a function to get the list of local procs. It doesn't matter to
>>> me how the RTE implements that.
>>
>> I think we would need to be careful here that we don't create a need for
>> more communication. We have two functions currently in the modex:
>>
>> 1. how to exchange the info required to populate the ompi_proc_t structures;
>> and
>>
>> 2. how to identify which of those procs are "local"
>>
>> The problem with leaving the modex as it currently sits is that some
>> environments require a different mechanism for exchanging the ompi_proc_t
>> info. While most can use the RML, some can't. The same division of
>> capabilities applies to getting the "local" info, so it makes sense to me to
>> put the modex in a framework.
>>
>> Otherwise, we wind up with a bunch of #if's in the code to support
>> environments like the Cray. I believe the mca system was put in place
>> precisely to avoid those kind of practices, so it makes sense to me to take
>> advantage of it.
>>
>>
>>>
>>> Alternatively, if we did a process attribute system we could just use
>>> predefined attributes, and the runtime can get each process's node id
>>> however it wants.
>>
>> Same problem as above, isn't it? Probably ignorance on my part, but it seems
>> to me that we simply exchange a modex framework for an attribute framework
>> (since each environment would have to get the attribute values in a
>> different manner) - don't we?
>>
>> I have no problem with using attributes instead of the modex, but the issue
>> appears to be the same either way - you still need a framework to handle the
>> different methods.
>>
>>
>> Ralph
>>
>>>
>>> Tim
>>>
>>> Ralph H Castain wrote:
>>>> IV. RTE/MPI relative modex responsibilities
>>>> The modex operation conducted during MPI_Init currently involves the
>>>> exchange of two critical pieces of information:
>>>>
>>>> 1. the location (i.e., node) of each process in my job so I can determine
>>>> who shares a node with me. This is subsequently used by the shared memory
>>>> subsystem for initialization and message routing; and
>>>>
>>>> 2. BTL contact info for each process in my job.
>>>>
>>>> During our recent efforts to further abstract the RTE from the MPI layer,
>>>> we
>>>> pushed responsibility for both pieces of information into the MPI layer.
>>>> This wasn't done capriciously - the modex has always included the exchange
>>>> of both pieces of information, and we chose not to disturb that situation.
>>>>
>>>> However, the mixing of these two functional requirements does cause
>>>> problems
>>>> when dealing with an environment such as the Cray where BTL information is
>>>> "exchanged" via an entirely different mechanism. In addition, it has been
>>>> noted that the RTE (and not the MPI layer) actually "knows" the node
>>>> location for each process.
>>>>
>>>> Hence, questions have been raised as to whether:
>>>>
>>>> (a) the modex should be built into a framework to allow multiple BTL
>>>> exchange mechansims to be supported, or some alternative mechanism be used
>>>> -
>>>> one suggestion made was to implement an MPICH-like attribute exchange; and
>>>>
>>>> (b) the RTE should absorb responsibility for providing a "node map" of the
>>>> processes in a job (note: the modex may -use- that info, but would no
>>>> longer
>>>> be required to exchange it). This has a number of implications that need to
>>>> be carefully considered - e.g., the memory required to store the node map
>>>> in
>>>> every process is non-zero. On the other hand:
>>>>
>>>> (i) every proc already -does- store the node for every proc - it is simply
>>>> stored in the ompi_proc_t structures as opposed to somewhere in the RTE. We
>>>> would want to avoid duplicating that storage, but there would be no change
>>>> in memory footprint if done carefully.
>>>>
>>>> (ii) every daemon already knows the node map for the job, so communicating
>>>> that info to its local procs may not prove a major burden. However, the
>>>> very
>>>> environments where this subject may be an issue (e.g., the Cray) do not use
>>>> our daemons, so some alternative mechanism for obtaining the info would be
>>>> required.
>>>>
>>>>
>>>> So the questions to be considered here are:
>>>>
>>>> (a) do we leave the current modex "as-is", to include exchange of the node
>>>> map, perhaps including "#if" statements to support different exchange
>>>> mechanisms?
>>>>
>>>> (b) do we separate the two functions currently in the modex and push the
>>>> requirement to obtain a node map into the RTE? If so, how do we want the
>>>> MPI
>>>> layer to retrieve that info so we avoid increasing our memory footprint?
>>>>
>>>> (c) do we create a separate modex framework for handling the different
>>>> exchange mechanisms for BTL info, do we incorporate it into an existing one
>>>> (if so, which one), the new publish-subscribe framework, implement an
>>>> alternative approach, or...?
>>>>
>>>> (d) other suggestions?
>>>>
>>>> Ralph
>>>>
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel_at_[hidden]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>
>>
>>
>> _______________________________________________
>> devel mailing list
>> devel_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>