Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-09-22 15:59:37


There was a long, drawn-out discussion about this early in 2007.
Rather than rehash all of it, I'll try to summarize it here. It may
get confusing - it helped a whole lot to be in a room with a
whiteboard. There were also presentations on the subject - I believe
the slides may still be in the docs repository.

Because the terminology quickly gets confusing, we adopted slightly
different terms for these discussions. We talk about OMPI being a
"single cell" system - i.e., a job executed via mpirun can only span
nodes that are reachable by that mpirun. In a typical managed
environment, a cell aligns quite well with a "cluster". In an
unmanaged environment where the user provides a hostfile, the cell
will contain all of the nodes specified in the hostfile.
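
For illustration, such a hostfile might look like the sketch below
(the node names and slot counts are made up); every node listed
becomes part of the single cell that this mpirun controls:

    # hypothetical hostfile: these three nodes form one cell
    node01 slots=4
    node02 slots=4
    node03 slots=4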

We don't filter or abort for non-matching hostnames - if mpirun can
launch on that node, then great. What we don't support is asking
mpirun to remotely execute another mpirun on the frontend of another
cell in order to launch procs on the nodes in -that- cell, nor do we
ask mpirun to in any way manage (or even know about) any procs running
on a remote cell.

I see what you are saying about the ALPS node name. However, the field
you want to add doesn't have anything to do with accept/connect. The
orte_node_t object is used solely by mpirun to keep track of the node
pool it controls - i.e., the nodes upon which it is launching jobs.
Thus, the mpirun on cluster A will have "nidNNNN" entries it got from
its allocation, and the mpirun on cluster B will have "nidNNNN"
entries it got from its allocation - but the two mpiruns will never
exchange that information, nor will the mpirun on cluster A ever have
a need to know the node entries for cluster B. Each mpirun launches
and manages procs -only- on the nodes in its own allocation.
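
As a standalone illustration (not ORTE code; the node names are
invented), each mpirun's node pool is self-contained, and the same
"nidNNNN" string can appear in both pools without either mpirun ever
noticing:

    /* Illustration only -- not ORTE code.  Each mpirun keeps its own
       node pool; "nid00012" can legitimately appear in both pools,
       naming two physically different machines. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *pool_a[] = { "nid00012", "nid00013" };  /* mpirun on cluster A */
        const char *pool_b[] = { "nid00012", "nid00047" };  /* mpirun on cluster B */

        if (strcmp(pool_a[0], pool_b[0]) == 0) {
            printf("\"%s\" appears in both allocations, but each mpirun "
                   "only ever sees its own pool\n", pool_a[0]);
        }
        return 0;
    }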

I agree you will have issues when doing the connect/accept modex as
the nodenames are exchanged and are no longer unique in your scenario.
However, that info stays in the ompi_proc_t - it never gets
communicated to the ORTE layer as we couldn't care less down there
about the remote procs since they are under the control of a different
mpirun. So if you need to add a cluster id field for this purpose, it
needs to go in ompi_proc_t - not in the orte structures.
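
Purely as a hypothetical sketch (none of these names exist in the OMPI
code base), the idea would be a cluster-qualified identity carried with
the proc-level information exchanged in the connect/accept modex:

    /* Hypothetical sketch -- not actual ompi_proc_t code.  The cluster
       qualifier would travel with the per-proc info exchanged at the
       OMPI layer during connect/accept; ORTE never sees it. */
    #include <stdio.h>

    struct example_peer {
        const char *cluster_id;   /* assumed label for the cell, e.g. "clusterA" */
        const char *nodename;     /* what the node reports, e.g. "nid00012"      */
    };

    int main(void)
    {
        struct example_peer peer = { "clusterA", "nid00012" };
        char qualified[128];

        /* cluster-qualified identity, unique across cells */
        snprintf(qualified, sizeof(qualified), "%s.%s",
                 peer.cluster_id, peer.nodename);
        printf("peer identity: %s\n", qualified);   /* clusterA.nid00012 */
        return 0;
    }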

And for that, you probably need to discuss it with the MPI team as
changes to ompi_proc_t will likely generate considerable discussion.

FWIW: this is one reason I warned Galen about the problems in reviving
multi-cluster operations again. We used to deal with multi-cells in
the process name itself, but all that support has been removed from
OMPI.

Hope that helps
Ralph

On Sep 22, 2008, at 1:39 PM, Matney Sr, Kenneth D. wrote:

> I may be opening a can of worms...
>
> But, what prevents a user from running across clusters in a "normal
> OMPI", i.e., non-ALPS environment? When he puts hosts into his
> hostfile, does it parse and abort/filter non-matching hostnames? The
> problem for ALPS-based systems is that nodes are addressed via NID,PID
> pairs at the portals level. Thus, these are unique only within a
> cluster. In point of fact, I could rewrite all of the ALPS support to
> identify the nodes by "cluster_id".NID. It would be a bit inefficient
> within a cluster because we would have to extract the NID from this
> syntax as we go down to the portals layer. It also would lead to a
> larger degree of change within the OMPI ALPS code base. However, I
> can give ALPS-based systems the same feature set as the rest of the
> world. It is just more efficient to use an additional pointer in the
> orte_node_t structure, and it results in a far simpler code structure
> that is easier to maintain.
>
> The only thing that "this change" really does is identify the cluster
> under which the ALPS allocation is made. If you are addressing a node
> in another cluster (e.g., via accept/connect), the clustername/NID
> pair is unique for ALPS, just as a hostname on a normal cluster node
> is unique between clusters. If you do a gethostname() on a normal
> cluster node, you are going to get mynameNNNNN, or something similar.
> If you do a gethostname() on an ALPS node, you are going to get
> nidNNNNN; there is no differentiation between cluster A and cluster B.
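
A quick way to see this (a sketch, assuming a POSIX system and a
Cray/ALPS compute node whose kernel hostname is set to the nid):

    /* Sketch: print what the node reports about itself.  On an ALPS
       compute node this is typically just "nidNNNNN", with nothing
       that identifies which cluster the node belongs to. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char name[256];
        if (gethostname(name, sizeof(name)) == 0) {
            printf("gethostname(): %s\n", name);   /* e.g. nid00012 */
        }
        return 0;
    }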
>
> Perhaps my earlier comment was not accurate. In reality, it provides
> the same degree of identification for ALPS nodes as hostname provides
> for normal clusters. From your perspective, it is immaterial that it
> also would allow us to support our limited form of multi-cluster
> operation. However, in and of itself, it only provides the same level
> of identification as is done for other cluster nodes.
> --
> Ken
>
>
> -----Original Message-----
> From: Ralph Castain [mailto:rhc_at_[hidden]]
> Sent: Monday, September 22, 2008 2:33 PM
> To: Open MPI Developers
> Cc: Matney Sr, Kenneth D.
> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
>
> The issue isn't with adding a string. The question is whether or not
> OMPI is to support one job running across multiple clusters. We made a
> conscious decision (after lengthy discussions on OMPI core and ORTE
> mailing lists, plus several telecons) to not do so - we require that
> the job execute on a single cluster, while allowing connect/accept to
> occur between jobs on different clusters.
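
For reference, the supported shape of that model in plain MPI terms is
roughly the following (a sketch only; error handling and the
out-of-band exchange of the port string are omitted) - two jobs, each
launched by its own mpirun on its own cluster, joined after the fact:

    /* Sketch: two independently launched jobs joined via the standard
       MPI dynamic-process calls.  Run one copy with an argument (the
       "accept" side), feed its printed port string to the other copy. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm inter;

        MPI_Init(&argc, &argv);

        if (argc > 1) {                       /* job A: accept side  */
            MPI_Open_port(MPI_INFO_NULL, port);
            printf("%s\n", port);             /* hand this to job B  */
            fflush(stdout);
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
            MPI_Close_port(port);
        } else {                              /* job B: connect side */
            fgets(port, sizeof(port), stdin); /* port string from job A */
            port[strcspn(port, "\n")] = '\0';
            MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        }

        MPI_Comm_disconnect(&inter);
        MPI_Finalize();
        return 0;
    }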
>
> It is difficult to understand why we need a string (or our old "cell
> id") to tell us which cluster we are on if we are only following that
> operating model. From the commit comment, and from what I know of the
> system, the only rationale for adding such a designator is to shift
> back to the one-mpirun-spanning-multiple-cluster model.
>
> If we are now going to make that change, then it merits the same
> level of consideration as the last decision to move away from that
> model. Making that move involves considerably more than just adding a
> cluster id string. It may seem that simple now, but the next step is
> inevitably to bring back remote launch, killing jobs on all clusters
> when one cluster has a problem, etc.
>
> Before we go down this path and re-open Pandora's box, we should at
> least agree that is what we intend to do...or agree on what hard
> constraints we will place on multi-cluster operations. Frankly, I'm
> tired of bouncing back-and-forth on even the most basic design
> decisions.
>
> Ralph
>
>
>
> On Sep 22, 2008, at 11:55 AM, Richard Graham wrote:
>
>> What Ken put in is what is needed for the limited multi-cluster
>> capabilities we are after - just one additional string. I don't
>> think there is a need for any discussion of such a small change.
>>
>> Rich
>>
>>
>> On 9/22/08 1:32 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>
>>> We really should discuss that as a group first - there is quite a
>>> bit
>>> of code required to actually support multi-clusters that has been
>>> removed.
>>>
>>> Our operational model that was agreed to quite a while ago is that
>>> mpirun can -only- extend over a single "cell". You can
>>> connect/accept multiple mpiruns that are sitting on different
>>> cells, but you cannot execute a single mpirun across multiple cells.
>>>
>>> Please keep this on your own development branch for now. Bringing it
>>> into the trunk will require discussion as this changes the operating
>>> model, and has significant code consequences when we look at
>>> abnormal
>>> terminations, comm_spawn, etc.
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Sep 22, 2008, at 11:26 AM, Richard Graham wrote:
>>>
>>>> This check-in was in error - I had not realized that the checkout
>>>> was from the 1.3 branch, so we will fix this and put these changes
>>>> into the trunk (1.4). We are going to bring in some limited
>>>> multi-cluster support - limited is the operative word.
>>>>
>>>> Rich
>>>>
>>>>
>>>> On 9/22/08 12:50 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>
>>>>> I notice that Ken Matney (the committer) is not on the devel
>>>>> list; I
>>>>> added him explicitly to the CC line.
>>>>>
>>>>> Ken: please see below.
>>>>>
>>>>>
>>>>> On Sep 22, 2008, at 12:46 PM, Ralph Castain wrote:
>>>>>
>>>>>> Whoa! We made a decision NOT to support multi-cluster apps in
>>>>>> OMPI
>>>>>> over a year ago!
>>>>>>
>>>>>> Please remove this from 1.3 - we should discuss if/when this
>>>>>> would
>>>>>> even be allowed in the trunk.
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>> On Sep 22, 2008, at 10:35 AM, matney_at_[hidden] wrote:
>>>>>>
>>>>>>> Author: matney
>>>>>>> Date: 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>> New Revision: 19600
>>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/19600
>>>>>>>
>>>>>>> Log:
>>>>>>> Added member to orte_node_t to enable multi-cluster jobs in ALPS
>>>>>>> scheduled systems (like Cray XT).
>>>>>>>
>>>>>>> Text files modified:
>>>>>>> branches/v1.3/orte/runtime/orte_globals.h | 4 ++++
>>>>>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>>>>>
>>>>>>> Modified: branches/v1.3/orte/runtime/orte_globals.h
>>>>>>> ==============================================================================
>>>>>>> --- branches/v1.3/orte/runtime/orte_globals.h (original)
>>>>>>> +++ branches/v1.3/orte/runtime/orte_globals.h 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>> @@ -222,6 +222,10 @@
>>>>>>> /** Username on this node, if specified */
>>>>>>> char *username;
>>>>>>> char *slot_list;
>>>>>>> + /** Clustername (machine name of cluster) on which this node
>>>>>>> + resides. ALPS scheduled systems need this to enable
>>>>>>> + multi-cluster support. */
>>>>>>> + char *clustername;
>>>>>>> } orte_node_t;
>>>>>>> ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_node_t);