Open MPI Development Mailing List Archives

This page is part of a frozen web archive of this mailing list.

You can still navigate around this archive, but know that no new mails have been added to it since July of 2016.

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
From: Ralph Castain (rhc_at_[hidden])
Date: 2008-09-22 14:33:20


The issue isn't with adding a string. The question is whether or not
OMPI is to support one job running across multiple clusters. We made a
conscious decision (after lengthy discussions on OMPI core and ORTE
mailing lists, plus several telecons) to not do so - we require that
the job execute on a single cluster, while allowing connect/accept to
occur between jobs on different clusters.

It is difficult to understand why we need a string (or our old "cell
id") to tell us which cluster we are on if we are only following that
operating model. From the commit comment, and from what I know of the
system, the only rationale for adding such a designator is to shift
back to the one-mpirun-spanning-multiple-cluster model.
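To make the point about the designator concrete, here is a minimal sketch in C. It assumes a simplified, hypothetical node record (`demo_node_t`) and a hypothetical helper (`demo_job_is_single_cluster`); neither is part of ORTE, and the real `orte_node_t` has many more members. Under the single-cluster model the check below would hold for every job, so a per-node cluster name would carry no extra information.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a node record.
 * This is NOT the real orte_node_t; it only mirrors the idea of
 * a per-node cluster name string. */
typedef struct {
    const char *name;         /* node name */
    const char *clustername;  /* cluster the node resides on; may be NULL */
} demo_node_t;

/* Hypothetical helper: returns 1 if all nodes of a job report the same
 * cluster name (or all leave it unset), 0 otherwise.
 * An empty job is treated as trivially single-cluster. */
int demo_job_is_single_cluster(const demo_node_t *nodes, size_t n)
{
    const char *first;
    size_t i;

    if (n == 0) {
        return 1;
    }
    first = nodes[0].clustername;
    for (i = 1; i < n; i++) {
        const char *c = nodes[i].clustername;
        if (first == NULL || c == NULL) {
            /* A set name never matches an unset one. */
            if (first != c) {
                return 0;
            }
        } else if (strcmp(first, c) != 0) {
            return 0;
        }
    }
    return 1;
}
```

A job whose nodes all carry the same name passes the check, while a node list mixing two names fails it. Only the second case, one mpirun spanning several clusters, gives the string anything to distinguish.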

If we are now going to make that change, then it merits a similar
level of consideration as the last decision to move away from that
model. Making that move involves considerably more than just adding a
cluster id string. It may seem that simple now, but the next step is
inevitably to bring back remote launch, killing jobs on all clusters
when one cluster has a problem, etc.

Before we go down this path and re-open Pandora's box, we should at
least agree that is what we intend to do...or agree on what hard
constraints we will place on multi-cluster operations. Frankly, I'm
tired of bouncing back-and-forth on even the most basic design
decisions.

Ralph

On Sep 22, 2008, at 11:55 AM, Richard Graham wrote:

> What Ken put in is what is needed for the limited multi-cluster
> capabilities we need, just one additional string. I don't think there
> is a need for any discussion of such a small change.
>
> Rich
>
>
> On 9/22/08 1:32 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>
>> We really should discuss that as a group first - there is quite a bit
>> of code required to actually support multi-clusters that has been
>> removed.
>>
>> Our operational model that was agreed to quite a while ago is that
>> mpirun can -only- extend over a single "cell". You can connect/accept
>> multiple mpiruns that are sitting on different cells, but you cannot
>> execute a single mpirun across multiple cells.
>>
>> Please keep this on your own development branch for now. Bringing it
>> into the trunk will require discussion as this changes the operating
>> model, and has significant code consequences when we look at abnormal
>> terminations, comm_spawn, etc.
>>
>> Thanks
>> Ralph
>>
>> On Sep 22, 2008, at 11:26 AM, Richard Graham wrote:
>>
>>> This check in was in error - I had not realized that the checkout was
>>> from the 1.3 branch, so we will fix this, and put these into the trunk
>>> (1.4). We are going to bring in some limited multi-cluster support -
>>> limited is the operative word.
>>>
>>> Rich
>>>
>>>
>>> On 9/22/08 12:50 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>
>>>> I notice that Ken Matney (the committer) is not on the devel list; I
>>>> added him explicitly to the CC line.
>>>>
>>>> Ken: please see below.
>>>>
>>>>
>>>> On Sep 22, 2008, at 12:46 PM, Ralph Castain wrote:
>>>>
>>>>> Whoa! We made a decision NOT to support multi-cluster apps in OMPI
>>>>> over a year ago!
>>>>>
>>>>> Please remove this from 1.3 - we should discuss if/when this would
>>>>> even be allowed in the trunk.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> On Sep 22, 2008, at 10:35 AM, matney_at_[hidden] wrote:
>>>>>
>>>>>> Author: matney
>>>>>> Date: 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>> New Revision: 19600
>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/19600
>>>>>>
>>>>>> Log:
>>>>>> Added member to orte_node_t to enable multi-cluster jobs in ALPS
>>>>>> scheduled systems (like Cray XT).
>>>>>>
>>>>>> Text files modified:
>>>>>> branches/v1.3/orte/runtime/orte_globals.h | 4 ++++
>>>>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>>>>
>>>>>> Modified: branches/v1.3/orte/runtime/orte_globals.h
>>>>>> ==============================================================================
>>>>>> --- branches/v1.3/orte/runtime/orte_globals.h (original)
>>>>>> +++ branches/v1.3/orte/runtime/orte_globals.h 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>> @@ -222,6 +222,10 @@
>>>>>> /** Username on this node, if specified */
>>>>>> char *username;
>>>>>> char *slot_list;
>>>>>> + /** Clustername (machine name of cluster) on which this node
>>>>>> + resides. ALPS scheduled systems need this to enable
>>>>>> + multi-cluster support. */
>>>>>> + char *clustername;
>>>>>> } orte_node_t;
>>>>>> ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_node_t);
>>>>>>
>>>>>> _______________________________________________
>>>>>> svn mailing list
>>>>>> svn_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>