
Open MPI Development Mailing List Archives


Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2008-09-22 16:54:25


I think the point is that as a group, we consciously, deliberately,
and painfully decided not to support multi-cluster. And as a result,
we ripped out a lot of supporting code. Starting down this path again
will likely result in (a) re-opening all of those discussions, and (b)
re-adding a lot of code (or code effectively similar to what was there
before). Let's not forget that there were many unsolved problems
surrounding multi-cluster last time, too.

It was also pointed out in Ralph's mails that, at least from the
descriptions provided, adding the field in orte_node_t does not
actually solve the problem that ORNL is trying to solve.

If we, as a group, decide to re-add all this stuff, then (a) we should
recognize that we are flip-flopping *again* on this issue, and (b) it
will take a lot of coding effort to do so. I do think that since this
was a group decision last time, it should be a group decision this
time, too. If this does turn out to be as large a sub-project as
described, I would be opposed to the development occurring on the
trunk; hg trees are perfect for this kind of stuff.

I personally have no customers who are doing cross-cluster kinds of
things, so I don't personally care if cross-cluster functionality
works its way [back] in. But I recognize that OMPI core members are
investigating it. So the points I'm making are procedural; I have no
real dog in this fight...

On Sep 22, 2008, at 4:40 PM, George Bosilca wrote:

> Ralph,
>
> There is NO need to have this discussion again; it was painful
> enough last time. From my perspective, I do not understand why you
> are making so much noise about this one. How a 4-line change in some
> ALPS-specific files (for a Cray system very specific to ORNL) can
> generate more than 3 A4 pages of email is still beyond my
> comprehension.
>
> If they want to do multi-cluster, and they do not break anything in
> ORTE/OMPI, and they do not ask other people to do it for them, why
> try to stop them?
>
> george.
>
> On Sep 22, 2008, at 3:59 PM, Ralph Castain wrote:
>
>> There was a very long drawn-out discussion about this early in
>> 2007. Rather than rehash all that, I'll try to summarize it here.
>> It may get confusing - it helped a whole lot to be in a room with a
>> whiteboard. There were also presentations on the subject - I
>> believe the slides may still be in the docs repository.
>>
>> Because terminology quickly gets confusing, we adopted slightly
>> different terminology for these discussions. We talk about OMPI
>> being a "single cell" system - i.e., jobs executed via mpirun can
>> only span nodes that are reachable by that mpirun. In a typical
>> managed environment, a cell aligns quite well with a "cluster". In
>> an unmanaged environment where the user provides a hostfile, the
>> cell will contain all nodes specified in the hostfile.
>>
>> We don't filter or abort for non-matching hostnames - if mpirun can
>> launch on that node, then great. What we don't support is asking
>> mpirun to remotely execute another mpirun on the frontend of
>> another cell in order to launch procs on the nodes in -that- cell,
>> nor do we ask mpirun to in any way manage (or even know about) any
>> procs running on a remote cell.
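
For reference, a minimal sketch of the connect/accept pattern described
above: two independently launched jobs, one per cell, joining via the
standard MPI-2 dynamic process calls. The port string is assumed to be
exchanged out of band (e.g., copied from the server's output); this is
an illustration, not code from the OMPI tree.

    /* Minimal connect/accept sketch: run one copy with no arguments
     * (server), then another copy passing the printed port string
     * (client). Each copy can be launched by its own mpirun. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        MPI_Comm inter;
        char port[MPI_MAX_PORT_NAME];

        MPI_Init(&argc, &argv);

        if (argc > 1) {
            /* Client: connect to the port string supplied by the user. */
            strncpy(port, argv[1], MPI_MAX_PORT_NAME - 1);
            port[MPI_MAX_PORT_NAME - 1] = '\0';
            MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
        } else {
            /* Server: open a port and wait for the other job. */
            MPI_Open_port(MPI_INFO_NULL, port);
            printf("port: %s\n", port);
            fflush(stdout);
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
            MPI_Close_port(port);
        }

        MPI_Comm_disconnect(&inter);
        MPI_Finalize();
        return 0;
    }
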
>>
>> I see what you are saying about the ALPS node name. However, the
>> field you want to add doesn't have anything to do with accept/
>> connect. The orte_node_t object is used solely by mpirun to keep
>> track of the node pool it controls - i.e., the nodes upon which it
>> is launching jobs. Thus, the mpirun on cluster A will have
>> "nidNNNN" entries it got from its allocation, and the mpirun on
>> cluster B will have "nidNNNN" entries it got from its allocation -
>> but the two mpiruns will never exchange that information, nor will
>> the mpirun on cluster A ever have a need to know the node entries
>> for cluster B. Each mpirun launches and manages procs -only- on the
>> nodes in its own allocation.
>>
>> I agree you will have issues when doing the connect/accept modex as
>> the nodenames are exchanged and are no longer unique in your
>> scenario. However, that info stays in the ompi_proc_t - it never
>> gets communicated to the ORTE layer as we couldn't care less down
>> there about the remote procs since they are under the control of a
>> different mpirun. So if you need to add a cluster id field for this
>> purpose, it needs to go in ompi_proc_t - not in the orte structures.
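
To make the uniqueness point concrete, here is a standalone
illustration (deliberately not the real ompi_proc_t; the struct and
cluster names are invented): once two ALPS cells both report nodes
named "nidNNNNN", a hostname alone no longer identifies a peer's node,
but pairing it with a cluster identifier does.

    #include <stdio.h>
    #include <string.h>

    struct peer_node_id {
        const char *clustername;  /* hypothetical cluster qualifier */
        const char *hostname;     /* what gethostname() reports, e.g. "nid00042" */
    };

    static int same_node(const struct peer_node_id *a,
                         const struct peer_node_id *b)
    {
        return strcmp(a->clustername, b->clustername) == 0 &&
               strcmp(a->hostname, b->hostname) == 0;
    }

    int main(void)
    {
        struct peer_node_id x = { "clusterA", "nid00042" };
        struct peer_node_id y = { "clusterB", "nid00042" };

        /* Same hostname, different cells: comparing hostnames alone
         * would wrongly conclude these are the same node. */
        printf("hostnames match: %d, nodes match: %d\n",
               strcmp(x.hostname, y.hostname) == 0, same_node(&x, &y));
        return 0;
    }
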
>>
>> And for that, you probably need to discuss it with the MPI team as
>> changes to ompi_proc_t will likely generate considerable discussion.
>>
>> FWIW: this is one reason I warned Galen about the problems in
>> reviving multi-cluster operations again. We used to deal with multi-
>> cells in the process name itself, but all that support has been
>> removed from OMPI.
>>
>> Hope that helps
>> Ralph
>>
>> On Sep 22, 2008, at 1:39 PM, Matney Sr, Kenneth D. wrote:
>>
>>> I may be opening a can of worms...
>>>
>>> But what prevents a user from running across clusters in a "normal
>>> OMPI", i.e., non-ALPS, environment? When he puts hosts into his
>>> hostfile, does it parse and abort/filter non-matching hostnames?
>>> The problem for ALPS-based systems is that nodes are addressed via
>>> NID,PID pairs at the portals level. Thus, these are unique only
>>> within a cluster. In point of fact, I could rewrite all of the ALPS
>>> support to identify the nodes by "cluster_id".NID. It would be a
>>> bit inefficient within a cluster, because we would have to extract
>>> the NID from this syntax as we go down to the portals layer. It
>>> also would lead to a larger degree of change within the OMPI ALPS
>>> code base. However, I can give ALPS-based systems the same feature
>>> set as the rest of the world. It is just more efficient to use an
>>> additional pointer in the orte_node_t structure, and it results in
>>> a far simpler code structure, which makes it easier to maintain.
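
A rough sketch of the "cluster_id".NID encoding mentioned above, and
the extra parsing step the portals layer would need to recover the raw
NID inside a cluster. The name format and helper functions are
illustrative assumptions, not OMPI code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Build a qualified node name such as "clusterA.nid00042". */
    static void make_qualified_name(char *buf, size_t len,
                                    const char *cluster, unsigned nid)
    {
        snprintf(buf, len, "%s.nid%05u", cluster, nid);
    }

    /* Recover the numeric NID - the extra work paid on every use
     * within a cluster. */
    static int extract_nid(const char *qualified, unsigned *nid)
    {
        const char *p = strrchr(qualified, '.');
        if (p == NULL || strncmp(p, ".nid", 4) != 0) {
            return -1;
        }
        *nid = (unsigned) strtoul(p + 4, NULL, 10);
        return 0;
    }

    int main(void)
    {
        char name[64];
        unsigned nid = 0;

        make_qualified_name(name, sizeof(name), "clusterA", 42);
        if (extract_nid(name, &nid) == 0) {
            printf("%s -> NID %u\n", name, nid);
        }
        return 0;
    }
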
>>>
>>> The only thing that "this change" really does is identify the
>>> cluster under which the ALPS allocation is made. If you are
>>> addressing a node in another cluster (e.g., via accept/connect),
>>> the clustername/NID pair is unique for ALPS, just as a hostname on
>>> a normal cluster node is unique between clusters. If you do a
>>> gethostname() on a normal cluster node, you are going to get
>>> mynameNNNNN, or something similar. If you do a gethostname() on an
>>> ALPS node, you are going to get nidNNNNN; there is no
>>> differentiation between cluster A and cluster B.
>>>
>>> Perhaps my earlier comment was not accurate. In reality, this
>>> change provides the same degree of identification for ALPS nodes as
>>> the hostname provides for normal clusters. From your perspective,
>>> it is immaterial that it also would allow us to support our limited
>>> form of multi-cluster operation. However, in and of itself, it only
>>> provides the same level of identification as is already done for
>>> other cluster nodes.
>>> --
>>> Ken
>>>
>>>
>>> -----Original Message-----
>>> From: Ralph Castain [mailto:rhc_at_[hidden]]
>>> Sent: Monday, September 22, 2008 2:33 PM
>>> To: Open MPI Developers
>>> Cc: Matney Sr, Kenneth D.
>>> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
>>>
>>> The issue isn't with adding a string. The question is whether or not
>>> OMPI is to support one job running across multiple clusters. We
>>> made a
>>> conscious decision (after lengthy discussions on OMPI core and ORTE
>>> mailing lists, plus several telecons) to not do so - we require that
>>> the job execute on a single cluster, while allowing connect/accept
>>> to
>>> occur between jobs on different clusters.
>>>
>>> It is difficult to understand why we need a string (or our old "cell
>>> id") to tell us which cluster we are on if we are only following
>>> that
>>> operating model. From the commit comment, and from what I know of
>>> the
>>> system, the only rationale for adding such a designator is to shift
>>> back to the one-mpirun-spanning-multiple-cluster model.
>>>
>>> If we are now going to make that change, then it merits a similar
>>> level of consideration as the last decision to move away from that
>>> model. Making that move involves considerably more than just adding
>>> a cluster id string. That may be all you intend now, but the next step is
>>> inevitably to bring back remote launch, killing jobs on all clusters
>>> when one cluster has a problem, etc.
>>>
>>> Before we go down this path and re-open Pandora's box, we should at
>>> least agree that is what we intend to do...or agree on what hard
>>> constraints we will place on multi-cluster operations. Frankly, I'm
>>> tired of bouncing back-and-forth on even the most basic design
>>> decisions.
>>>
>>> Ralph
>>>
>>>
>>>
>>> On Sep 22, 2008, at 11:55 AM, Richard Graham wrote:
>>>
>>>> What Ken put in is what is needed for the limited multi-cluster
>>>> capabilities
>>>> we need, just one additional string. I don't think there is a need
>>>> for any
>>>> discussion of such a small change.
>>>>
>>>> Rich
>>>>
>>>>
>>>> On 9/22/08 1:32 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>>
>>>>> We really should discuss that as a group first - there is quite
>>>>> a bit
>>>>> of code required to actually support multi-clusters that has been
>>>>> removed.
>>>>>
>>>>> Our operational model that was agreed to quite a while ago is that
>>>>> mpirun can -only- extend over a single "cell". You can connect/
>>>>> accept
>>>>> multiple mpiruns that are sitting on different cells, but you
>>>>> cannot
>>>>> execute a single mpirun across multiple cells.
>>>>>
>>>>> Please keep this on your own development branch for now.
>>>>> Bringing it
>>>>> into the trunk will require discussion as this changes the
>>>>> operating
>>>>> model, and has significant code consequences when we look at
>>>>> abnormal
>>>>> terminations, comm_spawn, etc.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> On Sep 22, 2008, at 11:26 AM, Richard Graham wrote:
>>>>>
>>>>>> This check-in was in error - I had not realized that the checkout
>>>>>> was from the 1.3 branch, so we will fix this and put these changes
>>>>>> into the trunk (1.4). We are going to bring in some limited
>>>>>> multi-cluster support - limited is the operative word.
>>>>>>
>>>>>> Rich
>>>>>>
>>>>>>
>>>>>> On 9/22/08 12:50 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>>
>>>>>>> I notice that Ken Matney (the committer) is not on the devel
>>>>>>> list; I
>>>>>>> added him explicitly to the CC line.
>>>>>>>
>>>>>>> Ken: please see below.
>>>>>>>
>>>>>>>
>>>>>>> On Sep 22, 2008, at 12:46 PM, Ralph Castain wrote:
>>>>>>>
>>>>>>>> Whoa! We made a decision NOT to support multi-cluster apps in
>>>>>>>> OMPI
>>>>>>>> over a year ago!
>>>>>>>>
>>>>>>>> Please remove this from 1.3 - we should discuss if/when this
>>>>>>>> would
>>>>>>>> even be allowed in the trunk.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>> On Sep 22, 2008, at 10:35 AM, matney_at_[hidden] wrote:
>>>>>>>>
>>>>>>>>> Author: matney
>>>>>>>>> Date: 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>>> New Revision: 19600
>>>>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/19600
>>>>>>>>>
>>>>>>>>> Log:
>>>>>>>>> Added member to orte_node_t to enable multi-cluster jobs in
>>>>>>>>> ALPS
>>>>>>>>> scheduled systems (like Cray XT).
>>>>>>>>>
>>>>>>>>> Text files modified:
>>>>>>>>> branches/v1.3/orte/runtime/orte_globals.h | 4 ++++
>>>>>>>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>>>>>>>
>>>>>>>>> Modified: branches/v1.3/orte/runtime/orte_globals.h
>>>>>>>>> ==============================================================================
>>>>>>>>> --- branches/v1.3/orte/runtime/orte_globals.h (original)
>>>>>>>>> +++ branches/v1.3/orte/runtime/orte_globals.h 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>>> @@ -222,6 +222,10 @@
>>>>>>>>> /** Username on this node, if specified */
>>>>>>>>> char *username;
>>>>>>>>> char *slot_list;
>>>>>>>>> +    /** Clustername (machine name of cluster) on which this node
>>>>>>>>> +        resides. ALPS scheduled systems need this to enable
>>>>>>>>> +        multi-cluster support. */
>>>>>>>>> +    char *clustername;
>>>>>>>>> } orte_node_t;
>>>>>>>>> ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_node_t);
>>>>>>>>>

-- 
Jeff Squyres
Cisco Systems