Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
From: Terry Dontje (Terry.Dontje_at_[hidden])
Date: 2008-09-23 06:24:47

Jeff Squyres wrote:
> I think the point is that as a group, we consciously, deliberately,
> and painfully decided not to support multi-cluster. And as a result,
> we ripped out a lot of supporting code. Starting down this path again
> will likely result in a) re-opening all the discussions, b) re-adding
> a lot of code (or code effectively similar to what was there before).
> Let's not forget that there were many unsolved problems surrounding
> multi-cluster last time, too.
> It was also pointed out in Ralph's mails that, at least from the
> descriptions provided, adding the field in orte_node_t does not
> actually solve the problem that ORNL is trying to solve.
> If we, as a group, decide to re-add all this stuff, then a) recognize
> that we are flip-flopping *again* on this issue, and b) it will take a
> lot of coding effort to do so. I do think that since this was a group
> decision last time, it should be a group decision this time, too. If
> this does turn out to be as large of a sub-project as described, I
> would be opposed to the development occurring on the trunk; hg trees
> are perfect for this kind of stuff.
> I personally have no customers who are doing cross-cluster kinds of
> things, so I don't personally care if cross-cluster functionality
> works its way [back] in. But I recognize that OMPI core members are
> investigating it. So the points I'm making are procedural; I have no
> real dog in this fight...
I agree with Jeff that this is perfect for an hg tree. I also don't
have a dog in this fight, but I do have a cat that would rather stay
comfortably sleeping and not have someone step on its tail :-). In
other words, knock yourself out, but please don't destabilize the
trunk. Of course, that begs the question: what happens when the hg
tree is done and working?


> On Sep 22, 2008, at 4:40 PM, George Bosilca wrote:
>> Ralph,
>> There is NO need to have this discussion again, it was painful enough
>> last time. From my perspective, I do not understand why you are making
>> so much noise on this one. How a 4-line change in some ALPS-specific
>> files (a Cray system very specific to ORNL) can generate more than 3
>> A4 pages of emails is still beyond me.
>> If they want to do multi-cluster, and they do not break anything in
>> ORTE/OMPI, and they do not ask other people to do it for them, why
>> try to stop them?
>> george.
>> On Sep 22, 2008, at 3:59 PM, Ralph Castain wrote:
>>> There was a very long drawn-out discussion about this early in 2007.
>>> Rather than rehash all that, I'll try to summarize it here. It may
>>> get confusing - it helped a whole lot to be in a room with a
>>> whiteboard. There were also presentations on the subject - I believe
>>> the slides may still be in the docs repository.
>>> Because terminology quickly gets confusing, we adopted a slightly
>>> different one for these discussions. We talk about OMPI being a
>>> "single cell" system - i.e., jobs executed via mpirun can only span
>>> nodes that are reachable by that mpirun. In a typical managed
>>> environment, a cell aligns quite well with a "cluster". In an
>>> unmanaged environment where the user provides a hostfile, the cell
>>> will contain all nodes specified in the hostfile.
>>> We don't filter or abort for non-matching hostnames - if mpirun can
>>> launch on that node, then great. What we don't support is asking
>>> mpirun to remotely execute another mpirun on the frontend of another
>>> cell in order to launch procs on the nodes in -that- cell, nor do we
>>> ask mpirun to in any way manage (or even know about) any procs
>>> running on a remote cell.
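
[To make the "cell" notion concrete: in an unmanaged environment the
hostfile itself delimits the cell. A minimal sketch - the hostnames are
hypothetical:

```text
# my_hosts: every node listed here belongs to one cell;
# mpirun may launch on any of these nodes, but on nothing else
node001 slots=4
node002 slots=4
node003 slots=4
```

`mpirun --hostfile my_hosts -np 8 ./a.out` then launches only within
this cell; asking that mpirun to reach into a second cluster is exactly
what is not supported.]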
>>> I see what you are saying about the ALPS node name. However, the
>>> field you want to add doesn't have anything to do with
>>> accept/connect. The orte_node_t object is used solely by mpirun to
>>> keep track of the node pool it controls - i.e., the nodes upon which
>>> it is launching jobs. Thus, the mpirun on cluster A will have
>>> "nidNNNN" entries it got from its allocation, and the mpirun on
>>> cluster B will have "nidNNNN" entries it got from its allocation -
>>> but the two mpiruns will never exchange that information, nor will
>>> the mpirun on cluster A ever have a need to know the node entries
>>> for cluster B. Each mpirun launches and manages procs -only- on the
>>> nodes in its own allocation.
>>> I agree you will have issues when doing the connect/accept modex as
>>> the nodenames are exchanged and are no longer unique in your
>>> scenario. However, that info stays in the ompi_proc_t - it never
>>> gets communicated to the ORTE layer as we couldn't care less down
>>> there about the remote procs since they are under the control of a
>>> different mpirun. So if you need to add a cluster id field for this
>>> purpose, it needs to go in ompi_proc_t - not in the orte structures.
>>> And for that, you probably need to discuss it with the MPI team as
>>> changes to ompi_proc_t will likely generate considerable discussion.
>>> FWIW: this is one reason I warned Galen about the problems in
>>> reviving multi-cluster operations again. We used to deal with
>>> multi-cells in the process name itself, but all that support has
>>> been removed from OMPI.
>>> Hope that helps
>>> Ralph
>>> On Sep 22, 2008, at 1:39 PM, Matney Sr, Kenneth D. wrote:
>>>> I may be opening a can of worms...
>>>> But what prevents a user from running across clusters in a "normal
>>>> OMPI", i.e., non-ALPS, environment? When he puts hosts into his
>>>> hostfile, does it parse and abort/filter non-matching hostnames? The
>>>> problem for ALPS-based systems is that nodes are addressed via NID,PID
>>>> pairs at the portals level. Thus, these are unique only within a
>>>> cluster. In point of fact, I could rewrite all of the ALPS support to
>>>> identify the nodes by "cluster_id".NID. It would be a bit inefficient
>>>> within a cluster because we would have to extract the NID from this
>>>> syntax as we go down to the portals layer. It also would lead to a
>>>> larger degree of change within the OMPI ALPS code base. However, I can
>>>> give ALPS-based systems the same feature set as the rest of the world.
>>>> It just is more efficient to use an additional pointer in the
>>>> orte_node_t structure, and it results in a far simpler code structure.
>>>> This makes it easier to maintain.
>>>> The only thing that "this change" really does is identify the cluster
>>>> under which the ALPS allocation is made. If you are addressing a node
>>>> in another cluster (e.g., via accept/connect), the clustername/NID
>>>> pair is unique for ALPS, just as a hostname on a normal cluster node
>>>> is unique between clusters. If you do a gethostname() on a normal
>>>> cluster node, you are going to get mynameNNNNN, or something similar.
>>>> If you do a gethostname() on an ALPS node, you are going to get
>>>> nidNNNNN; there is no differentiation between cluster A and cluster B.
>>>> Perhaps my earlier comment was not accurate. In reality, it provides
>>>> the same degree of identification for ALPS nodes as hostname provides
>>>> for normal clusters. From your perspective, it is immaterial that it
>>>> also would allow us to support our limited form of multi-cluster
>>>> support. However, in and of itself, it only provides the same level
>>>> of identification as is done for other cluster nodes.
>>>> --
>>>> Ken
>>>> -----Original Message-----
>>>> From: Ralph Castain [mailto:rhc_at_[hidden]]
>>>> Sent: Monday, September 22, 2008 2:33 PM
>>>> To: Open MPI Developers
>>>> Cc: Matney Sr, Kenneth D.
>>>> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
>>>> The issue isn't with adding a string. The question is whether or not
>>>> OMPI is to support one job running across multiple clusters. We made a
>>>> conscious decision (after lengthy discussions on OMPI core and ORTE
>>>> mailing lists, plus several telecons) to not do so - we require that
>>>> the job execute on a single cluster, while allowing connect/accept to
>>>> occur between jobs on different clusters.
>>>> It is difficult to understand why we need a string (or our old "cell
>>>> id") to tell us which cluster we are on if we are only following that
>>>> operating model. From the commit comment, and from what I know of the
>>>> system, the only rationale for adding such a designator is to shift
>>>> back to the one-mpirun-spanning-multiple-cluster model.
>>>> If we are now going to make that change, then it merits a similar
>>>> level of consideration as the last decision to move away from that
>>>> model. Making that move involves considerably more than just adding a
>>>> cluster id string. You may think that now, but the next step is
>>>> inevitably to bring back remote launch, killing jobs on all clusters
>>>> when one cluster has a problem, etc.
>>>> Before we go down this path and re-open Pandora's box, we should at
>>>> least agree that is what we intend to do...or agree on what hard
>>>> constraints we will place on multi-cluster operations. Frankly, I'm
>>>> tired of bouncing back-and-forth on even the most basic design
>>>> decisions.
>>>> Ralph
>>>> On Sep 22, 2008, at 11:55 AM, Richard Graham wrote:
>>>>> What Ken put in is what is needed for the limited multi-cluster
>>>>> capabilities we need, just one additional string. I don't think
>>>>> there is a need for any discussion of such a small change.
>>>>> Rich
>>>>> On 9/22/08 1:32 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>>>> We really should discuss that as a group first - there is quite a
>>>>>> bit of code required to actually support multi-clusters that has
>>>>>> been removed.
>>>>>> Our operational model that was agreed to quite a while ago is that
>>>>>> mpirun can -only- extend over a single "cell". You can
>>>>>> connect/accept multiple mpiruns that are sitting on different
>>>>>> cells, but you cannot execute a single mpirun across multiple cells.
>>>>>> Please keep this on your own development branch for now. Bringing it
>>>>>> into the trunk will require discussion as this changes the operating
>>>>>> model, and has significant code consequences when we look at
>>>>>> abnormal terminations, comm_spawn, etc.
>>>>>> Thanks
>>>>>> Ralph
>>>>>> On Sep 22, 2008, at 11:26 AM, Richard Graham wrote:
>>>>>>> This check-in was in error - I had not realized that the checkout
>>>>>>> was from the 1.3 branch, so we will fix this and put these into
>>>>>>> the trunk (1.4). We are going to bring in some limited
>>>>>>> multi-cluster support - limited is the operative word.
>>>>>>> Rich
>>>>>>> On 9/22/08 12:50 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>>>> I notice that Ken Matney (the committer) is not on the devel
>>>>>>>> list; I added him explicitly to the CC line.
>>>>>>>> Ken: please see below.
>>>>>>>> On Sep 22, 2008, at 12:46 PM, Ralph Castain wrote:
>>>>>>>>> Whoa! We made a decision NOT to support multi-cluster apps in
>>>>>>>>> OMPI over a year ago!
>>>>>>>>> Please remove this from 1.3 - we should discuss if/when this
>>>>>>>>> would even be allowed in the trunk.
>>>>>>>>> Thanks
>>>>>>>>> Ralph
>>>>>>>>> On Sep 22, 2008, at 10:35 AM, matney_at_[hidden] wrote:
>>>>>>>>>> Author: matney
>>>>>>>>>> Date: 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>>>> New Revision: 19600
>>>>>>>>>> URL:
>>>>>>>>>> Log:
>>>>>>>>>> Added member to orte_node_t to enable multi-cluster jobs in ALPS
>>>>>>>>>> scheduled systems (like Cray XT).
>>>>>>>>>> Text files modified:
>>>>>>>>>> branches/v1.3/orte/runtime/orte_globals.h | 4 ++++
>>>>>>>>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>>>>>>>> Modified: branches/v1.3/orte/runtime/orte_globals.h
>>>>>>>>>> =================================================================
>>>>>>>>>> --- branches/v1.3/orte/runtime/orte_globals.h (original)
>>>>>>>>>> +++ branches/v1.3/orte/runtime/orte_globals.h 2008-09-22
>>>>>>>>>> 12:35:54
>>>>>>>>>> EDT (Mon, 22 Sep 2008)
>>>>>>>>>> @@ -222,6 +222,10 @@
>>>>>>>>>> /** Username on this node, if specified */
>>>>>>>>>> char *username;
>>>>>>>>>> char *slot_list;
>>>>>>>>>> + /** Clustername (machine name of cluster) on which this
>>>>>>>>>> node
>>>>>>>>>> + resides. ALPS scheduled systems need this to enable
>>>>>>>>>> + multi-cluster support. */
>>>>>>>>>> + char *clustername;
>>>>>>>>>> } orte_node_t;
>>>>>>>>>> _______________________________________________
>>>>>>>>>> svn mailing list
>>>>>>>>>> svn_at_[hidden]
>>>>>>>>> _______________________________________________
>>>>>>>>> devel mailing list
>>>>>>>>> devel_at_[hidden]