Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
From: Richard Graham (rlgraham_at_[hidden])
Date: 2008-09-23 10:12:20


Let me make the point that adding a data structure is much less
destabilizing to the tree than the routine day-to-day changes that go on
in it.

Rich

On 9/23/08 6:24 AM, "Terry D. Dontje" <Terry.Dontje_at_[hidden]> wrote:

> Jeff Squyres wrote:
>> > I think the point is that as a group, we consciously, deliberately,
>> > and painfully decided not to support multi-cluster. And as a result,
>> > we ripped out a lot of supporting code. Starting down this path again
>> > will likely result in a) re-opening all the discussions, b) re-adding
>> > a lot of code (or code effectively similar to what was there before).
>> > Let's not forget that there were many unsolved problems surrounding
>> > multi-cluster last time, too.
>> >
>> > It was also pointed out in Ralph's mails that, at least from the
>> > descriptions provided, adding the field in orte_node_t does not
>> > actually solve the problem that ORNL is trying to solve.
>> >
>> > If we, as a group, decide to re-add all this stuff, then a) recognize
>> > that we are flip-flopping *again* on this issue, and b) it will take a
>> > lot of coding effort to do so. I do think that since this was a group
>> > decision last time, it should be a group decision this time, too. If
>> > this does turn out to be as large a sub-project as described, I
>> > would be opposed to the development occurring on the trunk; hg trees
>> > are perfect for this kind of stuff.
>> >
>> > I personally have no customers who are doing cross-cluster kinds of
>> > things, so I don't personally care if cross-cluster functionality
>> > works its way [back] in. But I recognize that OMPI core members are
>> > investigating it. So the points I'm making are procedural; I have no
>> > real dog in this fight...
>> >
>> >
> I agree with Jeff that this is perfect for an hg tree. I also don't have
> a dog in this fight, but I have a cat that would rather stay comfortably
> sleeping and not have someone step on its tail :-). In other words, knock
> yourself out, but please don't destabilize the trunk. Of course, that
> raises the question: what happens when the hg tree is done and working?
>
> --td
>
>> > On Sep 22, 2008, at 4:40 PM, George Bosilca wrote:
>> >
>>> >> Ralph,
>>> >>
>>> >> There is NO need to have this discussion again; it was painful enough
>>> >> last time. From my perspective, I do not understand why you are making
>>> >> so much noise on this one. How a 4-line change in some ALPS-specific
>>> >> files (a Cray system very specific to ORNL) can generate more than 3 A4
>>> >> pages of emails is still beyond my comprehension.
>>> >>
>>> >> If they want to do multi-cluster, and they do not break anything in
>>> >> ORTE/OMPI, and they do not ask other people to do it for them, why
>>> >> try to stop them?
>>> >>
>>> >> george.
>>> >>
>>> >> On Sep 22, 2008, at 3:59 PM, Ralph Castain wrote:
>>> >>
>>>> >>> There was a very long drawn-out discussion about this early in 2007.
>>>> >>> Rather than rehash all that, I'll try to summarize it here. It may
>>>> >>> get confusing - it helped a whole lot to be in a room with a
>>>> >>> whiteboard. There were also presentations on the subject - I believe
>>>> >>> the slides may still be in the docs repository.
>>>> >>>
>>>> >>> Because terminology quickly gets confusing, we adopted a slightly
>>>> >>> different one for these discussions. We talk about OMPI being a
>>>> >>> "single cell" system - i.e., jobs executed via mpirun can only span
>>>> >>> nodes that are reachable by that mpirun. In a typical managed
>>>> >>> environment, a cell aligns quite well with a "cluster". In an
>>>> >>> unmanaged environment where the user provides a hostfile, the cell
>>>> >>> will contain all nodes specified in the hostfile.
>>>> >>>
>>>> >>> We don't filter or abort for non-matching hostnames - if mpirun can
>>>> >>> launch on that node, then great. What we don't support is asking
>>>> >>> mpirun to remotely execute another mpirun on the frontend of another
>>>> >>> cell in order to launch procs on the nodes in -that- cell, nor do we
>>>> >>> ask mpirun to in any way manage (or even know about) any procs
>>>> >>> running on a remote cell.
>>>> >>>
>>>> >>> I see what you are saying about the ALPS node name. However, the
>>>> >>> field you want to add doesn't have anything to do with
>>>> >>> accept/connect. The orte_node_t object is used solely by mpirun to
>>>> >>> keep track of the node pool it controls - i.e., the nodes upon which
>>>> >>> it is launching jobs. Thus, the mpirun on cluster A will have
>>>> >>> "nidNNNN" entries it got from its allocation, and the mpirun on
>>>> >>> cluster B will have "nidNNNN" entries it got from its allocation -
>>>> >>> but the two mpiruns will never exchange that information, nor will
>>>> >>> the mpirun on cluster A ever have a need to know the node entries
>>>> >>> for cluster B. Each mpirun launches and manages procs -only- on the
>>>> >>> nodes in its own allocation.
>>>> >>>
>>>> >>> I agree you will have issues when doing the connect/accept modex as
>>>> >>> the nodenames are exchanged and are no longer unique in your
>>>> >>> scenario. However, that info stays in the ompi_proc_t - it never
>>>> >>> gets communicated to the ORTE layer as we couldn't care less down
>>>> >>> there about the remote procs since they are under the control of a
>>>> >>> different mpirun. So if you need to add a cluster id field for this
>>>> >>> purpose, it needs to go in ompi_proc_t - not in the orte structures.
>>>> >>>
>>>> >>> And for that, you probably need to discuss it with the MPI team as
>>>> >>> changes to ompi_proc_t will likely generate considerable discussion.
>>>> >>>
>>>> >>> FWIW: this is one reason I warned Galen about the problems in
>>>> >>> reviving multi-cluster operations again. We used to deal with
>>>> >>> multi-cells in the process name itself, but all that support has
>>>> >>> been removed from OMPI.
>>>> >>>
>>>> >>> Hope that helps
>>>> >>> Ralph
>>>> >>>
>>>> >>> On Sep 22, 2008, at 1:39 PM, Matney Sr, Kenneth D. wrote:
>>>> >>>
>>>>> >>>> I may be opening a can of worms...
>>>>> >>>>
>>>>> >>>> But, what prevents a user from running across clusters in a "normal
>>>>> >>>> OMPI", i.e., non-ALPS environment? When he puts hosts into his
>>>>> >>>> hostfile, does it parse and abort/filter non-matching hostnames? The
>>>>> >>>> problem for ALPS based systems is that nodes are addressed via NID,PID
>>>>> >>>> pairs at the portals level. Thus, these are unique only within a
>>>>> >>>> cluster. In point of fact, I could rewrite all of the ALPS support to
>>>>> >>>> identify the nodes by "cluster_id".NID. It would be a bit inefficient
>>>>> >>>> within a cluster because we would have to extract the NID from this
>>>>> >>>> syntax as we go down to the portals layer. It also would lead to a
>>>>> >>>> larger degree of change within the OMPI ALPS code base. However, I can
>>>>> >>>> give ALPS-based systems the same feature set as the rest of the world.
>>>>> >>>> It just is more efficient to use an additional pointer in the
>>>>> >>>> orte_node_t structure, and it results in a far simpler code structure.
>>>>> >>>> This makes it easier to maintain.
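
To make the cost of the "cluster_id".NID naming concrete, the following is a minimal C sketch of the extraction step Ken describes: splitting a combined name back into a cluster name and a numeric NID before handing the NID to a lower layer. The function name `split_cluster_nid`, the dot-separated layout, and the sample name `clusterA.00042` are assumptions made for illustration; none of them come from Open MPI, ALPS, or Portals.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI or ALPS API): split a name of the
 * assumed form "<cluster_id>.<NID>" into its two parts.
 * Returns 0 on success and -1 if the name is malformed or the cluster
 * buffer is too small. */
int split_cluster_nid(const char *name, char *cluster, size_t cluster_len,
                      long *nid)
{
    const char *dot;
    char *end;
    size_t len;
    long value;

    if (name == NULL || cluster == NULL || nid == NULL) {
        return -1;
    }

    /* Use the last dot so that a cluster id may itself contain dots. */
    dot = strrchr(name, '.');
    if (dot == NULL || dot == name) {
        return -1;
    }

    /* The NID must start with a digit; this also rejects an empty NID. */
    if (!isdigit((unsigned char)dot[1])) {
        return -1;
    }

    len = (size_t)(dot - name);
    assert(len > 0); /* guaranteed by the dot != name check above */
    if (len >= cluster_len) {
        return -1;
    }

    errno = 0;
    value = strtol(dot + 1, &end, 10);
    if (errno != 0 || *end != '\0') {
        return -1;
    }

    memcpy(cluster, name, len);
    cluster[len] = '\0';
    *nid = value;
    return 0;
}
```

Calling `split_cluster_nid("clusterA.00042", buf, sizeof buf, &nid)` would fill `buf` with `clusterA` and set `nid` to 42. Every lookup that needs the bare NID pays for this parse, which is the inefficiency being weighed against storing the cluster name in a separate field.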
>>>>> >>>>
>>>>> >>>> The only thing that "this change" really does is to identify the
>>>>> >>>> cluster under which the ALPS allocation is made. If you are addressing
>>>>> >>>> a node in another cluster (e.g., via accept/connect), the
>>>>> >>>> clustername/NID pair is unique for ALPS, just as a hostname on a
>>>>> >>>> cluster node is unique between clusters. If you do a gethostname() on
>>>>> >>>> a normal cluster node, you are going to get mynameNNNNN, or something
>>>>> >>>> similar. If you do a gethostname() on an ALPS node, you are going to
>>>>> >>>> get nidNNNNN; there is no differentiation between cluster A and
>>>>> >>>> cluster B.
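
The identification problem described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `struct node_id` and `same_node` are hypothetical names, not Open MPI structures or functions, and the host and cluster names are made up. The point is that two ALPS nodes on different clusters can report the same hostname, so a comparison on hostname alone cannot tell them apart, while a comparison on the (clustername, hostname) pair can.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical record for illustration only; this is not orte_node_t or
 * ompi_proc_t. */
struct node_id {
    const char *clustername; /* may be NULL when no cluster name is known */
    const char *hostname;    /* e.g. the string reported by gethostname() */
};

/* NULL-safe string equality: two NULLs compare equal. */
static bool str_eq(const char *a, const char *b)
{
    if (a == NULL || b == NULL) {
        return a == b;
    }
    return strcmp(a, b) == 0;
}

/* Two records name the same node only if both the cluster name and the
 * hostname match. */
bool same_node(const struct node_id *a, const struct node_id *b)
{
    assert(a != NULL && b != NULL);
    return str_eq(a->clustername, b->clustername) &&
           str_eq(a->hostname, b->hostname);
}
```

With records such as `{"clusterA", "nid00012"}` and `{"clusterB", "nid00012"}`, the hostnames are identical but `same_node` reports a mismatch. Where such a field should live (the ORTE node structure or the MPI-layer proc structure) is exactly what the thread is debating, and this sketch takes no position on it.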
>>>>> >>>>
>>>>> >>>> Perhaps, my earlier comment was not accurate. In reality, it provides
>>>>> >>>> the same degree of identification for ALPS nodes as hostname provides
>>>>> >>>> for normal clusters. From your perspective, it is immaterial that it
>>>>> >>>> also would allow us to support our limited form of multi-cluster
>>>>> >>>> support. However, in and of itself, it only provides the same level of
>>>>> >>>> identification as is done for other cluster nodes.
>>>>> >>>> --
>>>>> >>>> Ken
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> -----Original Message-----
>>>>> >>>> From: Ralph Castain [mailto:rhc_at_[hidden]]
>>>>> >>>> Sent: Monday, September 22, 2008 2:33 PM
>>>>> >>>> To: Open MPI Developers
>>>>> >>>> Cc: Matney Sr, Kenneth D.
>>>>> >>>> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
>>>>> >>>>
>>>>> >>>> The issue isn't with adding a string. The question is whether or not
>>>>> >>>> OMPI is to support one job running across multiple clusters. We made a
>>>>> >>>> conscious decision (after lengthy discussions on OMPI core and ORTE
>>>>> >>>> mailing lists, plus several telecons) to not do so - we require that
>>>>> >>>> the job execute on a single cluster, while allowing connect/accept to
>>>>> >>>> occur between jobs on different clusters.
>>>>> >>>>
>>>>> >>>> It is difficult to understand why we need a string (or our old "cell
>>>>> >>>> id") to tell us which cluster we are on if we are only following that
>>>>> >>>> operating model. From the commit comment, and from what I know of the
>>>>> >>>> system, the only rationale for adding such a designator is to shift
>>>>> >>>> back to the one-mpirun-spanning-multiple-cluster model.
>>>>> >>>>
>>>>> >>>> If we are now going to make that change, then it merits a similar
>>>>> >>>> level of consideration as the last decision to move away from that
>>>>> >>>> model. Making that move involves considerably more than just adding a
>>>>> >>>> cluster id string. You may think that now, but the next step is
>>>>> >>>> inevitably to bring back remote launch, killing jobs on all clusters
>>>>> >>>> when one cluster has a problem, etc.
>>>>> >>>>
>>>>> >>>> Before we go down this path and re-open Pandora's box, we should at
>>>>> >>>> least agree that is what we intend to do...or agree on what hard
>>>>> >>>> constraints we will place on multi-cluster operations. Frankly, I'm
>>>>> >>>> tired of bouncing back-and-forth on even the most basic design
>>>>> >>>> decisions.
>>>>> >>>>
>>>>> >>>> Ralph
>>>>> >>>>
>>>>> >>>>
>>>>> >>>>
>>>>> >>>> On Sep 22, 2008, at 11:55 AM, Richard Graham wrote:
>>>>> >>>>
>>>>>> >>>>> What Ken put in is what is needed for the limited multi-cluster
>>>>>> >>>>> capabilities we need, just one additional string. I don't think
>>>>>> >>>>> there is a need for any discussion of such a small change.
>>>>>> >>>>>
>>>>>> >>>>> Rich
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>> On 9/22/08 1:32 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>>>> >>>>>
>>>>>>> >>>>>> We really should discuss that as a group first - there is quite a bit
>>>>>>> >>>>>> of code required to actually support multi-clusters that has been
>>>>>>> >>>>>> removed.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Our operational model that was agreed to quite a while ago is that
>>>>>>> >>>>>> mpirun can -only- extend over a single "cell". You can connect/accept
>>>>>>> >>>>>> multiple mpiruns that are sitting on different cells, but you cannot
>>>>>>> >>>>>> execute a single mpirun across multiple cells.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Please keep this on your own development branch for now. Bringing it
>>>>>>> >>>>>> into the trunk will require discussion as this changes the operating
>>>>>>> >>>>>> model, and has significant code consequences when we look at abnormal
>>>>>>> >>>>>> terminations, comm_spawn, etc.
>>>>>>> >>>>>>
>>>>>>> >>>>>> Thanks
>>>>>>> >>>>>> Ralph
>>>>>>> >>>>>>
>>>>>>> >>>>>> On Sep 22, 2008, at 11:26 AM, Richard Graham wrote:
>>>>>>> >>>>>>
>>>>>>>> >>>>>>> This check in was in error - I had not realized that the checkout
>>>>>>>> >>>>>>> was from the 1.3 branch, so we will fix this, and put these into the
>>>>>>>> >>>>>>> trunk (1.4). We are going to bring in some limited multi-cluster
>>>>>>>> >>>>>>> support - limited is the operative word.
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> Rich
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>>
>>>>>>>> >>>>>>> On 9/22/08 12:50 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>>>> >>>>>>>
>>>>>>>>> >>>>>>>> I notice that Ken Matney (the committer) is not on the devel list; I
>>>>>>>>> >>>>>>>> added him explicitly to the CC line.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> Ken: please see below.
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>>
>>>>>>>>> >>>>>>>> On Sep 22, 2008, at 12:46 PM, Ralph Castain wrote:
>>>>>>>>> >>>>>>>>
>>>>>>>>>> >>>>>>>>> Whoa! We made a decision NOT to support multi-cluster apps in OMPI
>>>>>>>>>> >>>>>>>>> over a year ago!
>>>>>>>>>> >>>>>>>>>
>>>>>>>>>> >>>>>>>>> Please remove this from 1.3 - we should discuss if/when this would
>>>>>>>>>> >>>>>>>>> even be allowed in the trunk.
>>>>>>>>>> >>>>>>>>>
>>>>>>>>>> >>>>>>>>> Thanks
>>>>>>>>>> >>>>>>>>> Ralph
>>>>>>>>>> >>>>>>>>>
>>>>>>>>>> >>>>>>>>> On Sep 22, 2008, at 10:35 AM, matney_at_[hidden] wrote:
>>>>>>>>>> >>>>>>>>>
>>>>>>>>>>> >>>>>>>>>> Author: matney
>>>>>>>>>>> >>>>>>>>>> Date: 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>>>>> >>>>>>>>>> New Revision: 19600
>>>>>>>>>>> >>>>>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/19600
>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>> >>>>>>>>>> Log:
>>>>>>>>>>> >>>>>>>>>> Added member to orte_node_t to enable multi-cluster jobs in ALPS
>>>>>>>>>>> >>>>>>>>>> scheduled systems (like Cray XT).
>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>> >>>>>>>>>> Text files modified:
>>>>>>>>>>> >>>>>>>>>> branches/v1.3/orte/runtime/orte_globals.h | 4 ++++
>>>>>>>>>>> >>>>>>>>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>> >>>>>>>>>> Modified: branches/v1.3/orte/runtime/orte_globals.h
>>>>>>>>>>> >>>>>>>>>> ==============================================================================
>>>>>>>>>>> >>>>>>>>>> --- branches/v1.3/orte/runtime/orte_globals.h (original)
>>>>>>>>>>> >>>>>>>>>> +++ branches/v1.3/orte/runtime/orte_globals.h 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>>>>> >>>>>>>>>> @@ -222,6 +222,10 @@
>>>>>>>>>>> >>>>>>>>>> /** Username on this node, if specified */
>>>>>>>>>>> >>>>>>>>>> char *username;
>>>>>>>>>>> >>>>>>>>>> char *slot_list;
>>>>>>>>>>> >>>>>>>>>> + /** Clustername (machine name of cluster) on which this node
>>>>>>>>>>> >>>>>>>>>> + resides. ALPS scheduled systems need this to enable
>>>>>>>>>>> >>>>>>>>>> + multi-cluster support. */
>>>>>>>>>>> >>>>>>>>>> + char *clustername;
>>>>>>>>>>> >>>>>>>>>> } orte_node_t;
>>>>>>>>>>> >>>>>>>>>> ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_node_t);
>>>>>>>>>>> >>>>>>>>>>
>>>>>>>>>>> >>>>>>>>>> _______________________________________________
>>>>>>>>>>> >>>>>>>>>> svn mailing list
>>>>>>>>>>> >>>>>>>>>> svn_at_[hidden]
>>>>>>>>>>> >>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/svn
>>>>>>>>>> >>>>>>>>>
>>>>>>>>>> >>>>>>>>> _______________________________________________
>>>>>>>>>> >>>>>>>>> devel mailing list
>>>>>>>>>> >>>>>>>>> devel_at_[hidden]
>>>>>>>>>> >>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel