Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
From: George Bosilca (bosilca_at_[hidden])
Date: 2008-09-22 16:40:21


Ralph,

There is NO need to have this discussion again; it was painful enough
last time. From my perspective, I do not understand why you are making
so much noise on this one. How a 4-line change in some ALPS-specific
files (for a Cray system very specific to ORNL) can generate more than
3 A4 pages of email is still beyond me.

If they want to do multi-cluster, and they do not break anything in
ORTE/OMPI, and they do not ask other people to do it for them, why try
to stop them?

   george.

On Sep 22, 2008, at 3:59 PM, Ralph Castain wrote:

> There was a very long drawn-out discussion about this early in 2007.
> Rather than rehash all that, I'll try to summarize it here. It may
> get confusing - it helped a whole lot to be in a room with a
> whiteboard. There were also presentations on the subject - I believe
> the slides may still be in the docs repository.
>
> Because terminology quickly gets confusing, we adopted a slightly
> different one for these discussions. We talk about OMPI being a
> "single cell" system - i.e., jobs executed via mpirun can only span
> nodes that are reachable by that mpirun. In a typical managed
> environment, a cell aligns quite well with a "cluster". In an
> unmanaged environment where the user provides a hostfile, the cell
> will contain all nodes specified in the hostfile.
>
> We don't filter or abort for non-matching hostnames - if mpirun can
> launch on that node, then great. What we don't support is asking
> mpirun to remotely execute another mpirun on the frontend of another
> cell in order to launch procs on the nodes in -that- cell, nor do we
> ask mpirun to in any way manage (or even know about) any procs
> running on a remote cell.
>
> I see what you are saying about the ALPS node name. However, the
> field you want to add doesn't have anything to do with accept/
> connect. The orte_node_t object is used solely by mpirun to keep
> track of the node pool it controls - i.e., the nodes upon which it
> is launching jobs. Thus, the mpirun on cluster A will have "nidNNNN"
> entries it got from its allocation, and the mpirun on cluster B will
> have "nidNNNN" entries it got from its allocation - but the two
> mpiruns will never exchange that information, nor will the mpirun on
> cluster A ever have a need to know the node entries for cluster B.
> Each mpirun launches and manages procs -only- on the nodes in its
> own allocation.
>
> I agree you will have issues when doing the connect/accept modex as
> the nodenames are exchanged and are no longer unique in your
> scenario. However, that info stays in the ompi_proc_t - it never
> gets communicated to the ORTE layer as we couldn't care less down
> there about the remote procs since they are under the control of a
> different mpirun. So if you need to add a cluster id field for this
> purpose, it needs to go in ompi_proc_t - not in the orte structures.
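>
> Purely as an illustration of what I mean (this is not the actual
> ompi_proc_t layout, and the field and function names here are made
> up), the idea would be to key any connect/accept comparison on the
> (cluster, hostname) pair rather than on the hostname alone:
>
>    /* Hypothetical sketch only; not the real ompi_proc_t definition. */
>    #include <stdio.h>
>    #include <string.h>
>
>    typedef struct {
>        char *proc_hostname;    /* e.g. "nid00123" - not unique across clusters */
>        char *proc_clustername; /* hypothetical addition, e.g. "clusterA"       */
>    } proc_info_sketch_t;
>
>    /* Two entries refer to the same physical node only if both the
>     * cluster name and the hostname match. */
>    static int same_node(const proc_info_sketch_t *a,
>                         const proc_info_sketch_t *b)
>    {
>        return 0 == strcmp(a->proc_clustername, b->proc_clustername) &&
>               0 == strcmp(a->proc_hostname, b->proc_hostname);
>    }
>
>    int main(void)
>    {
>        proc_info_sketch_t a = { "nid00123", "clusterA" };
>        proc_info_sketch_t b = { "nid00123", "clusterB" };
>        printf("%s\n", same_node(&a, &b) ? "same" : "different"); /* "different" */
>        return 0;
>    }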
>
> And for that, you probably need to discuss it with the MPI team as
> changes to ompi_proc_t will likely generate considerable discussion.
>
> FWIW: this is one reason I warned Galen about the problems in
> reviving multi-cluster operations again. We used to deal with multi-
> cells in the process name itself, but all that support has been
> removed from OMPI.
>
> Hope that helps
> Ralph
>
> On Sep 22, 2008, at 1:39 PM, Matney Sr, Kenneth D. wrote:
>
>> I may be opening a can of worms...
>>
>> But what prevents a user from running across clusters in a "normal
>> OMPI", i.e., non-ALPS, environment? When he puts hosts into his
>> hostfile, does it parse and abort/filter non-matching hostnames? The
>> problem for ALPS-based systems is that nodes are addressed via
>> NID,PID pairs at the Portals level; thus, these are unique only
>> within a cluster. In point of fact, I could rewrite all of the ALPS
>> support to identify the nodes by "cluster_id".NID. It would be a bit
>> inefficient within a cluster, because we would have to extract the
>> NID from this syntax as we go down to the Portals layer, and it would
>> lead to a larger degree of change within the OMPI ALPS code base.
>> However, I can give ALPS-based systems the same feature set as the
>> rest of the world. It is just more efficient to use an additional
>> pointer in the orte_node_t structure, and it results in a far simpler
>> code structure, which makes it easier to maintain.
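>>
>> To make the inefficiency concrete, here is a minimal sketch of that
>> extraction (illustrative only - this is not the actual OMPI ALPS
>> code, and the "clusterA" name is just an example):
>>
>>    #include <stdio.h>
>>    #include <stdlib.h>
>>    #include <string.h>
>>
>>    /* Given a node name of the form "cluster_id".NID, e.g.
>>     * "clusterA.123", recover the numeric NID that the Portals layer
>>     * needs. If no cluster prefix is present, the whole string is
>>     * treated as the NID. */
>>    static long nid_from_name(const char *name)
>>    {
>>        const char *dot = strrchr(name, '.');
>>        return strtol(dot ? dot + 1 : name, NULL, 10);
>>    }
>>
>>    int main(void)
>>    {
>>        printf("%ld\n", nid_from_name("clusterA.123"));  /* prints 123 */
>>        return 0;
>>    }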
>>
>> The only thing that "this change" really does is to identify the
>> cluster under which the ALPS allocation is made. If you are
>> addressing a node in another cluster (e.g., via accept/connect), the
>> clustername/NID pair is unique for ALPS, just as a hostname on a
>> normal cluster node is unique between clusters. If you do a
>> gethostname() on a normal cluster node, you are going to get
>> mynameNNNNN, or something similar. If you do a gethostname() on an
>> ALPS node, you are going to get nidNNNNN; there is no differentiation
>> between cluster A and cluster B.
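>>
>> For example (a sketch only - where the cluster name comes from, an
>> environment variable here, is purely an assumption for illustration):
>>
>>    #include <stdio.h>
>>    #include <stdlib.h>
>>    #include <unistd.h>
>>
>>    int main(void)
>>    {
>>        char host[256];
>>        const char *cluster = getenv("MY_CLUSTER_NAME"); /* hypothetical */
>>
>>        gethostname(host, sizeof(host));  /* yields "nidNNNNN" on an ALPS node */
>>        printf("%s.%s\n",                 /* e.g. "clusterA.nid01234"          */
>>               cluster ? cluster : "unknown", host);
>>        return 0;
>>    }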
>>
>> Perhaps my earlier comment was not accurate. In reality, it provides
>> the same degree of identification for ALPS nodes as the hostname
>> provides for normal clusters. From your perspective, it is immaterial
>> that it also would allow us to support our limited form of
>> multi-cluster operation. However, in and of itself, it only provides
>> the same level of identification as is already available for other
>> cluster nodes.
>> --
>> Ken
>>
>>
>> -----Original Message-----
>> From: Ralph Castain [mailto:rhc_at_[hidden]]
>> Sent: Monday, September 22, 2008 2:33 PM
>> To: Open MPI Developers
>> Cc: Matney Sr, Kenneth D.
>> Subject: Re: [OMPI devel] [OMPI svn] svn:open-mpi r19600
>>
>> The issue isn't with adding a string. The question is whether or not
>> OMPI is to support one job running across multiple clusters. We
>> made a
>> conscious decision (after lengthy discussions on OMPI core and ORTE
>> mailing lists, plus several telecons) to not do so - we require that
>> the job execute on a single cluster, while allowing connect/accept to
>> occur between jobs on different clusters.
>>
>> It is difficult to understand why we need a string (or our old "cell
>> id") to tell us which cluster we are on if we are only following that
>> operating model. From the commit comment, and from what I know of the
>> system, the only rationale for adding such a designator is to shift
>> back to the one-mpirun-spanning-multiple-cluster model.
>>
>> If we are now going to make that change, then it merits the same
>> level of consideration as the last decision to move away from that
>> model. Making that move involves considerably more than just adding a
>> cluster id string. You may think that is all it takes now, but the
>> next step is inevitably to bring back remote launch, killing jobs on
>> all clusters when one cluster has a problem, etc.
>>
>> Before we go down this path and re-open Pandora's box, we should at
>> least agree that is what we intend to do...or agree on what hard
>> constraints we will place on multi-cluster operations. Frankly, I'm
>> tired of bouncing back-and-forth on even the most basic design
>> decisions.
>>
>> Ralph
>>
>>
>>
>> On Sep 22, 2008, at 11:55 AM, Richard Graham wrote:
>>
>>> What Ken put in is all that is needed for the limited multi-cluster
>>> capabilities we want - just one additional string. I don't think such
>>> a small change needs any discussion.
>>>
>>> Rich
>>>
>>>
>>> On 9/22/08 1:32 PM, "Ralph Castain" <rhc_at_[hidden]> wrote:
>>>
>>>> We really should discuss that as a group first - there is quite a
>>>> bit
>>>> of code required to actually support multi-clusters that has been
>>>> removed.
>>>>
>>>> Our operational model that was agreed to quite a while ago is that
>>>> mpirun can -only- extend over a single "cell". You can connect/
>>>> accept
>>>> multiple mpiruns that are sitting on different cells, but you
>>>> cannot
>>>> execute a single mpirun across multiple cells.
>>>>
>>>> Please keep this on your own development branch for now. Bringing
>>>> it
>>>> into the trunk will require discussion as this changes the
>>>> operating
>>>> model, and has significant code consequences when we look at
>>>> abnormal
>>>> terminations, comm_spawn, etc.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Sep 22, 2008, at 11:26 AM, Richard Graham wrote:
>>>>
>>>>> This check-in was in error - I had not realized that the checkout
>>>>> was from the 1.3 branch, so we will fix this and put these changes
>>>>> into the trunk (1.4). We are going to bring in some limited
>>>>> multi-cluster support - "limited" is the operative word.
>>>>>
>>>>> Rich
>>>>>
>>>>>
>>>>> On 9/22/08 12:50 PM, "Jeff Squyres" <jsquyres_at_[hidden]> wrote:
>>>>>
>>>>>> I notice that Ken Matney (the committer) is not on the devel
>>>>>> list; I
>>>>>> added him explicitly to the CC line.
>>>>>>
>>>>>> Ken: please see below.
>>>>>>
>>>>>>
>>>>>> On Sep 22, 2008, at 12:46 PM, Ralph Castain wrote:
>>>>>>
>>>>>>> Whoa! We made a decision NOT to support multi-cluster apps in
>>>>>>> OMPI
>>>>>>> over a year ago!
>>>>>>>
>>>>>>> Please remove this from 1.3 - we should discuss if/when this
>>>>>>> would
>>>>>>> even be allowed in the trunk.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ralph
>>>>>>>
>>>>>>> On Sep 22, 2008, at 10:35 AM, matney_at_[hidden] wrote:
>>>>>>>
>>>>>>>> Author: matney
>>>>>>>> Date: 2008-09-22 12:35:54 EDT (Mon, 22 Sep 2008)
>>>>>>>> New Revision: 19600
>>>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/19600
>>>>>>>>
>>>>>>>> Log:
>>>>>>>> Added member to orte_node_t to enable multi-cluster jobs in
>>>>>>>> ALPS
>>>>>>>> scheduled systems (like Cray XT).
>>>>>>>>
>>>>>>>> Text files modified:
>>>>>>>> branches/v1.3/orte/runtime/orte_globals.h | 4 ++++
>>>>>>>> 1 files changed, 4 insertions(+), 0 deletions(-)
>>>>>>>>
>>>>>>>> Modified: branches/v1.3/orte/runtime/orte_globals.h
>>>>>>>> ==============================================================================
>>>>>>>> --- branches/v1.3/orte/runtime/orte_globals.h (original)
>>>>>>>> +++ branches/v1.3/orte/runtime/orte_globals.h 2008-09-22
>>>>>>>> 12:35:54
>>>>>>>> EDT (Mon, 22 Sep 2008)
>>>>>>>> @@ -222,6 +222,10 @@
>>>>>>>> /** Username on this node, if specified */
>>>>>>>> char *username;
>>>>>>>> char *slot_list;
>>>>>>>> + /** Clustername (machine name of cluster) on which this
>>>>>>>> node
>>>>>>>> + resides. ALPS scheduled systems need this to enable
>>>>>>>> + multi-cluster support. */
>>>>>>>> + char *clustername;
>>>>>>>> } orte_node_t;
>>>>>>>> ORTE_DECLSPEC OBJ_CLASS_DECLARATION(orte_node_t);
>>>>>>>>