Open MPI Development Mailing List Archives

From: Ralph H Castain (rhc_at_[hidden])
Date: 2007-07-26 18:00:39


On 7/26/07 2:24 PM, "Aurelien Bouteiller" <bouteill_at_[hidden]> wrote:

> Ralph H Castain wrote:
>> After some investigation, I'm afraid that I have to report that this - as
>> far as I understand what you are doing - may no longer work in Open MPI in
>> the future (and I'm pretty sure isn't working in the trunk today except
>> [maybe] in the special case of hostfile - haven't verified that).
>>
>> To ensure we are correctly communicating, let me reiterate what I understand
>> you are doing:
>>
> Correct. Also consider that for my testing I use a batch scheduler that
> is not managed by orte right now, and I provide the hostfiles myself (this
> batch scheduler is named OAR and is in use on the grid5000 research
> facility in France).
>
>> This was caused by mpirun itself processing its local environment and then
>> "pushing" it into the global registry. Keeping everything separated causes a
>> bookkeeper's headache and many lines of code that we would like to
>> eliminate.
>>
>>
> I see the point. I agree there is very little benefit in allowing users to
> have different local environments on different mpirun instances, while
> it would be a real pain to keep the code that manages this clean. For my
> own usage, the app_context feature you described is a more elegant and
> equivalent way of spawning my FT services. I will switch to this right
> away.
>
> Still, it might be of some use to be able to start different mpiruns the
> same way you plan comm_spawn to work: sharing the same environment, but
> allowing for use of a different hostfile. The use case that comes to
> mind is "grid", where different batch schedulers are in use on each
> cluster, so you can't gather a single hostfile. This is not a feature I
> would fight for, but I can imagine some people might find it useful.

One of the design changes we made was to explicitly not support
multi-cluster operations from inside of Open MPI. Instead, people (not us)
are looking at adding a layer on top of Open MPI to handle the cross-cluster
coordination. I expect you'll hear more about those efforts in the
not-too-distant future.

>
> More important for me is the ability to refill the hostfile with fresh
> hosts when some of the original ones have died. Allocating a huge number
> of spares preventively is just not the correct way to go. Besides, I am
> not sure that even the best comm_spawn you discussed would be of any
> help in this case, as I do not want the new nodes to go in a different
> COMM_WORLD. Finding a way to update the registry and all the orted
> daemons to do so is a much larger issue than simple spawning, and I have
> not really thought about it yet. Maybe we should discuss this issue
> separately.

Ah, now -that- is a different topic indeed. I do plan to support a dynamic
add_hosts API as part of the revamped system. I'll try to flesh that out as
a separate RFC later.

Thanks
Ralph

>
> Aurelien
>> Please feel free to comment. If this is a big enough issue to a large enough
>> audience, then we can try to find a way to solve it (assuming Open MPI's
>> community decides to support it).
>>
>> Ralph
>>
>>
>>
>>>>> The next requirement is the ability to add nodes to the initial pool
>>>>> at runtime. Because nodes may fail (but it is basically the same with
>>>>> comm_spawn), I might need some (a lot of) spare nodes to replace failed
>>>>> ones. As I do not want to request twice as many nodes as I need
>>>>> (after all, things could just go fine, so why should I waste that many
>>>>> computing resources on idle spares?), I definitely want to be able to
>>>>> allocate some new nodes to the pool of already running machines. As
>>>>> far as I understand, this is impossible to achieve with usecase2 and
>>>>> quite difficult in usecase1. In my opinion, having the ability to spawn
>>>>> on nodes which are not part of the initial hostfile is a key feature
>>>>> (and not only for FT purposes).
>>>>>
>>>>>
>>>>>
>>>>>
>>>> I am looking for more detail on the above issue. What
>>>> resource manager are you using?
>>>>
>>>> Ideally, we would prefer not to support this. Any nodes
>>>> that you run on, or hope to run on, would be designated
>>>> at the start. For example:
>>>>
>>>> mpirun -np 1 --host a,b,c,d,e,f,g
>>>>
>>>> This would cause the one process of the MPI job to start on host a.
>>>> The MPI job then has the other hosts available to it should it decide
>>>> later to start a job on them. However, no ORTE daemons would
>>>> be started on those nodes until calls to MPI_Comm_spawn
>>>> occur. So, the MPI job would not be consuming any resources
>>>> until called upon to do so.
>>>>
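The behaviour described above (all hosts named up front, daemons started only when something is actually launched on a host) can be modelled with a small sketch. This is a toy illustration, not ORTE code: the `HostPool` class and its `launch` method are hypothetical names, and the call to `launch("d")` merely stands in for a later MPI_Comm_spawn.

```python
class HostPool:
    """Toy model of a host list where daemons start only on first use."""

    def __init__(self, hosts):
        self.hosts = list(hosts)   # every host designated at startup
        self.daemons = set()       # hosts that currently run a daemon

    def launch(self, host):
        """Launch on host; return True if a daemon had to be started first."""
        if host not in self.hosts:
            raise ValueError(f"{host!r} was not designated at startup")
        needs_daemon = host not in self.daemons
        self.daemons.add(host)
        return needs_daemon


# Mirrors: mpirun -np 1 --host a,b,c,d,e,f,g
pool = HostPool("a,b,c,d,e,f,g".split(","))
pool.launch("a")   # the single initial process lands on host a

# Hosts that are available but not yet running a daemon.
idle = [h for h in pool.hosts if h not in pool.daemons]

pool.launch("d")   # stands in for a later MPI_Comm_spawn onto host d
```

In this model a host outside the initial list is rejected, which is exactly the restriction the thread is debating.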
>>> This has actually been the subject of multiple threads on the user list and
>>> is considered a critical capability by some users and vendors. I believe
>>> there is little problem in allowing those systems that can support it to
>>> dynamically add nodes to ORTE via some API into the resource manager. At the
>>> moment, none of the RMs support it, but LSF will (and TM at least may)
>>> shortly do so, and some of their customers are depending upon it.
>>>
>>> The problem is that job startup could be delayed for significant time if all
>>> hosts must be preallocated. Admittedly, this raises all kinds of issues
>>> about how long the job could be stalled waiting for the new hosts. However,
>>> as the other somewhat exhaustive threads have discussed, there are computing
>>> models that can live with this uncertainty, and RMs that will provide async
>>> callbacks to allow the rest of the app to continue working while waiting.
>>>
>>> Just my $0.00002 - again, this goes back to...are there use-cases and
>>> customers to which Open MPI is simply going to say "we won't support that"?
>>>
>>>
>>>> Rolf
>>>>
>>>>
>>>>> I know there have been some extra discussions on this subject.
>>>>> Unfortunately it looks like I am not part of the list where it happened.
>>>>> I hope those concerns have not been already discussed.
>>>>>
>>>>> Aurelien
>>>>>
>>>>> Ralph H Castain wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Yo all
>>>>>>
>>>>>> As you know, I am working on revamping the hostfile functionality to
>>>>>> make it work better with managed environments (at the moment, the two
>>>>>> are exclusive). The issue that we need to review is how we want the
>>>>>> interaction to work, both for the initial launch and for comm_spawn.
>>>>>>
>>>>>> In talking with Jeff, we boiled it down to two options that I have
>>>>>> flow-charted (see attached):
>>>>>>
>>>>>> Option 1: in this mode, we read any allocated nodes provided by a
>>>>>> resource manager (e.g., SLURM). These nodes establish a base pool of
>>>>>> nodes that can be used by both the initial launch and any dynamic
>>>>>> comm_spawn requests. The hostfile and any -host info is then used to
>>>>>> select nodes from within that pool for use with the specific launch.
>>>>>> The initial launch would use the -hostfile or -host command line
>>>>>> option to provide that info - comm_spawn would use the MPI_Info fields
>>>>>> to provide similar info.
>>>>>>
>>>>>> This mode has the advantage of allowing a user to obtain a large
>>>>>> allocation, and then designate hosts within the pool for use by an
>>>>>> initial application, and separately designate (via another hostfile or
>>>>>> -host spec) another set of those hosts from the pool to support a
>>>>>> comm_spawn'd child job.
>>>>>>
>>>>>> If no resource managed nodes are found, then the hostfile and -host
>>>>>> options would provide the list of hosts to be used. Again, comm_spawn'd
>>>>>> jobs would be able to specify their own hostfile and -host nodes.
>>>>>>
>>>>>> The negative to this option is complexity - in the absence of a managed
>>>>>> allocation, I either have to deal with hostfile/dash-host allocations
>>>>>> in the RAS and then again in RMAPS, or I have "allocation-like"
>>>>>> functionality happening in RMAPS.
>>>>>>
>>>>>>
>>>>>> Option 2: in this mode, we read any allocated nodes provided by a
>>>>>> resource manager, and then filter those using the command line hostfile
>>>>>> and -host options to establish our base pool. Any spawn commands (both
>>>>>> the initial one and comm_spawn'd child jobs) would utilize this filtered
>>>>>> pool of nodes. Thus, comm_spawn is restricted to using hosts from that
>>>>>> initial pool.
>>>>>>
>>>>>> We could possibly extend this option by only using the hostfile in our
>>>>>> initial filter. In other words, let the hostfile downselect the resource
>>>>>> manager's allocation for the initial launch. Any -host options on the
>>>>>> command line would only apply to the hosts used to launch the initial
>>>>>> application. Any comm_spawn would use the hostfile-filtered pool of
>>>>>> hosts.
>>>>>>
>>>>>> The advantage here is simplicity. The disadvantage lies in flexibility
>>>>>> for supporting dynamic operations.
>>>>>>
>>>>>>
>>>>>> The major difference between these options really only impacts the
>>>>>> initial pool of hosts to be used for launches, both the initial one and
>>>>>> any subsequent comm_spawns. Barring any commentary, I will implement
>>>>>> option 1 as this provides the maximum flexibility.
>>>>>>
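The difference between the two options can be sketched in a few lines. This is only an illustration of the selection logic as described above, not the RAS/RMAPS implementation: the function names and node names are hypothetical, and the handling of an empty request (fall back to the whole pool) is an assumption the thread does not spell out.

```python
def select_from_pool(pool, requested):
    """Keep the requested hosts that are in the pool, in request order."""
    allowed = set(pool)
    return [h for h in requested if h in allowed]


def option1_hosts(rm_nodes, requested):
    """Option 1: every launch selects from the full resource-manager pool."""
    if not rm_nodes:
        return list(requested)   # unmanaged: the request itself is the list
    if not requested:
        return list(rm_nodes)    # assumption: no request means whole pool
    return select_from_pool(rm_nodes, requested)


def option2_base_pool(rm_nodes, cmdline_hosts):
    """Option 2: filter the allocation once with command line hostfile/-host."""
    if not rm_nodes:
        return list(cmdline_hosts)   # assumption: same unmanaged fallback
    if not cmdline_hosts:
        return list(rm_nodes)
    return select_from_pool(rm_nodes, cmdline_hosts)


def option2_hosts(base_pool, requested):
    """Option 2: later launches may only pick from the filtered base pool."""
    if not requested:
        return list(base_pool)
    return select_from_pool(base_pool, requested)


rm_nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]   # hypothetical allocation
initial = ["n1", "n2"]                            # -hostfile / -host at launch
spawn = ["n5", "n6"]                              # hosts named via MPI_Info

opt1_initial = option1_hosts(rm_nodes, initial)
opt1_spawn = option1_hosts(rm_nodes, spawn)

base = option2_base_pool(rm_nodes, initial)
opt2_spawn = option2_hosts(base, spawn)
```

Under option 1 the child job can reach n5 and n6 because it selects from the full allocation. Under option 2 the same request comes back empty, since the base pool was already narrowed to n1 and n2 at launch.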
>>>>>> Any thoughts? Other options we should consider?
>>>>>>
>>>>>> Thanks
>>>>>> Ralph
>>>>>>
>>>>>> ------------------------------------------------------------------------
>>>>>>
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> devel_at_[hidden]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel