Perhaps some bad news on this subject - see below.
On 7/26/07 7:53 AM, "Ralph H Castain" <rhc_at_[hidden]> wrote:
> On 7/26/07 7:33 AM, "Rolf.Vandevaart_at_[hidden]" <Rolf.Vandevaart_at_[hidden]>
> wrote:
>> Aurelien Bouteiller wrote:
>>> Currently I launch two different mpiruns with a single orte
>>> seed holding the registry. This way I get two different hostfiles, one
>>> for computing nodes, one for FT services. I just want to make sure
>>> everybody understood this requirement so that this feature does not
>>> disappear in the brainstorming :]
After some investigation, I'm afraid that I have to report that this - as
far as I understand what you are doing - may no longer work in Open MPI in
the future (and I'm pretty sure isn't working in the trunk today except
[maybe] in the special case of hostfile - haven't verified that).
To ensure we are correctly communicating, let me reiterate what I understand
you are doing:
1. in one window, you start a persistent daemon. You then enter "mpirun" to
that command line, specifying a hostfile (let's call it "foo" for now) and
the universe used to start the persistent daemon. Thus, mpirun connects to
that universe and runs within it.
2. in another window, you type "mpirun" to the command line, specifying a
different hostfile ("bar") and again giving it the universe used to start
the persistent daemon. Thus, both mpiruns are being "managed" by the same
HNP (the persistent daemon).
First, there are major issues here involving confusion over allocations and
synchronization between the lifetimes of the two jobs started in this
manner. You may not see those in hostfile-only use cases, but for managed
environments, this proved to cause undesirable confusion over process
placement and unexpected application failures. Accordingly, we have been
working to eliminate this usage (although the trunk will currently still
allow it in some cases).
This was caused by mpirun itself processing its local environment and then
"pushing" it into the global registry. Keeping everything separated causes a
bookkeeper's headache and many lines of code that we would like to
The current design for future releases processes allocations only at the HNP itself.
Thus, the persistent daemon would only be capable of sensing its own local
allocation - it cannot see an allocation obtained in a separate
window/login. This unfortunately extends to hostfiles as well - the
persistent daemon can process the hostfile provided on its command line or
environment, but has no mechanism for reading another one.
The exception to this is comm_spawn. Our current intent was to allow
comm_spawn to specify a hostfile that could be read by the HNP and used for
the child job. However, we are still discussing whether this hostfile should
be allowed to "add" nodes to the known available resources, or only specify
a subset of the already-known resource pool. I suspect we will opt for the
latter interpretation as we otherwise open an entirely different set of
So I am not sure that you will be able to continue working this way. You may
have to start your regular application with the larger pool of resources,
specify the ones you want used for the application itself via -host, and
then "comm_spawn" your FT services on the other nodes using -host in that
launch. Alternatively, you could use the multiple app_context capability to
start it all from the command line:
mpirun -hostfile big_pool -n 10 -host 1,2,3,4 application : -n 2 -host
Hope that helps explain things. As I hope I have indicated, I -think- you
will still be able to do what you described, but probably not the way you
have been doing it.
Please feel free to comment. If this is a big enough issue to a large enough
audience, then we can try to find a way to solve it (assuming Open MPI's
community decides to support it).
>>> The next requirement is the ability to add nodes to the initial pool at
>>> runtime. Because nodes may fail (though it is basically the same with
>>> comm_spawn), I might need some (a lot of) spare nodes to replace failed
>>> ones. As I do not want to request twice as many nodes as I need
>>> (after all, things could just go fine, why should I waste that many
>>> computing resources for idle spares?), I definitely want to be able to
>>> allocate some new nodes to the pool of the already running machines. As
>>> far as I understand, this is impossible to achieve with the usecase2 and
>>> quite difficult in usecase1. In my opinion, having the ability to spawn
>>> on nodes which are not part of the initial hostfile is a key feature
>>> (and not only for FT purposes).
>> I am looking for more detail on the above issue. What
>> resource manager are you using?
>> Ideally, we would prefer not to support this. Any nodes
>> that you run on, or hope to run on, would be designated
>> at the start. For example:
>> mpirun -np 1 --host a,b,c,d,e,f,g
>> This would cause the one process of the mpi job to start on host a.
>> Then, the mpi job has available to it the other hosts should it decide
>> later to start a job on them. However no ORTE daemons would
>> be started on those nodes until calls to MPI_Comm_spawn
>> occur. So, the MPI job would not be consuming any resources
>> until called upon to.
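The behavior described above (all hosts named at startup, daemons started only when a spawn actually targets a host) can be modeled roughly as follows. This is a toy sketch, not ORTE code: the `HostPool` class and its `start_on` method are hypothetical names invented for illustration, and rejecting unlisted hosts with an exception is an assumption about how "designated at the start" would be enforced.

```python
class HostPool:
    """Hypothetical model of hosts named at launch, with daemons started on demand."""

    def __init__(self, hosts):
        self.hosts = list(hosts)   # every host named on --host
        self.daemons = set()       # hosts where a daemon is actually running

    def start_on(self, host):
        """Start a daemon on `host` if needed; reject hosts not named at launch."""
        if host not in self.hosts:
            raise ValueError(f"{host} was not designated at startup")
        self.daemons.add(host)
        return host


# Mirrors "mpirun -np 1 --host a,b,c,d,e,f,g": one process, seven known hosts.
pool = HostPool("a,b,c,d,e,f,g".split(","))
pool.start_on(pool.hosts[0])   # the single initial process lands on host a
idle = [h for h in pool.hosts if h not in pool.daemons]
```

In this model the hosts in `idle` cost nothing until a later spawn calls `start_on` for them, which is the point being made about not consuming resources up front.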
> This has actually been the subject of multiple threads on the user list and
> is considered a critical capability by some users and vendors. I believe
> there is little problem in allowing those systems that can support it to
> dynamically add nodes to ORTE via some API into the resource manager. At the
> moment, none of the RMs support it, but LSF will (and TM at least may)
> shortly do so, and some of their customers are depending upon it.
> The problem is that job startup could be delayed for significant time if all
> hosts must be preallocated. Admittedly, this raises all kinds of issues
> about how long the job could be stalled waiting for the new hosts. However,
> as the other somewhat exhaustive threads have discussed, there are computing
> models that can live with this uncertainty, and RMs that will provide async
> callbacks to allow the rest of the app to continue working while waiting.
> Just my $0.00002 - again, this goes back to...are there use-cases and
> customers to which Open MPI is simply going to say "we won't support that"?
>>> I know there have been some extra discussions on this subject.
>>> Unfortunately it looks like I am not part of the list where it happened.
>>> I hope those concerns have not been already discussed.
>>> Ralph H Castain wrote:
>>>> Yo all
>>>> As you know, I am working on revamping the hostfile functionality to make
>>>> it work better with managed environments (at the moment, the two are
>>>> exclusive). The issue that we need to review is how we want the interaction
>>>> to work, both for the initial launch and for comm_spawn.
>>>> In talking with Jeff, we boiled it down to two options that I have
>>>> flow-charted (see attached):
>>>> Option 1: in this mode, we read any allocated nodes provided by a resource
>>>> manager (e.g., SLURM). These nodes establish a base pool of nodes that can
>>>> be used by both the initial launch and any dynamic comm_spawn requests. The
>>>> hostfile and any -host info is then used to select nodes from within that
>>>> pool for use with the specific launch. The initial launch would use the
>>>> -hostfile or -host command line option to provide that info - comm_spawn
>>>> would use the MPI_Info fields to provide similar info.
>>>> This mode has the advantage of allowing a user to obtain a large allocation
>>>> and then designate hosts within the pool for use by an initial application,
>>>> and separately designate (via another hostfile or -host spec) another set
>>>> of those hosts from the pool to support a comm_spawn'd child job.
>>>> If no resource managed nodes are found, then the hostfile and -host options
>>>> would provide the list of hosts to be used. Again, comm_spawn'd jobs would
>>>> be able to specify their own hostfile and -host nodes.
>>>> The negative to this option is complexity - in the absence of a managed
>>>> allocation, I either have to deal with hostfile/dash-host allocations in
>>>> RAS and then again in RMAPS, or I have "allocation-like" functionality
>>>> happening in RMAPS.
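A minimal sketch of the Option 1 selection logic, as I read the description above. It is illustrative only and not taken from the RAS or RMAPS code: the function names are hypothetical, hostfile and -host entries are merged into a single `requested` list for simplicity, and raising an error for a host outside the pool is an assumption.

```python
def base_pool(rm_nodes, requested=()):
    """Option 1 pool: the resource manager's allocation if there is one,
    otherwise the hostfile/-host entries themselves."""
    if rm_nodes:
        return list(rm_nodes)
    pool = []
    for node in requested:
        if node not in pool:       # drop duplicates, keep first-seen order
            pool.append(node)
    return pool


def select_for_launch(pool, requested=()):
    """Nodes for one launch (initial job or comm_spawn'd child).
    With no request, the whole pool is usable."""
    if not requested:
        return list(pool)
    unknown = [n for n in requested if n not in pool]
    if unknown:
        # Assumption: asking for a host outside the pool is an error.
        raise ValueError(f"hosts not in pool: {unknown}")
    return [n for n in pool if n in requested]


pool = base_pool(["n1", "n2", "n3", "n4"])
initial_job = select_for_launch(pool, ["n1", "n2"])   # e.g. from -host
child_job = select_for_launch(pool, ["n4", "n3"])     # e.g. from MPI_Info
```

The key property is that `pool` is fixed once, while each launch makes its own selection from it, so the parent and a spawned child can use disjoint parts of the same allocation.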
>>>> Option 2: in this mode, we read any allocated nodes provided by a resource
>>>> manager, and then filter those using the command line hostfile and -host
>>>> options to establish our base pool. Any spawn commands (both the initial
>>>> and comm_spawn'd child jobs) would utilize this filtered pool of nodes.
>>>> Thus, comm_spawn is restricted to using hosts from that initial pool.
>>>> We could possibly extend this option by only using the hostfile in our
>>>> initial filter. In other words, let the hostfile downselect the resource
>>>> manager's allocation for the initial launch. Any -host options on the
>>>> command line would only apply to the hosts used to launch the initial
>>>> application. Any comm_spawn would use the hostfile-filtered pool of hosts.
>>>> The advantage here is simplicity. The disadvantage lies in flexibility for
>>>> supporting dynamic operations.
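For comparison, a rough sketch of Option 2 and its possible extension. Again this is only an illustration under stated assumptions: `option2_pool` and the `host_filters_pool` flag are hypothetical names, and the sketch assumes a resource-manager allocation exists, since the description does not say what happens without one.

```python
def option2_pool(rm_nodes, hostfile_nodes=(), dash_host=(), host_filters_pool=True):
    """Option 2 pool: the resource manager's allocation, filtered once at startup.

    host_filters_pool=True  -> both the hostfile and -host narrow the pool.
    host_filters_pool=False -> the extension: only the hostfile narrows the
                               pool, and -host affects only the initial launch.
    """
    pool = list(rm_nodes)
    if hostfile_nodes:
        pool = [n for n in pool if n in hostfile_nodes]
    if dash_host and host_filters_pool:
        pool = [n for n in pool if n in dash_host]
    return pool


rm = ["n1", "n2", "n3", "n4"]
strict = option2_pool(rm, ["n1", "n2", "n3"], ["n1", "n2"])
extended = option2_pool(rm, ["n1", "n2", "n3"], ["n1", "n2"], host_filters_pool=False)
```

Every later comm_spawn would draw from the returned pool, which is why this option is simpler but cannot reach hosts that were filtered out at startup.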
>>>> The major difference between these options really only impacts the initial
>>>> pool of hosts to be used for launches, both the initial one and any
>>>> subsequent comm_spawns. Barring any commentary, I will implement option 1
>>>> as this provides the maximum flexibility.
>>>> Any thoughts? Other options we should consider?
>>>> devel mailing list