Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-29 18:19:28

Oh my - that does take me back a long way! :-)

Do you need these processes to be mapped byslot (i.e., do you care if
the process ranks are sharing nodes)? If not, why not add "-bynode" to
your command line?

Alternatively, given the mapping you want, just do

mpirun -npernode 1 application.exe

This would launch one copy on each of your N nodes. So if you fork M
times, you'll wind up with the exact pattern you wanted. And, as each
one exits, you could immediately launch a replacement without worrying
about oversubscription.
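
A minimal sketch of the driver side of this, assuming a Python script on the launch node. It keeps at most M launches in flight and starts the next one as soon as any of them exits. It uses a thread pool rather than fork(), and the helper names (mpirun_cmd, run_all) are hypothetical; only the mpirun command line itself comes from the suggestion above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def mpirun_cmd(executable):
    """Build the launch command suggested above: one copy per node."""
    return ["mpirun", "-npernode", "1", executable]


def run_case(cmd):
    """Run one launch command, block until it exits, return its exit code."""
    return subprocess.call(cmd)


def run_all(cases, max_in_flight):
    """Run every command in `cases`, at most `max_in_flight` at a time.

    Each worker thread blocks on one launch; when that launch exits, the
    worker picks up the next pending command, so a replacement starts
    immediately. Exit codes come back in the order of `cases`.
    """
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(run_case, cases))
```

For example, run_all([mpirun_cmd("application.exe")] * 10, max_in_flight=M) would work through ten launches, M at a time.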

Does that help?

PS. We dropped that "persistent" operation - it caused way too many
problems with cleanup and other things. :-)

On Jul 29, 2009, at 3:46 PM, Adams, Brian M wrote:

> Hi Ralph (all),
>
> I'm resurrecting this 2006 thread for a status check. The new 1.3.x
> machinefile behavior is great (thanks!) -- I can use machinefiles to
> manage multiple simultaneous mpiruns within a single Torque
> allocation (where the hosts are a subset of $PBS_NODEFILE).
> However, this requires some careful management of machinefiles.
>
> I'm curious if Open MPI now directly supports the behavior I need,
> described in general in the quote below. Specifically, given a
> single PBS/Torque allocation of M*N processors, I will run a serial
> program that will fork M times. Each of the M forked processes
> calls 'mpirun -np N application.exe' and blocks until completion.
> This seems akin to the case you described of "mpiruns executed in
> separate windows/prompts."
>
> What I'd like to see is the M processes "tiled" across the available
> slots, so all M*N processors are used. What I see instead appears
> at face value to be the first N resources being oversubscribed M
> times.
>
> Also, when one of the forked processes returns, I'd like to be able
> to spawn another and have its mpirun schedule on the resources freed
> by the previous one that exited. Is any of this possible?
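
The machinefile bookkeeping described above could be sketched roughly as follows. This is not something Open MPI does for you: it assumes the allocation is available as a list with one hostname per slot (the usual layout of $PBS_NODEFILE), and the function names and the machinefile.<k> naming are hypothetical.

```python
import os


def tile_slots(slots, n_per_job):
    """Split a slot list into consecutive blocks of n_per_job slots.

    `slots` holds one hostname per processor slot. Any slots left over
    after the last full block are dropped.
    """
    if n_per_job <= 0:
        raise ValueError("n_per_job must be positive")
    usable = len(slots) - len(slots) % n_per_job
    return [slots[i:i + n_per_job] for i in range(0, usable, n_per_job)]


def write_machinefiles(blocks, directory):
    """Write one machinefile per block and return the file paths."""
    paths = []
    for k, block in enumerate(blocks):
        path = os.path.join(directory, f"machinefile.{k}")
        with open(path, "w") as handle:
            handle.write("\n".join(block) + "\n")
        paths.append(path)
    return paths
```

Forked process k would then run something like 'mpirun -machinefile machinefile.k -np N application.exe', and a replacement job can reuse the machinefile of whichever job just exited.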
> I tried starting an orted (1.3.3, roughly as you suggested below),
> but got this error:
> $ orted --daemonize
> [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> runtime/orte_init.c at line 125
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> orte_ess_base_select failed
> --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> orted/orted_main.c at line 323
> I've spared you the debugging info, as I'm not even sure this is a
> correct invocation...
> Thanks for any suggestions you can offer!
> Brian
> ----------
> Brian M. Adams, PhD (briadam_at_[hidden])
> Optimization and Uncertainty Quantification
> Sandia National Laboratories, Albuquerque, NM
>> From: Ralph Castain (rhc_at_[hidden])
>> Date: 2006-12-12 00:46:59
>> Hi Chris
>> Some of this is doable with today's code... and one of these
>> behaviors is not. :-(
>> Open MPI/OpenRTE can be run in "persistent" mode - this
>> allows multiple jobs to share the same allocation. This works
>> much as you describe (syntax is slightly different, of
>> course!) - the first mpirun will map using whatever mode was
>> requested, then the next mpirun will map starting from where
>> the first one left off.
>> I *believe* you can run each mpirun in the background.
>> However, I don't know if this has really been tested enough
>> to support such a claim. All testing that I know about
>> to-date has executed mpirun in the foreground - thus, your
>> example would execute sequentially instead of in parallel.
>> I know people have tested multiple mpirun's operating in
>> parallel within a single allocation (i.e., persistent mode)
>> where the mpiruns are executed in separate windows/prompts.
>> So I suspect you could do something like you describe - just
>> haven't personally verified it.
>> Where we definitely differ is that Open MPI/RTE will *not*
>> block until resources are freed up from the prior mpiruns.
>> Instead, we will attempt to execute each mpirun immediately -
>> and will error out the one(s) that try to execute without
>> sufficient resources. I imagine we could provide the kind of
>> "flow control" you describe, but I'm not sure when that might happen.
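
The difference between the two behaviors described above (error out immediately versus block until resources free up) can be illustrated with a small slot counter. This is only a sketch of the idea, not how Open MPI or orted tracks resources; the SlotPool class and its methods are hypothetical.

```python
import threading


class SlotPool:
    """Track free slots in an allocation shared by several launches."""

    def __init__(self, total):
        self._total = total
        self._free = total
        self._cond = threading.Condition()

    def acquire(self, n, block=True):
        """Claim n slots.

        With block=False, return False right away if fewer than n slots
        are free (the "error out" behavior). With block=True, wait until
        enough slots have been released (the "flow control" behavior).
        """
        if n > self._total:
            raise ValueError("request exceeds the whole allocation")
        with self._cond:
            if not block and self._free < n:
                return False
            while self._free < n:
                self._cond.wait()
            self._free -= n
            return True

    def release(self, n):
        """Return n slots to the pool and wake any waiting requests."""
        with self._cond:
            self._free += n
            self._cond.notify_all()
```

A driver would call acquire(N) before each mpirun and release(N) once that mpirun exits.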
>> I am (in my copious free time...haha) working on an
>> "orteboot" program that will startup a virtual machine to
>> make the persistent mode of operation a little easier. For
>> now, though, you can do it by:
>>
>> 1. starting up the "server" using the following command:
>>      orted --seed --persistent --scope public [--universe foo]
>> 2. do your mpirun commands. They will automagically find the
>>    "server" and connect to it. If you specified a universe name
>>    when starting the server, then you must specify the same
>>    universe name on your mpirun commands.
>>
>> When you are done, you will have to (unfortunately) manually
>> "kill" the server and remove its session directory. I have a
>> program called "ortehalt" in the trunk that will do this cleanly
>> for you, but it isn't yet in the release distributions. You are
>> welcome to use it, though, if you are working with the trunk - I
>> can't promise it is bulletproof yet, but it seems to be working.
>> Ralph
>> On 12/11/06 8:07 PM, "Maestas, Christopher Daniel"
>> <cdmaest_at_[hidden]> wrote:
>>> Hello,
>>> Sometimes we have users who like to run the following from within a
>>> single job (think of a schedule within a job scheduler allocation):
>>> "mpiexec -n X myprog"
>>> "mpiexec -n Y myprog2"
>>> Does mpiexec within Open MPI keep track of the node list it is using
>>> if it binds to a particular scheduler?
>>> For example with 4 nodes (2ppn SMP):
>>> "mpiexec -n 2 myprog"
>>> "mpiexec -n 2 myprog2"
>>> "mpiexec -n 1 myprog3"
>>> Assuming by-slot allocation, we would have the following layout:
>>> node1 - processor1 - myprog
>>>       - processor2 - myprog
>>> node2 - processor1 - myprog2
>>>       - processor2 - myprog2
>>> And for a by-node allocation:
>>> node1 - processor1 - myprog
>>>       - processor2 - myprog2
>>> node2 - processor1 - myprog
>>>       - processor2 - myprog2
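
The two layouts above can be reproduced with a toy placement model. This is an illustration only, not Open MPI's actual mapper: it assumes each job starts its search at the first node, that by-slot fills a node before moving on, and that by-node moves to the following node after every rank. The place_jobs function is hypothetical.

```python
def place_jobs(nodes, slots_per_node, jobs, policy):
    """Place jobs one after another and return {node: [job name per slot]}.

    `jobs` is a list of (name, nprocs) pairs; `policy` is "byslot" or
    "bynode". Raises RuntimeError when no free slot is left.
    """
    if policy not in ("byslot", "bynode"):
        raise ValueError("policy must be 'byslot' or 'bynode'")
    layout = {node: [] for node in nodes}

    for name, nprocs in jobs:
        cursor = 0  # every job starts looking at the first node
        for _ in range(nprocs):
            # Visit the nodes in order from the cursor, wrapping around.
            order = nodes[cursor:] + nodes[:cursor]
            target = next(
                (n for n in order if len(layout[n]) < slots_per_node), None
            )
            if target is None:
                raise RuntimeError(f"no free slot left for {name}")
            layout[target].append(name)
            if policy == "bynode":
                # Next rank of this job goes to the following node.
                cursor = (nodes.index(target) + 1) % len(nodes)
    return layout
```

With four 2-slot nodes and the three mpiexec lines above, "byslot" puts both myprog ranks on node1, while "bynode" puts one myprog rank and one myprog2 rank on each of node1 and node2.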
>>> I think this is possible using ssh because it shouldn't really matter
>>> how many times it spawns, but with something like torque it would get
>>> restricted to a max process launch of 4. We would want the third
>>> mpiexec to block processes and eventually be run on the first
>>> available node allocation that frees up from myprog or myprog2 ....
>>> For example for torque, we had to add the following to osc mpiexec:
>>> ---
>>> Finally, since only one mpiexec can be the master at a time, if your
>>> code setup requires that mpiexec exit to get a result, you can start a
>>> "dummy" mpiexec first in your batch job:
>>>   mpiexec -server
>>> It runs no tasks itself but handles the connections of other transient
>>> mpiexec clients.
>>> It will shut down cleanly when the batch job exits or you may kill the
>>> server explicitly.
>>> If the server is killed with SIGTERM (or HUP or INT), it will exit
>>> with a status of zero if there were no clients connected at the time.
>>> If there were still clients using the server, the server will kill all
>>> their tasks, disconnect from the clients, and exit with status 1.
>>> ---
>>> So a user ran:
>>> mpiexec -server
>>> mpiexec -n 2 myprog
>>> mpiexec -n 2 myprog2
>>> And the server kept track of the allocation ... I would think that the
>>> orted could do this?
>>> Sorry if this sounds confusing ... But I'm sure it will clear up with
>>> any further responses I make. :-) -cdm
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]