Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-30 14:25:48


On Jul 30, 2009, at 11:49 AM, Adams, Brian M wrote:

> Apologies if I'm being confusing; I'm probably trying to get at
> atypical use cases. M and N need not correspond to the number of
> nodes/ppn nor ppn/nodes available. By-node vs. by-slot doesn't much
> matter, as long as in the end I don't oversubscribe any node. By-slot
> might be good for efficiency in some apps, but I can't make a
> general case for it.
>
> I think what you proposed offers some help in the case where N is an
> integer multiple of the number of available nodes, but perhaps not
> in other cases. I must be missing something here, so instead of
> being fully general, perhaps consider a specific case. Suppose we
> have 4 nodes, 8 ppn (32 slots is I think the ompi language). I
> might want to schedule, for example
>
> 1. M=2 simultaneous N=16 processor jobs: Here I believe what you
> suggested will work since N is a multiple of the available number of
> nodes. I could use either npernode 4 or just bynode and I think get
> the same result: an even distribution of tasks. (similar applies
> to, e.g., 8x4, 4x8)

Yes, agreed

>
> 2. M=16 simultaneous N=2 processor jobs: it seems if I use bynode or
> npernode, I would end up with 16 processes on each of the first two
> nodes (similar applies to, e.g., 32x1 or 10x3). Scheduling many
> small jobs is a common problem for us.

>
> 3. M=3 simultaneous N=10 processor jobs: I think we'd end up with
> this distribution (where A-D are nodes and 0-2 are job IDs)
>
> A 0 0 0 1 1 1 2 2 2
> B 0 0 0 1 1 1 2 2 2
> C 0 0 1 1 2 2
> D 0 0 1 1 2 2
>
> where A and B are over-subscribed and there are more than the two
> unused slots I'd expect in the whole allocation.
>
> Again, I can manage all these via a script that partitions the
> machine files, just wondering which scenarios OpenMPI can manage.
>

Have you looked at the relative indexing in 1.3.3? You could specify
any of these layouts in relative-index terms and have one "hostfile"
that would support the 16x2 operation. That would then work for any
allocation.

Your launch script could even just do it, something like this:

mpirun -n 2 -host +n0:1,+n1:1 app
mpirun -n 2 -host +n0:2,+n1:2 app

etc. Obviously, you could compute the relative indexing and just stick
it in as required.
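
For instance, here is a rough (untested) sketch of such a wrapper for
your 16x2 case, assuming 4 nodes with 8 slots each; the counts and "app"
are placeholders, and the +nX:1 arguments just follow the pattern above:

#!/bin/sh
# Sketch only: tile M two-rank jobs across NODES nodes, one rank per node
# per job, by computing the relative node indices for each job. Since the
# concurrent mpiruns don't coordinate with each other, it is this
# arithmetic (not mpirun) that keeps any node from being oversubscribed.
NODES=4     # nodes in the allocation (placeholder)
M=16        # number of simultaneous 2-rank jobs (placeholder)
i=0
while [ $i -lt $M ]; do
  a=$(( (2 * i) % NODES ))        # first relative node for this job
  b=$(( (2 * i + 1) % NODES ))    # second relative node for this job
  mpirun -n 2 -host +n${a}:1,+n${b}:1 app &
  i=$(( i + 1 ))
done
wait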

Likewise, you could use the new "seq" (sequential) mapper to achieve
any desired layout, again utilizing relative indexing to avoid having
to create a special hostfile for each run.
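
As a rough illustration (syntax from memory, so please double-check it):
one of the N=10 jobs in your third case could use a hostfile, say
job0_hosts, containing

+n0
+n0
+n0
+n1
+n1
+n1
+n2
+n2
+n3
+n3

and be launched with something like

mpirun -np 10 -hostfile job0_hosts -mca rmaps seq app

so that rank k lands on the host named on line k. Each job would get its
own small file, arranged so the per-node totals across all three jobs
stay within your 8 slots.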

Note that in all cases, you can specify a -n N that will tell OMPI to
only execute N processes, regardless of what is in the sequential
mapper file or -host.
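
Continuing the hypothetical job0_hosts example above, something like

mpirun -n 4 -hostfile job0_hosts -mca rmaps seq app

should launch only the first four ranks even though the file lists ten
entries.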

If none of those work well, please let me know. I'm happy to create
the required capability, as I'm sure LANL will use it too (I know of
several similar cases here, but the current options seem okay for them).

>
> Thanks!
> Brian
>
>> -----Original Message-----
>> From: users-bounces_at_[hidden]
>> [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
>> Sent: Wednesday, July 29, 2009 4:19 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Multiple mpiexec's within a job
>> (schedule within a scheduled machinefile/job allocation)
>>
>> Oh my - that does take me back a long way! :-)
>>
>> Do you need these processes to be mapped byslot (i.e., do you
>> care if the process ranks are sharing nodes)? If not, why not
>> add "-bynode" to your cmd line?
>>
>> Alternatively, given the mapping you want, just do
>>
>> mpirun -npernode 1 application.exe
>>
>> This would launch one copy on each of your N nodes. So if you
>> fork M times, you'll wind up with the exact pattern you
>> wanted. And, as each one exits, you could immediately launch
>> a replacement without worrying about oversubscription.
>>
>> Does that help?
>> Ralph
>>
>> PS. we dropped that "persistent" operation - caused way too
>> many problems with cleanup and other things. :-)
>>
>> On Jul 29, 2009, at 3:46 PM, Adams, Brian M wrote:
>>
>>> Hi Ralph (all),
>>>
>>> I'm resurrecting this 2006 thread for a status check. The new 1.3.x
>>> machinefile behavior is great (thanks!) -- I can use machinefiles to
>>> manage multiple simultaneous mpiruns within a single torque
>>> allocation (where the hosts are a subset of $PBS_NODEFILE).
>>> However, this requires some careful management of machinefiles.
>>>
>>> I'm curious if OpenMPI now directly supports the behavior I need,
>>> described in general in the quote below. Specifically, given a single
>>> PBS/Torque allocation of M*N processors, I will run a serial program
>>> that will fork M times. Each of the M forked processes
>>> calls 'mpirun -np N application.exe' and blocks until completion.
>>> This seems akin to the case you described of "mpiruns executed in
>>> separate windows/prompts."
>>>
>>> What I'd like to see is the M processes "tiled" across the available
>>> slots, so all M*N processors are used. What I see instead appears at
>>> face value to be the first N resources being oversubscribed M times.
>>>
>>> Also, when one of the forked processes returns, I'd like to be able to
>>> spawn another and have its mpirun schedule on the resources freed by
>>> the previous one that exited. Is any of this possible?
>>>
>>> I tried starting an orted (1.3.3, roughly as you suggested below), but
>>> got this error:
>>>
>>>> orted --daemonize
>>> [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>>> runtime/orte_init.c at line 125
>>> --------------------------------------------------------------------------
>>> It looks like orte_init failed for some reason; your parallel
>>> process is likely to abort. There are many reasons that a parallel
>>> process can fail during orte_init; some of which are due to
>>> configuration or environment problems. This failure appears to be an
>>> internal failure; here's some additional information (which may only
>>> be relevant to an Open MPI developer):
>>>
>>> orte_ess_base_select failed
>>> --> Returned value Not found (-13) instead of ORTE_SUCCESS
>>> --------------------------------------------------------------------------
>>> [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>>> orted/orted_main.c at line 323
>>>
>>> I spared the debugging info as I'm not even sure this is a correct
>>> invocation...
>>>
>>> Thanks for any suggestions you can offer!
>>> Brian
>>> ----------
>>> Brian M. Adams, PhD (briadam_at_[hidden])
>>> Optimization and Uncertainty Quantification
>>> Sandia National Laboratories, Albuquerque, NM
>>> http://www.sandia.gov/~briadam
>>>
>>>
>>>> From: Ralph Castain (rhc_at_[hidden])
>>>> Date: 2006-12-12 00:46:59
>>>>
>>>> Hi Chris
>>>>
>>>>
>>>> Some of this is doable with today's code....and one of these
>>>> behaviors is not. :-(
>>>>
>>>>
>>>> Open MPI/OpenRTE can be run in "persistent" mode - this allows
>>>> multiple jobs to share the same allocation. This works much as you
>>>> describe (syntax is slightly different, of
>>>> course!) - the first mpirun will map using whatever mode was
>>>> requested, then the next mpirun will map starting from where the
>>>> first one left off.
>>>>
>>>>
>>>> I *believe* you can run each mpirun in the background.
>>>> However, I don't know if this has really been tested enough to
>>>> support such a claim. All testing that I know about to-date has
>>>> executed mpirun in the foreground - thus, your example would
>>>> execute sequentially instead of in parallel.
>>>>
>>>>
>>>> I know people have tested multiple mpirun's operating in parallel
>>>> within a single allocation (i.e., persistent mode) where the mpiruns
>>>> are executed in separate windows/prompts.
>>>> So I suspect you could do something like you describe - just haven't
>>>> personally verified it.
>>>>
>>>>
>>>> Where we definitely differ is that Open MPI/RTE will *not* block
>>>> until resources are freed up from the prior mpiruns.
>>>> Instead, we will attempt to execute each mpirun immediately - and
>>>> will error out the one(s) that try to execute without sufficient
>>>> resources. I imagine we could provide the kind of "flow control" you
>>>> describe, but I'm not sure when that might happen.
>>>>
>>>>
>>>> I am (in my copious free time...haha) working on an "orteboot"
>>>> program that will startup a virtual machine to make the persistent
>>>> mode of operation a little easier. For now, though, you can do it by:
>>>>
>>>>
>>>> 1. starting up the "server" using the following command:
>>>> orted --seed --persistent --scope public [--universe foo]
>>>>
>>>>
>>>> 2. do your mpirun commands. They will automagically find the "server"
>>>> and connect to it. If you specified a universe name when starting the
>>>> server, then you must specify the same universe name on your mpirun
>>>> commands.
>>>>
>>>>
>>>> When you are done, you will have to (unfortunately) manually "kill"
>>>> the server and remove its session directory. I have a program called
>>>> "ortehalt" in the trunk that will do this cleanly for you, but it
>>>> isn't yet in the release distributions. You are welcome to use it,
>>>> though, if you are working with the trunk - I can't promise it is
>>>> bulletproof yet, but it seems to be working.
>>>>
>>>>
>>>> Ralph
>>>>
>>>>
>>>> On 12/11/06 8:07 PM, "Maestas, Christopher Daniel"
>>>> <cdmaest_at_[hidden]>
>>>> wrote:
>>>>
>>>>
>>>>> Hello,
>>>>>
>>>>> Sometimes we have users that like to do from within a single job
>>>>> (think schedule within an job scheduler allocation):
>>>>> "mpiexec -n X myprog"
>>>>> "mpiexec -n Y myprog2"
>>>>> Does mpiexec within Open MPI keep track of the node list it is
>>>>> using if it binds to a particular scheduler?
>>>>> For example with 4 nodes (2ppn SMP):
>>>>> "mpiexec -n 2 myprog"
>>>>> "mpiexec -n 2 myprog2"
>>>>> "mpiexec -n 1 myprog3"
>>>>> And assume this is by-slot allocation we would have the following
>>>>> allocation:
>>>>> node1 - processor1 - myprog
>>>>> - processor2 - myprog
>>>>> node2 - processor1 - myprog2
>>>>> - processor2 - myprog2
>>>>> And for a by-node allocation:
>>>>> node1 - processor1 - myprog
>>>>> - processor2 - myprog2
>>>>> node2 - processor1 - myprog
>>>>> - processor2 - myprog2
>>>>>
>>>>> I think this is possible using ssh cause it shouldn't really matter
>>>>> how many times it spawns, but with something like torque it would get
>>>>> restricted to a max process launch of 4. We would want the third
>>>>> mpiexec to block processes and eventually be run on the first
>>>>> available node allocation that frees up from myprog or myprog2 ....
>>>>>
>>>>> For example for torque, we had to add the following to osc mpiexec:
>>>>> ---
>>>>> Finally, since only one mpiexec can be the master at a time, if your
>>>>> code setup requires that mpiexec exit to get a result, you can start
>>>>> a "dummy" mpiexec first in your batch job:
>>>>>
>>>>> mpiexec -server
>>>>>
>>>>> It runs no tasks itself but handles the connections of other
>>>>> transient mpiexec clients.
>>>>> It will shut down cleanly when the batch job exits or you may kill
>>>>> the server explicitly.
>>>>> If the server is killed with SIGTERM (or HUP or INT), it will exit
>>>>> with a status of zero if there were no clients connected at the time.
>>>>> If there were still clients using the server, the server will kill
>>>>> all their tasks, disconnect from the clients, and exit with status 1.
>>>>> ---
>>>>>
>>>>> So a user ran:
>>>>> mpiexec -server
>>>>> mpiexec -n 2 myprog
>>>>> mpiexec -n 2 myprog2
>>>>> And the server kept track of the allocation ... I would think that
>>>>> the orted could do this?
>>>>>
>>>>> Sorry if this sounds confusing ... But I'm sure it will clear up
>>>>> with any further responses I make. :-) -cdm
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>
>>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> users_at_[hidden]
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users