
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-07-30 16:07:33


Let me know how it goes, if you don't mind. It would be nice to know
if we actually met your needs, or if a tweak might help make it easier.

Thanks
Ralph

On Jul 30, 2009, at 1:36 PM, Adams, Brian M wrote:

> Thanks Ralph, I wasn't aware of the relative indexing or sequential
> mapper capabilities. I will check those out and report back if I
> still have a feature request. -- Brian
>
> From: users-bounces_at_[hidden] [mailto:users-bounces_at_[hidden]]
> On Behalf Of Ralph Castain
> Sent: Thursday, July 30, 2009 12:26 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Multiple mpiexec's within a job (schedule
> within a scheduled machinefile/job allocation)
>
>
> On Jul 30, 2009, at 11:49 AM, Adams, Brian M wrote:
>
>> Apologies if I'm being confusing; I'm probably trying to get at
>> atypical use cases. M and N need not correspond to the number of
>> nodes/ppn or ppn/nodes available. By-node vs. by-slot doesn't much
>> matter, as long as in the end I don't oversubscribe any node. By-slot
>> might be good for efficiency in some apps, but I can't make a
>> general case for it.
>>
>> I think what you proposed offers some help in the case where N is
>> an integer multiple of the number of available nodes, but perhaps
>> not in other cases. I must be missing something here, so instead
>> of being fully general, perhaps consider a specific case. Suppose
>> we have 4 nodes, 8 ppn (32 slots, I think, in Open MPI terms). I
>> might want to schedule, for example:
>>
>> 1. M=2 simultaneous N=16-processor jobs: here I believe what you
>> suggested will work, since N is a multiple of the number of available
>> nodes. I could use either -npernode 4 or just -bynode and, I think,
>> get the same result: an even distribution of tasks. (Similar
>> applies to, e.g., 8x4, 4x8.)
>
> Yes, agreed
>
>>
>> 2. M=16 simultaneous N=2-processor jobs: it seems that if I use
>> -bynode or -npernode, I would end up with 16 processes on each of the
>> first two nodes (similar applies to, e.g., 32x1 or 10x3). Scheduling
>> many small jobs is a common problem for us.
>
>>
>> 3. M=3 simultaneous N=10-processor jobs: I think we'd end up with
>> this distribution (where A-D are nodes and 0-2 are jobs):
>>
>> A 0 0 0 1 1 1 2 2 2
>> B 0 0 0 1 1 1 2 2 2
>> C 0 0 1 1 2 2
>> D 0 0 1 1 2 2
>>
>> where A and B are over-subscribed and there are more than the two
>> unused slots I'd expect in the whole allocation.
>>
>> Again, I can manage all these via a script that partitions the
>> machinefiles; I'm just wondering which scenarios Open MPI can manage.
>>
>
> Have you looked at the relative indexing in 1.3.3? You could specify
> any of these in relative index terms, and have one "hostfile" that
> would support 16x2 operations. This would then work for any
> allocation.
>
> Your launch script could even just do it, something like this:
>
> mpirun -n 2 -host +n0:1,+n1:1 app
> mpirun -n 2 -host +n0:2,+n1:2 app
>
> etc. Obviously, you could compute the relative indexing and just
> stick it in as required.
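>
> For instance, for the 16x2 case, something like this untested sketch
> might do it (assuming a 4-node/8-slot allocation, and assuming that
> listing a relative host twice in -host gives you two slots on it;
> "app" is just a placeholder):
>
>   #!/bin/sh
>   # 16 two-process jobs, four per 8-slot node: job j lands on relative node j/4
>   for j in $(seq 0 15); do
>     node=$(( j / 4 ))
>     mpirun -n 2 -host +n${node},+n${node} app &   # launch each job in the background
>   done
>   wait   # block until all 16 mpiruns have finished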
>
> Likewise, you could use the new "seq" (sequential) mapper to achieve
> any desired layout, again utilizing relative indexing to avoid
> having to create a special hostfile for each run.
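>
> A rough, untested sketch of the seq approach (I'm writing the option
> names from memory, so double-check them): put one host per line, one
> line per rank, using relative indexing, e.g.
>
>   # ranks.txt -- one line per process, in rank order
>   +n0
>   +n0
>   +n1
>
> and then something like
>
>   mpirun -mca rmaps seq -hostfile ranks.txt app
>
> which would put ranks 0 and 1 on the first allocated node and rank 2
> on the second.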
>
> Note that in all cases, you can specify a -n N that will tell OMPI
> to only execute N processes, regardless of what is in the sequential
> mapper file or -host.
>
> If none of those work well, please let me know. I'm happy to create
> the required capability as I'm sure LANL will use it too (know of
> several similar cases here, but the current options seem okay for
> them).
>
>>
>> Thanks!
>> Brian
>>
>>> -----Original Message-----
>>> From: users-bounces_at_[hidden]
>>> [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
>>> Sent: Wednesday, July 29, 2009 4:19 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Multiple mpiexec's within a job
>>> (schedule within a scheduled machinefile/job allocation)
>>>
>>> Oh my - that does take me back a long way! :-)
>>>
>>> Do you need these processes to be mapped byslot (i.e., do you
>>> care if the process ranks are sharing nodes)? If not, why not
>>> add "-bynode" to your cmd line?
>>>
>>> Alternatively, given the mapping you want, just do
>>>
>>> mpirun -npernode 1 application.exe
>>>
>>> This would launch one copy on each of your N nodes. So if you
>>> fork M times, you'll wind up with the exact pattern you
>>> wanted. And, as each one exits, you could immediately launch
>>> a replacement without worrying about oversubscription.
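>>>
>>> To make that concrete, here's an untested sketch of the forking side
>>> (the wrapper loop is my own guess; only "-npernode 1" comes from the
>>> suggestion above):
>>>
>>>   #!/bin/sh
>>>   M=8   # number of simultaneous runs you want
>>>   for i in $(seq 1 $M); do
>>>     mpirun -npernode 1 application.exe &   # one copy on each node, per run
>>>   done
>>>   wait   # as each mpirun exits, its slots are free for a replacement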
>>>
>>> Does that help?
>>> Ralph
>>>
>>> PS. we dropped that "persistent" operation - caused way too
>>> many problems with cleanup and other things. :-)
>>>
>>> On Jul 29, 2009, at 3:46 PM, Adams, Brian M wrote:
>>>
>>>> Hi Ralph (all),
>>>>
>>>> I'm resurrecting this 2006 thread for a status check. The new 1.3.x
>>>> machinefile behavior is great (thanks!) -- I can use machinefiles to
>>>> manage multiple simultaneous mpiruns within a single torque
>>>> allocation (where the hosts are a subset of $PBS_NODEFILE).
>>>> However, this requires some careful management of machinefiles.
>>>>
>>>> I'm curious if OpenMPI now directly supports the behavior I need,
>>>> described in general in the quote below. Specifically, given a single
>>>> PBS/Torque allocation of M*N processors, I will run a serial program
>>>> that will fork M times. Each of the M forked processes
>>>> calls 'mpirun -np N application.exe' and blocks until completion.
>>>> This seems akin to the case you described of "mpiruns executed in
>>>> separate windows/prompts."
>>>>
>>>> What I'd like to see is the M processes "tiled" across the available
>>>> slots, so all M*N processors are used. What I see instead appears at
>>>> face value to be the first N resources being oversubscribed M times.
>>>>
>>>> Also, when one of the forked processes returns, I'd like to be able
>>>> to spawn another and have its mpirun schedule on the resources freed
>>>> by the previous one that exited. Is any of this possible?
>>>>
>>>> I tried starting an orted (1.3.3, roughly as you suggested below),
>>>> but got this error:
>>>>
>>>>> orted --daemonize
>>>> [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>>>> runtime/orte_init.c at line 125
>>>> --------------------------------------------------------------------------
>>>> It looks like orte_init failed for some reason; your parallel
>>>> process is likely to abort. There are many reasons that a parallel
>>>> process can fail during orte_init; some of which are due to
>>>> configuration or environment problems. This failure appears to be an
>>>> internal failure; here's some additional information (which may only
>>>> be relevant to an Open MPI developer):
>>>>
>>>> orte_ess_base_select failed
>>>> --> Returned value Not found (-13) instead of ORTE_SUCCESS
>>>> --------------------------------------------------------------------------
>>>> [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
>>>> orted/orted_main.c at line 323
>>>>
>>>> I spared the debugging info as I'm not even sure this is a correct
>>>> invocation...
>>>>
>>>> Thanks for any suggestions you can offer!
>>>> Brian
>>>> ----------
>>>> Brian M. Adams, PhD (briadam_at_[hidden])
>>>> Optimization and Uncertainty Quantification
>>>> Sandia National Laboratories, Albuquerque, NM
>>>> http://www.sandia.gov/~briadam
>>>>
>>>>
>>>>> From: Ralph Castain (rhc_at_[hidden])
>>>>> Date: 2006-12-12 00:46:59
>>>>>
>>>>> Hi Chris
>>>>>
>>>>>
>>>>> Some of this is doable with today's code....and one of these
>>>>> behaviors is not. :-(
>>>>>
>>>>>
>>>>> Open MPI/OpenRTE can be run in "persistent" mode - this allows
>>>>> multiple jobs to share the same allocation. This works much as you
>>>>> describe (syntax is slightly different, of
>>>>> course!) - the first mpirun will map using whatever mode was
>>>>> requested, then the next mpirun will map starting from where the
>>>>> first one left off.
>>>>>
>>>>>
>>>>> I *believe* you can run each mpirun in the background.
>>>>> However, I don't know if this has really been tested enough to
>>>>> support such a claim. All testing that I know about to date has
>>>>> executed mpirun in the foreground - thus, your example would
>>>>> execute sequentially instead of in parallel.
>>>>>
>>>>>
>>>>> I know people have tested multiple mpirun's operating in parallel
>>>>> within a single allocation (i.e., persistent mode) where the mpiruns
>>>>> are executed in separate windows/prompts.
>>>>> So I suspect you could do something like you describe - just haven't
>>>>> personally verified it.
>>>>>
>>>>>
>>>>> Where we definitely differ is that Open MPI/RTE will *not* block
>>>>> until resources are freed up from the prior mpiruns.
>>>>> Instead, we will attempt to execute each mpirun immediately - and
>>>>> will error out the one(s) that try to execute without sufficient
>>>>> resources. I imagine we could provide the kind of "flow control" you
>>>>> describe, but I'm not sure when that might happen.
>>>>>
>>>>>
>>>>> I am (in my copious free time...haha) working on an "orteboot"
>>>>> program that will start up a virtual machine to make the persistent
>>>>> mode of operation a little easier. For now, though, you can do it by:
>>>>>
>>>>>
>>>>> 1. starting up the "server" using the following command:
>>>>> orted --seed --persistent --scope public [--universe foo]
>>>>>
>>>>>
>>>>> 2. do your mpirun commands. They will automagically find the "server"
>>>>> and connect to it. If you specified a universe name when starting the
>>>>> server, then you must specify the same universe name on your mpirun
>>>>> commands.
>>>>>
>>>>>
>>>>> When you are done, you will have to (unfortunately) manually "kill"
>>>>> the server and remove its session directory. I have a program called
>>>>> "ortehalt" in the trunk that will do this cleanly for you, but it
>>>>> isn't yet in the release distributions. You are welcome to use it,
>>>>> though, if you are working with the trunk - I can't promise it is
>>>>> bulletproof yet, but it seems to be working.
>>>>>
>>>>>
>>>>> Ralph
>>>>>
>>>>>
>>>>> On 12/11/06 8:07 PM, "Maestas, Christopher Daniel"
>>>>> <cdmaest_at_[hidden]>
>>>>> wrote:
>>>>>
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Sometimes we have users that like to do the following from within a
>>>>>> single job (think: scheduling within a job scheduler allocation):
>>>>>> "mpiexec -n X myprog"
>>>>>> "mpiexec -n Y myprog2"
>>>>>> Does mpiexec within Open MPI keep track of the node list it is using
>>>>>> if it binds to a particular scheduler?
>>>>>> For example, with 4 nodes (2ppn SMP):
>>>>>> "mpiexec -n 2 myprog"
>>>>>> "mpiexec -n 2 myprog2"
>>>>>> "mpiexec -n 1 myprog3"
>>>>>> And assuming this is a by-slot allocation, we would have the
>>>>>> following layout:
>>>>>> node1 - processor1 - myprog
>>>>>>       - processor2 - myprog
>>>>>> node2 - processor1 - myprog2
>>>>>>       - processor2 - myprog2
>>>>>> And for a by-node allocation:
>>>>>> node1 - processor1 - myprog
>>>>>>       - processor2 - myprog2
>>>>>> node2 - processor1 - myprog
>>>>>>       - processor2 - myprog2
>>>>>>
>>>>>> I think this is possible using ssh, because it shouldn't really
>>>>>> matter how many times it spawns, but with something like torque it
>>>>>> would get restricted to a max process launch of 4. We would want the
>>>>>> third mpiexec to block and eventually be run on the first available
>>>>>> node allocation that frees up from myprog or myprog2 ....
>>>>>>
>>>>>> For example for torque, we had to add the following to osc mpiexec:
>>>>>> ---
>>>>>> Finally, since only one mpiexec can be the master at a time, if your
>>>>>> code setup requires that mpiexec exit to get a result, you can start
>>>>>> a "dummy" mpiexec first in your batch job:
>>>>>>
>>>>>> mpiexec -server
>>>>>>
>>>>>> It runs no tasks itself but handles the connections of other
>>>>>> transient mpiexec clients.
>>>>>> It will shut down cleanly when the batch job exits or you may kill
>>>>>> the server explicitly.
>>>>>> If the server is killed with SIGTERM (or HUP or INT), it will exit
>>>>>> with a status of zero if there were no clients connected at the time.
>>>>>> If there were still clients using the server, the server will kill
>>>>>> all their tasks, disconnect from the clients, and exit with status 1.
>>>>>> ---
>>>>>>
>>>>>> So a user ran:
>>>>>> mpiexec -server
>>>>>> mpiexec -n 2 myprog
>>>>>> mpiexec -n 2 myprog2
>>>>>> And the server kept track of the allocation ... I would think that
>>>>>> the orted could do this?
>>>>>>
>>>>>> Sorry if this sounds confusing ... But I'm sure it will clear up
>>>>>> with any further responses I make. :-) -cdm