Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)
From: Adams, Brian M (briadam_at_[hidden])
Date: 2009-07-30 13:49:15


Apologies if I'm being confusing; I'm probably trying to get at atypical use cases. M and N need not correspond to the number of nodes or the ppn available. By-node vs. by-slot mapping doesn't much matter, as long as in the end I don't oversubscribe any node. By-slot might be better for efficiency in some apps, but I can't make a general case for it.

I think what you proposed helps when N is an integer multiple of the number of available nodes, but perhaps not in other cases. I must be missing something here, so instead of being fully general, consider a specific case. Suppose we have 4 nodes with 8 ppn (32 slots, I think, in Open MPI terms). I might want to schedule, for example:

1. M=2 simultaneous N=16-processor jobs: here I believe what you suggested will work, since N is a multiple of the number of available nodes. I could use either -npernode 4 or just -bynode and, I think, get the same result: an even distribution of tasks. (Similar reasoning applies to, e.g., 8x4 and 4x8.)

2. M=16 simultaneous N=2-processor jobs: it seems that if I use -bynode or -npernode, I would end up with 16 processes on each of the first two nodes. (Similar applies to, e.g., 32x1 and 10x3.) Scheduling many small jobs is a common problem for us.

3. M=3 simultaneous N=10-processor jobs: I think we'd end up with this distribution (where A-D are nodes and 0-2 are jobs):

A 0 0 0 1 1 1 2 2 2
B 0 0 0 1 1 1 2 2 2
C 0 0 1 1 2 2
D 0 0 1 1 2 2

where A and B are oversubscribed, and there are more than the two unused slots I'd expect in the whole allocation.

Again, I can manage all of these via a script that partitions the machinefiles (a sketch follows); I'm just wondering which scenarios Open MPI can manage directly.
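
For concreteness, a minimal version of that partitioning script might look like this (untested as written; M, N, and application.exe are placeholders, and under Torque $PBS_NODEFILE lists one line per slot):

  #!/bin/sh
  # Split the allocation's nodefile into machinefiles of N slots each,
  # then launch one backgrounded mpirun per machinefile, M jobs total.
  M=3
  N=10
  split -l "$N" -d "$PBS_NODEFILE" machines.
  count=0
  for f in machines.*; do
      [ "$count" -ge "$M" ] && break
      mpirun --machinefile "$f" -np "$N" application.exe &
      count=$((count + 1))
  done
  wait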

Thanks!
Brian

> -----Original Message-----
> From: users-bounces_at_[hidden]
> [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
> Sent: Wednesday, July 29, 2009 4:19 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Multiple mpiexec's within a job
> (schedule within a scheduled machinefile/job allocation)
>
> Oh my - that does take me back a long way! :-)
>
> Do you need these processes to be mapped byslot (i.e., do you
> care if the process ranks are sharing nodes)? If not, why not
> add "-bynode" to your cmd line?
>
> Alternatively, given the mapping you want, just do
>
> mpirun -npernode 1 application.exe
>
> This would launch one copy on each of your N nodes. So if you
> fork M times, you'll wind up with the exact pattern you
> wanted. And, as each one exits, you could immediately launch
> a replacement without worrying about oversubscription.
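>
> For example, something along these lines (just a sketch, untested;
> M=4 and application.exe are placeholders):
>
>   M=4
>   for i in $(seq 1 "$M"); do
>       mpirun -npernode 1 application.exe &
>   done
>   wait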
>
> Does that help?
> Ralph
>
> PS. we dropped that "persistent" operation - caused way too
> many problems with cleanup and other things. :-)
>
> On Jul 29, 2009, at 3:46 PM, Adams, Brian M wrote:
>
> > Hi Ralph (all),
> >
> > I'm resurrecting this 2006 thread for a status check. The new
> > 1.3.x machinefile behavior is great (thanks!) -- I can use
> > machinefiles to manage multiple simultaneous mpiruns within a
> > single Torque allocation (where the hosts are a subset of
> > $PBS_NODEFILE). However, this requires some careful management of
> > machinefiles.
> >
> > I'm curious if Open MPI now directly supports the behavior I need,
> > described in general in the quote below. Specifically, given a
> > single PBS/Torque allocation of M*N processors, I will run a
> > serial program that forks M times. Each of the M forked processes
> > calls 'mpirun -np N application.exe' and blocks until completion.
> > This seems akin to the case you described of "mpiruns executed in
> > separate windows/prompts."
> >
> > What I'd like to see is the M processes "tiled" across the
> > available slots, so all M*N processors are used. What I see
> > instead appears at face value to be the first N resources being
> > oversubscribed M times.
> >
> > Also, when one of the forked processes returns, I'd like to be
> > able to spawn another and have its mpirun schedule on the
> > resources freed by the previous one that exited. Is any of this
> > possible?
> >
> > I tried starting an orted (1.3.3, roughly as you suggested
> > below), but got this error:
> >
> >> orted --daemonize
> > [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> > runtime/orte_init.c at line 125
> > --------------------------------------------------------------------------
> > It looks like orte_init failed for some reason; your parallel
> > process is likely to abort. There are many reasons that a parallel
> > process can fail during orte_init; some of which are due to
> > configuration or environment problems. This failure appears to be
> > an internal failure; here's some additional information (which may
> > only be relevant to an Open MPI developer):
> >
> >   orte_ess_base_select failed
> >   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> > --------------------------------------------------------------------------
> > [gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file
> > orted/orted_main.c at line 323
> >
> > I spared the debugging info as I'm not even sure this is a correct
> > invocation...
> >
> > Thanks for any suggestions you can offer!
> > Brian
> > ----------
> > Brian M. Adams, PhD (briadam_at_[hidden])
> > Optimization and Uncertainty Quantification
> > Sandia National Laboratories, Albuquerque, NM
> > http://www.sandia.gov/~briadam
> >
> >
> >> From: Ralph Castain (rhc_at_[hidden])
> >> Date: 2006-12-12 00:46:59
> >>
> >> Hi Chris
> >>
> >>
> >> Some of this is doable with today's code....and one of these
> >> behaviors is not. :-(
> >>
> >>
> >> Open MPI/OpenRTE can be run in "persistent" mode - this allows
> >> multiple jobs to share the same allocation. This works much as you
> >> describe (syntax is slightly different, of
> >> course!) - the first mpirun will map using whatever mode was
> >> requested, then the next mpirun will map starting from where the
> >> first one left off.
> >>
> >>
> >> I *believe* you can run each mpirun in the background.
> >> However, I don't know if this has really been tested enough to
> >> support such a claim. All testing that I know about to date has
> >> executed mpirun in the foreground - thus, your example would
> >> execute sequentially instead of in parallel.
> >>
> >>
> >> I know people have tested multiple mpiruns operating in parallel
> >> within a single allocation (i.e., persistent mode) where the
> >> mpiruns are executed in separate windows/prompts. So I suspect
> >> you could do something like you describe - I just haven't
> >> personally verified it.
> >>
> >>
> >> Where we definitely differ is that Open MPI/RTE will *not* block
> >> until resources are freed up from the prior mpiruns. Instead, we
> >> will attempt to execute each mpirun immediately - and will error
> >> out the one(s) that try to execute without sufficient resources.
> >> I imagine we could provide the kind of "flow control" you
> >> describe, but I'm not sure when that might happen.
> >>
> >>
> >> I am (in my copious free time...haha) working on an "orteboot"
> >> program that will start up a virtual machine to make the
> >> persistent mode of operation a little easier. For now, though,
> >> you can do it by:
> >>
> >>
> >> 1. starting up the "server" using the following command:
> >> orted --seed --persistent --scope public [--universe foo]
> >>
> >>
> >> 2. do your mpirun commands. They will automagically find the
> >> "server" and connect to it. If you specified a universe name when
> >> starting the server, then you must specify the same universe name
> >> on your mpirun commands.
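> >>
> >> Putting the two steps together, a session might look like this
> >> (just a sketch, not verified; "foo", app1/app2, and the -np
> >> counts are placeholders):
> >>
> >>   orted --seed --persistent --scope public --universe foo
> >>   mpirun --universe foo -np 16 app1 &
> >>   mpirun --universe foo -np 16 app2 &
> >>   wait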
> >>
> >>
> >> When you are done, you will have to (unfortunately) manually
> >> "kill" the server and remove its session directory. I have a
> >> program called "ortehalt" in the trunk that will do this cleanly
> >> for you, but it isn't yet in the release distributions. You are
> >> welcome to use it, though, if you are working with the trunk - I
> >> can't promise it is bulletproof yet, but it seems to be working.
> >>
> >>
> >> Ralph
> >>
> >>
> >> On 12/11/06 8:07 PM, "Maestas, Christopher Daniel"
> >> <cdmaest_at_[hidden]>
> >> wrote:
> >>
> >>
> >>> Hello,
> >>>
> >>> Sometimes we have users that like to do, from within a single
> >>> job (think scheduling within a job scheduler allocation):
> >>> "mpiexec -n X myprog"
> >>> "mpiexec -n Y myprog2"
> >>> Does mpiexec within Open MPI keep track of the node list it is
> >>> using if it binds to a particular scheduler?
> >>> For example with 4 nodes (2ppn SMP):
> >>> "mpiexec -n 2 myprog"
> >>> "mpiexec -n 2 myprog2"
> >>> "mpiexec -n 1 myprog3"
> >>> And assuming by-slot allocation, we would have the following:
> >>> node1 - processor1 - myprog
> >>>       - processor2 - myprog
> >>> node2 - processor1 - myprog2
> >>>       - processor2 - myprog2
> >>> And for a by-node allocation:
> >>> node1 - processor1 - myprog
> >>>       - processor2 - myprog2
> >>> node2 - processor1 - myprog
> >>>       - processor2 - myprog2
> >>>
> >>> I think this is possible using ssh because it shouldn't really
> >>> matter how many times it spawns, but with something like Torque
> >>> it would get restricted to a max process launch of 4. We would
> >>> want the third mpiexec to block and eventually be run on the
> >>> first available node allocation that frees up from myprog or
> >>> myprog2 ....
> >>>
> >>> For example, for Torque, we had to add the following to OSC
> >>> mpiexec:
> >>> ---
> >>> Finally, since only one mpiexec can be the master at a
> >> time, if your
> >>> code setup requires that mpiexec exit to get a result, you
> >> can start a
> >>> "dummy"
> >>> mpiexec first in your batch
> >>> job:
> >>>
> >>> mpiexec -server
> >>>
> >>> It runs no tasks itself but handles the connections of
> >> other transient
> >>> mpiexec clients.
> >>> It will shut down cleanly when the batch job exits or you
> >> may kill the
> >>> server explicitly.
> >>> If the server is killed with SIGTERM (or HUP or INT), it
> will exit
> >>> with a status of zero if there were no clients connected at
> >> the time.
> >>> If there were still clients using the server, the server
> >> will kill all
> >>> their tasks, disconnect from the clients, and exit with status 1.
> >>> ---
> >>>
> >>> So a user ran:
> >>> mpiexec -server
> >>> mpiexec -n 2 myprog
> >>> mpiexec -n 2 myprog2
> >>> And the server kept track of the allocation ... I would think
> >>> that the orted could do this?
> >>>
> >>> Sorry if this sounds confusing ... But I'm sure it will clear
> >>> up with any further responses I make. :-)
> >>> -cdm