On Jul 30, 2009, at 11:49 AM, Adams, Brian M wrote:
Apologies if I'm being confusing; I'm probably trying to get at
atypical use cases. M and N need not correspond to the
number of nodes/ppn nor ppn/nodes available. By node vs. slot
doesn't much matter, as long as in the end I don't oversubscribe any
node. By slot might be good for efficiency in some apps, but I
can't make a general case for it.
I think what you proposed offers some help in the case where N is an
integer multiple of the number of available nodes, but perhaps not in
other cases. I must be missing something here, so instead of being
fully general, perhaps consider a specific case. Suppose we have 4
nodes, 8 ppn (32 slots is, I think, the ompi language). I might want to
schedule, for example:

1. M=2 simultaneous N=16 processor jobs: Here I believe what you
suggested will work since N is a multiple of the available number of
nodes. I could use either npernode 4 or just bynode and, I think, get
the same result: an even distribution of tasks. (Similar applies to,
e.g., 8x4, 4x8.)
Yes, agreed
3. M=3 simultaneous, N=10 processor jobs: I think we'd end up with this
distribution (where A-D are nodes and 0-2 jobs)

A 0 0 0 1 1 1 2 2 2
B 0 0 0 1 1 1 2 2 2
C 0 0 1 1 2 2
D 0 0 1 1 2 2

where A and B are over-subscribed and there are more than the two
unused slots I'd expect in the whole allocation.
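The distribution above can be reproduced with a small model. This is only a sketch of the arithmetic, not of Open MPI's mapper: it assumes each of the M mpiruns maps by node, starts at the first node, and knows nothing about the other runs. The function name `map_bynode` is made up for illustration.

```python
from collections import Counter

def map_bynode(nodes, n_procs):
    """Round-robin n_procs ranks over nodes, always starting at nodes[0]."""
    return Counter(nodes[rank % len(nodes)] for rank in range(n_procs))

nodes = ["A", "B", "C", "D"]
slots_per_node = 8          # 8 ppn, as in the example above

load = Counter()
for job in range(3):        # M = 3 independent mpiruns
    load.update(map_bynode(nodes, 10))   # N = 10 processes each

for node in nodes:
    status = "over-subscribed" if load[node] > slots_per_node else "ok"
    print(node, load[node], status)

# Slots left idle on nodes that are not full.
idle = sum(max(slots_per_node - load[n], 0) for n in nodes)
print("idle slots:", idle)
```

Under these assumptions A and B each carry 9 processes on 8 slots, while C and D leave 4 slots idle between them, which is the mismatch described above.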
Again, I can manage all these via a script that partitions the machine
files; I'm just wondering which scenarios OpenMPI can manage.
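For reference, the kind of partitioning script mentioned here might look roughly like the following. It is a hedged sketch, not the script actually used: `partition_slots` is a hypothetical helper, and the node names stand in for what would normally be read from $PBS_NODEFILE.

```python
def partition_slots(nodes, slots_per_node, m_jobs, n_procs):
    """Split the flat slot list into m_jobs consecutive chunks of n_procs."""
    slots = [node for node in nodes for _ in range(slots_per_node)]
    if m_jobs * n_procs > len(slots):
        raise ValueError("allocation too small for M*N processes")
    return [slots[j * n_procs:(j + 1) * n_procs] for j in range(m_jobs)]

# 4 nodes x 8 ppn, M=3 jobs of N=10 processes (case 3 above).
chunks = partition_slots(["A", "B", "C", "D"], 8, m_jobs=3, n_procs=10)
for job, chunk in enumerate(chunks):
    # One host name per slot is what a per-job machinefile would list.
    print(f"job {job}:", " ".join(chunk))
```

Each chunk could then be written to its own machinefile and handed to the corresponding mpirun. With this layout no node is over-subscribed and exactly two slots (on D) stay unused.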
Have you looked at the relative indexing in 1.3.3? You could specify
any of these in relative index terms, and have one "hostfile" that would
support 16x2 operations. This would work then for any allocation.
Your launch script could even just do it, something like this:
mpirun -n 2 -host +n0:1,+n1:1 app
mpirun -n 2 -host +n0:2,+n1:2 app
etc. Obviously, you could compute the relative indexing and just
stick it in as required.
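A launch script could compute those command lines along these lines. This is a sketch only: the helper names are invented, and the `+nX:Y` form is copied from the two commands above rather than checked against the mpirun man page, so verify the exact syntax for the installed version before relying on it.

```python
def relative_host_arg(node_indices, per_node):
    """Build a -host value in the '+n<index>:<count>' form shown above."""
    return ",".join(f"+n{i}:{per_node}" for i in node_indices)

def mpirun_command(n_procs, node_indices, per_node, app="app"):
    """Return the argv list for one mpirun using relative node indices."""
    return ["mpirun", "-n", str(n_procs),
            "-host", relative_host_arg(node_indices, per_node), app]

# Reproduce the two example commands from the message.
for per_node in (1, 2):
    print(" ".join(mpirun_command(2, [0, 1], per_node)))
```

Because the indices are relative to whatever allocation the job receives, the same script would work unchanged across allocations.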
Likewise, you could use the new "seq" (sequential) mapper to achieve
any desired layout, again utilizing relative indexing to avoid having to
create a special hostfile for each run.
Note that in all cases, you can specify a -n N that will tell OMPI to
only execute N processes, regardless of what is in the sequential mapper
file or -host.
If none of those work well, please let me know. I'm happy to create
the required capability as I'm sure LANL will use it too (know of several
similar cases here, but the current options seem okay for them).
Thanks!
Brian
-----Original Message-----
From: users-bounces@open-mpi.org [mailto:users-bounces@open-mpi.org]
On Behalf Of Ralph Castain
Sent: Wednesday, July 29, 2009 4:19 PM
To: Open MPI Users
Subject: Re: [OMPI users] Multiple mpiexec's within a job
(schedule within a scheduled machinefile/job allocation)
Oh my - that does take me back a long way! :-)

Do you need these processes to be mapped byslot (i.e., do you care if
the process ranks are sharing nodes)? If not, why not add "-bynode" to
your cmd line?

Alternatively, given the mapping you want, just do

mpirun -npernode 1 application.exe

This would launch one copy on each of your N nodes. So if you fork M
times, you'll wind up with the exact pattern you wanted. And, as each
one exits, you could immediately launch a replacement without worrying
about oversubscription.
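The "launch a replacement as each one exits" pattern could be driven by a small pool loop like the one below. It is a minimal sketch under stated assumptions: `run_pool` is a hypothetical helper, and a trivial Python child process stands in for the real `mpirun -npernode 1 application.exe` command so the example runs anywhere.

```python
import subprocess
import sys
import time

def run_pool(commands, max_concurrent):
    """Run commands with at most max_concurrent alive; refill as each exits."""
    pending = list(commands)
    running = []
    exit_codes = []
    while pending or running:
        # Top up to the concurrency limit (M simultaneous launches).
        while pending and len(running) < max_concurrent:
            running.append(subprocess.Popen(pending.pop(0)))
        # poll() returns None while a child is still alive.
        still_running = []
        for proc in running:
            code = proc.poll()
            if code is None:
                still_running.append(proc)
            else:
                exit_codes.append(code)
        running = still_running
        time.sleep(0.05)
    return exit_codes

# Placeholder for ["mpirun", "-npernode", "1", "application.exe"].
placeholder = [sys.executable, "-c", "pass"]
codes = run_pool([placeholder] * 5, max_concurrent=2)
print(len(codes), "runs finished")
```

Polling keeps the sketch short; a production launcher would more likely block on child exit instead of sleeping.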
Does that help?
Ralph

PS. we dropped that "persistent" operation - caused way too many
problems with cleanup and other things. :-)
On Jul 29, 2009, at 3:46 PM, Adams, Brian M wrote:

Hi Ralph (all),

I'm resurrecting this 2006 thread for a status check. The new 1.3.x
machinefile behavior is great (thanks!) -- I can use machinefiles to
manage multiple simultaneous mpiruns within a single torque allocation
(where the hosts are a subset of $PBS_NODEFILE). However, this requires
some careful management of machinefiles.

I'm curious if OpenMPI now directly supports the behavior I need,
described in general in the quote below. Specifically, given a single
PBS/Torque allocation of M*N processors, I will run a serial program
that will fork M times. Each of the M forked processes calls
'mpirun -np N application.exe' and blocks until completion. This seems
akin to the case you described of "mpiruns executed in separate
windows/prompts."

What I'd like to see is the M processes "tiled" across the available
slots, so all M*N processors are used. What I see instead appears at
face value to be the first N resources being oversubscribed M times.

Also, when one of the forked processes returns, I'd like to be able to
spawn another and have its mpirun schedule on the resources freed by
the previous one that exited. Is any of this possible?
I tried starting an orted (1.3.3, roughly as you suggested below), but
got this error:

orted --daemonize
[gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 125
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

orte_ess_base_select failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orted/orted_main.c at line 323

I spared the debugging info as I'm not even sure this is a correct
invocation...
Thanks for any suggestions you can offer!
Brian
----------
Brian M. Adams, PhD (briadam@sandia.gov)
Optimization and Uncertainty Quantification
Sandia National Laboratories, Albuquerque, NM
http://www.sandia.gov/~briadam
From: Ralph Castain (rhc_at_[hidden])
Date: 2006-12-12 00:46:59

Hi Chris

Some of this is doable with today's code....and one of these behaviors
is not. :-(
Open MPI/OpenRTE can be run in "persistent" mode - this allows multiple
jobs to share the same allocation. This works much as you describe
(syntax is slightly different, of course!) - the first mpirun will map
using whatever mode was requested, then the next mpirun will map
starting from where the first one left off.

I *believe* you can run each mpirun in the background. However, I don't
know if this has really been tested enough to support such a claim. All
testing that I know about to-date has executed mpirun in the foreground
- thus, your example would execute sequentially instead of in parallel.

I know people have tested multiple mpirun's operating in parallel
within a single allocation (i.e., persistent mode) where the mpiruns
are executed in separate windows/prompts. So I suspect you could do
something like you describe - just haven't personally verified it.
Where we definitely differ is that Open MPI/RTE will *not* block until
resources are freed up from the prior mpiruns. Instead, we will attempt
to execute each mpirun immediately - and will error out the one(s) that
try to execute without sufficient resources. I imagine we could provide
the kind of "flow control" you describe, but I'm not sure when that
might happen.
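Until something like that exists in Open MPI itself, the blocking behavior can be approximated in the launch script. The following is a sketch of that idea only, not an Open MPI feature: `SlotPool` is a hypothetical helper that counts free slots and makes a caller wait until enough are available, and the job body is a stand-in for an actual mpirun call.

```python
import threading

class SlotPool:
    """Counts free slots; acquire() blocks until enough are available."""

    def __init__(self, total):
        self._total = total
        self._free = total
        self._cond = threading.Condition()

    def acquire(self, n):
        if n > self._total:
            raise ValueError("request exceeds the whole allocation")
        with self._cond:
            # Sleep until at least n slots are free, then claim them.
            self._cond.wait_for(lambda: self._free >= n)
            self._free -= n

    def release(self, n):
        with self._cond:
            self._free += n
            self._cond.notify_all()

    @property
    def free(self):
        with self._cond:
            return self._free

pool = SlotPool(4)          # e.g. a launch limit of 4 processes
finished = []

def job(name, n):
    pool.acquire(n)         # blocks here instead of erroring out
    try:
        finished.append(name)   # stand-in for running mpirun -n <n> <name>
    finally:
        pool.release(n)

jobs = [("myprog", 2), ("myprog2", 2), ("myprog3", 1)]
threads = [threading.Thread(target=job, args=j) for j in jobs]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(finished), "free slots:", pool.free)
```

The third job simply waits whenever the first two hold all four slots, which is the behavior asked for in the original question.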
I am (in my copious free time...haha) working on an "orteboot" program
that will start up a virtual machine to make the persistent mode of
operation a little easier. For now, though, you can do it by:

1. starting up the "server" using the following command:

orted --seed --persistent --scope public [--universe foo]

2. do your mpirun commands. They will automagically find the "server"
and connect to it. If you specified a universe name when starting the
server, then you must specify the same universe name on your mpirun
commands.

When you are done, you will have to (unfortunately) manually "kill" the
server and remove its session directory. I have a program called
"ortehalt" in the trunk that will do this cleanly for you, but it isn't
yet in the release distributions. You are welcome to use it, though, if
you are working with the trunk - I can't promise it is bulletproof yet,
but it seems to be working.

Ralph
On 12/11/06 8:07 PM, "Maestas, Christopher Daniel" <cdmaest_at_[hidden]> wrote:

Hello,

Sometimes we have users that like to do the following from within a
single job (think schedule within a job scheduler allocation):

"mpiexec -n X myprog"
"mpiexec -n Y myprog2"

Does mpiexec within Open MPI keep track of the node list it is using if
it binds to a particular scheduler?

For example with 4 nodes (2ppn SMP):

"mpiexec -n 2 myprog"
"mpiexec -n 2 myprog2"
"mpiexec -n 1 myprog3"

And assume this is by-slot allocation we would have the following
allocation:

node1 - processor1 - myprog
      - processor2 - myprog
node2 - processor1 - myprog2
      - processor2 - myprog2

And for a by-node allocation:

node1 - processor1 - myprog
      - processor2 - myprog2
node2 - processor1 - myprog
      - processor2 - myprog2

I think this is possible using ssh because it shouldn't really matter
how many times it spawns, but with something like torque it would get
restricted to a max process launch of 4. We would want the third
mpiexec to block processes and eventually be run on the first available
node allocation that frees up from myprog or myprog2 ....

For example for torque, we had to add the following to osc mpiexec:
---
Finally, since only one mpiexec can be the master at a time, if your
code setup requires that mpiexec exit to get a result, you can start a
"dummy" mpiexec first in your batch job:

mpiexec -server

It runs no tasks itself but handles the connections of other transient
mpiexec clients. It will shut down cleanly when the batch job exits or
you may kill the server explicitly. If the server is killed with
SIGTERM (or HUP or INT), it will exit with a status of zero if there
were no clients connected at the time. If there were still clients
using the server, the server will kill all their tasks, disconnect from
the clients, and exit with status 1.
---
So a user ran:

mpiexec -server
mpiexec -n 2 myprog
mpiexec -n 2 myprog2

And the server kept track of the allocation ... I would think that the
orted could do this?

Sorry if this sounds confusing ... But I'm sure it will clear up with
any further responses I make. :-)
-cdm
_______________________________________________
users mailing list
users_at_[hidden]
http://www.open-mpi.org/mailman/listinfo.cgi/users