On Tue, Dec 8, 2009 at 12:01 PM, Ross Boylan <email@example.com> wrote:
What is the difference between running a set of programs with
independent invocations of mpirun vs specifying --app? The programs do
not need to talk to each other.
I think that if one job fails it will take the others down if I use
--app. Is that correct? This is the main reason I'm considering
Yes - the job is terminated in that situation.
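For comparison, a minimal sketch of the "independent invocations" approach: each command is started in the background and waited on separately, so one failure does not stop the others. The function name run_independent and the per-job report format are illustrative assumptions, not part of Open MPI.

```shell
# Start every command line passed as an argument in the background,
# then wait for each one and report its exit status separately.
# Returns the number of jobs that failed.
run_independent() {
    pids=""
    for cmd in "$@"; do
        sh -c "$cmd" &
        pids="$pids $!"
    done
    n=0
    failed=0
    for pid in $pids; do
        n=$((n + 1))
        status=0
        wait "$pid" || status=$?
        echo "job $n: exit status $status"
        [ "$status" -eq 0 ] || failed=$((failed + 1))
    done
    return "$failed"
}

# Hypothetical usage with the programs from this thread:
#   run_independent "mpirun -np 1 prog1" "mpirun -np 1 prog2"
```

Note that this does nothing about placement: each mpirun still maps its procs without knowing about the others.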
On the other hand, if my app file is something like
-np 1 prog1
-np 1 prog2
I believe I will avoid oversubscription. But, if I do
mpirun -np 1 prog1
mpirun -np 1 prog2
do I end up oversubscribing the first node?
Yes - each invocation of mpirun has no idea what the other one is doing, so both will place their procs starting with the first available node.
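One way to work around this with separate invocations is to name a host for each mpirun explicitly. The sketch below only prints the command lines, handing out hosts round-robin. The function name assign_hosts and the node names are hypothetical, and it assumes the installed mpirun accepts --host (worth checking with mpirun --help on 1.2.7).

```shell
# Print one mpirun command per program, pinning each program to a host
# taken round-robin from a space-separated host list.
#   $1     : host list, e.g. "node1 node2"
#   $2...  : programs to run
assign_hosts() {
    hosts=$1
    shift
    # Unquoted $hosts is intentional: it collapses extra whitespace.
    count=$(($(echo $hosts | wc -w)))
    [ "$count" -gt 0 ] || return 1
    i=0
    for prog in "$@"; do
        host=$(echo $hosts | cut -d' ' -f"$((i % count + 1))")
        echo "mpirun --host $host -np 1 $prog"
        i=$((i + 1))
    done
}

# Hypothetical usage:
#   assign_hosts "node1 node2" prog1 prog2
```

The printed lines can be inspected first and then run by hand or fed to a shell.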
It would also be nice if OMPI automatically picked the least loaded node
(the load might come from other programs), but it does not appear it
takes this into account. Is that right? The FAQ mentions load leveler,
but we don't seem to have it installed.
Can you update to 1.3.4? If so, you can level the loading by using
--loadbalance on the cmd line and OMPI will map your procs accordingly.
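If updating is not an option, a rough workaround is to pick the host yourself before calling mpirun. The sketch below is not something OMPI does; it assumes Linux nodes, passwordless ssh, and that the 1-minute load average from /proc/loadavg is a good enough measure. The function names and node names are hypothetical.

```shell
# Read "host load" pairs on stdin and print the host with the lowest load.
# On a tie, the first host seen wins.
least_loaded() {
    awk 'NR == 1 || $2 + 0 < min { min = $2 + 0; host = $1 }
         END { if (NR > 0) print host }'
}

# Print "host load" for each host given as an argument, using the
# 1-minute load average read over ssh. Unreachable hosts are skipped.
collect_loads() {
    for host in "$@"; do
        load=$(ssh -n "$host" "cut -d' ' -f1 /proc/loadavg") || continue
        echo "$host $load"
    done
}

# Hypothetical usage:
#   host=$(collect_loads node1 node2 node3 | least_loaded)
#   mpirun --host "$host" -np 1 prog1
```

The load is sampled once at launch time, so two jobs started back to back can still land on the same node.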
Context: we have a cluster without a batch system or scheduler, and want
to run multiple independent jobs at once. The cluster is running Debian
Lenny -> OMPI 1.2.7rc2.
We have a subproject called Open Resilient Cluster Manager that will allow the job to continue when individual procs die. Not released yet, but you can see the project at
I have used those techniques to modify mpirun to support process continuation (to be committed to the devel trunk soon, for release later), but the MPI connection restoration is still being worked on. So it works fine for non-MPI applications, but won't help for MPI apps right now.
I will probably modify mpirun at the same time to allow independent jobs to continue running if one job fails. This will require a flag to mpirun, though, as otherwise it would be very hard for me to know that the jobs are in fact independent - the runtime layer doesn't know what MPI connections are being made.
Thanks for any help you can offer.