Ick is the proper response. :-)
The old 1.2 series would attempt to spawn a local orted on each of
those nodes, and that is what is failing. Best guess is that pbsdsh
doesn't fully replicate a key part of the environment that mpirun
expects.
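If you want to check that guess, you could diff the environment pbsdsh
hands a task against the one in the batch script - something like this
(assuming vnode 0 lands on the node running the batch script, so both
files end up on the same machine):

  # capture the batch script's environment
  env | sort > /tmp/batch.env
  # capture what pbsdsh gives a task on vnode 0 (written on that node)
  pbsdsh -n 0 bash -c 'env | sort > /tmp/pbsdsh.env'
  # compare the two
  diff /tmp/batch.env /tmp/pbsdsh.env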
One thing you could try is to do this with 1.3.1. It will just fork/
exec the local application instead of trying to start a daemon, so the
odds are much better that it will work.
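So the same wrapper should work there, or - since the code runs fine
as a singleton - you could drop mpirun entirely. A rough sketch
("./app" is just a stand-in for whatever binary the user is actually
running):

  # with 1.3.1, mpirun -np 1 just fork/execs the app locally
  pbsdsh bash -c 'cd $PBS_O_WORKDIR/$PBS_VNODENUM; mpirun -np 1 ./app'

  # or skip mpirun altogether and let the app run as a singleton
  pbsdsh bash -c 'cd $PBS_O_WORKDIR/$PBS_VNODENUM; ./app'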
I don't know of any native way to get mpirun to launch a farm - it
will always set the comm_size to the total #procs. I suppose we could
add that option, if people want it - wouldn't be very hard to implement.
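In the meantime, one workaround for farm-like behavior is to fire off
a separate mpirun per task, so each one gets its own COMM_WORLD of
size 1 and its own working directory. A rough sketch (the directory
layout and "./app" are made up, and where each instance lands depends
on your allocation):

  # launch one independent 1-proc run per entry in $PBS_NODEFILE
  i=0
  while read host; do
      mpirun -np 1 -host $host -wdir $PBS_O_WORKDIR/$i ./app &
      i=$((i+1))
  done < $PBS_NODEFILE
  wait   # block until all the farmed instances have finished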
On Apr 1, 2009, at 8:49 AM, Brock Palen wrote:
> Ok this is weird, and the correct answer is probably "don't do that".
> A user wants to run many, many small jobs, faster than our scheduler
> + Torque can start them, so he uses pbsdsh to start them in parallel
> under TM:
> pbsdsh bash -c 'cd $PBS_O_WORKDIR/$PBS_VNODENUM; mpirun -np 1
> This is kinda silly, because the code, while MPI based, does not
> require mpirun to start when run on a single rank, and runs just
> fine if you leave mpirun off.
> What happens, though, if you do leave it on (this is with
> ompi-1.2.x)? You get errors like:
> [nyx428.engin.umich.edu:01929] pls:tm: failed to poll for a spawned
> proc, return status = 17002
> [nyx428.engin.umich.edu:01929] [0,0,0] ORTE_ERROR_LOG: In errno in
> file rmgr_urm.c at line 462
> Kinda makes sense: pbsdsh has already started mpirun under TM, and
> now mpirun is trying to start a process also under TM. In fact, with
> older versions (1.2.0), the above will work fine only for the first
> TM node; any second node will hang at poll() if you strace it.
> So we can solve the above by not using mpirun to start single
> processes under TM that were themselves spawned by TM in the first
> place. Just thought you would like to know.
> Is there a way to have mpirun spawn all the processes like pbsdsh
> does? The problem is that the code is MPI based, so if you say
> "run 4" it's going to do the normal COMM_SIZE=4, only read the first
> input, etc. Also, we have to change the CWD of each rank. So, can
> you make mpirun farm?
> Brock Palen
> Center for Advanced Computing