
Open MPI User's Mailing List Archives


From: Martin Schafföner (martin.schaffoener_at_[hidden])
Date: 2006-06-15 09:42:42


Hi,

I have been trying to set up Open MPI 1.0.3a1r10374 on our cluster and was
partly successful. Partly, because the installation worked, and compiling a
simple example and running it through the rsh pls also worked. However, I am
the only user who has rsh access to the nodes; all other users must go through
Torque and launch MPI apps using Torque's TM subsystem. That is where my
problem starts: I have not been able to launch apps through TM. The TM pls is
configured correctly, and I can see it making connections to the Torque MOM in
the MOM's logfile; however, the app never runs. Even if I request only one
processor, mpiexec spawns several orteds in a row. Here is my session log
(where I kill mpiexec with CTRL-C because it would otherwise run forever):

schaffoe_at_node16:~/tmp/mpitest> mpiexec -np 1 --mca pls_tm_debug 1 --mca pls tm
`pwd`/openmpitest
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name
--num_procs 2 --vpid_start 0 --nodename --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: found /opt/openmpi/bin/orted
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.1 --num_procs 2 --vpid_start 0 --nodename node16 --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name
--num_procs 3 --vpid_start 0 --nodename --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.2 --num_procs 3 --vpid_start 0 --nodename node16 --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name
--num_procs 4 --vpid_start 0 --nodename --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.3 --num_procs 4 --vpid_start 0 --nodename node16 --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
mpiexec: killing job...
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name
--num_procs 5 --vpid_start 0 --nodename --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.4 --num_procs 5 --vpid_start 0 --nodename node16 --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name
--num_procs 6 --vpid_start 0 --nodename --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.5 --num_procs 6 --vpid_start 0 --nodename node16 --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
--------------------------------------------------------------------------
WARNING: mpiexec encountered an abnormal exit.

This means that mpiexec exited before it received notification that all
started processes had terminated. You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------
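For context, our users submit jobs through Torque rather than calling mpiexec interactively. A minimal submission script of the kind they would use looks roughly like the sketch below (the job name, resource line, and binary path are illustrative assumptions, not the exact files from this cluster):

```shell
#!/bin/sh
# Sketch of a Torque/PBS job script; queue and resource settings are assumptions.
#PBS -N openmpitest
#PBS -l nodes=1:ppn=1
#PBS -j oe

cd "$PBS_O_WORKDIR"

# With the TM pls, mpiexec should ask Torque's MOM to spawn the orted
# daemons on the allocated nodes, so no rsh access is needed from inside
# the job.
mpiexec -np 1 ./openmpitest
```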

I read in the README that the TM pls is working, whereas the LaTeX user's
guide says that only rsh and bproc are supported. I am confused...

Can anybody shed some light on this?
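For what it's worth, this is how I would check that the tm pls component was actually compiled into the installation (shown as I would run it; the exact output depends on the build, so I am not pasting any here):

```shell
# List the pls components built into this Open MPI installation.
# If the TM component was compiled in, a "pls: tm" line should appear.
ompi_info | grep pls
```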

Regards,

-- 
Martin Schafföner
Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063