Open MPI User's Mailing List Archives

From: Brock Palen (brockp_at_[hidden])
Date: 2006-06-15 10:05:19


I don't know about the ompi-1.0.3 snapshots, but we use ompi-1.0.2 with
both torque-2.0.0p8 and torque-2.1.0p0 using the TM interface without
any problems.
Are you using PBSPro? OpenPBS?
As for your mpiexec: is it the one included with Open MPI (just a
symlink to orterun), or the standalone one from OSC
(http://www.osc.edu/~pw/mpiexec/index.php)?
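
A couple of quick checks may help narrow this down (a rough sketch; the
grep pattern, the example --with-tm path, and qstat's --version flag are
assumptions you may need to adjust for your site):

# Is mpiexec Open MPI's orterun symlink or OSC's standalone mpiexec?
ls -l `which mpiexec`

# Was this Open MPI build compiled with TM support? The tm pls/ras
# components only show up if configure found the Torque libraries,
# e.g. ./configure --with-tm=/opt/torque (path is just an example).
ompi_info | grep -i tm

# Which batch system and version is this? (Torque's qstat accepts
# --version; PBSPro/OpenPBS may differ.)
qstat --version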

Brock Palen
Center for Advanced Computing
brockp_at_[hidden]
(734)936-1985

On Jun 15, 2006, at 9:42 AM, Martin Schafföner wrote:

> Hi,
>
> I have been trying to set up OpenMPI 1.0.3a1r10374 on our cluster and was
> partly successful. Partly, because the installation worked, and compiling a
> simple example and running it through the rsh pls also worked. However, I'm
> the only user who has rsh access to the nodes; all other users must go
> through Torque and launch MPI apps using Torque's TM subsystem. That's where
> my problem starts: I was not successful in launching apps through TM. The TM
> pls is configured okay, and I can see it making connections to the Torque MOM
> in the MOM's logfile; however, the app never gets run. Even if I only request
> one processor, mpiexec spawns several orteds in a row. Here is my session log
> (where I kill mpiexec with CTRL-C because it would otherwise run forever):
>
> schaffoe@node16:~/tmp/mpitest> mpiexec -np 1 --mca pls_tm_debug 1 --mca pls tm `pwd`/openmpitest
> [node16:03113] pls:tm: final top-level argv:
> [node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name --num_procs 2 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
> [node16:03113] pls:tm: launching on node node16
> [node16:03113] pls:tm: found /opt/openmpi/bin/orted
> [node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
> [node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
> [node16:03113] pls:tm: final top-level argv:
> [node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name --num_procs 3 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
> [node16:03113] pls:tm: launching on node node16
> [node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
> [node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
> [node16:03113] pls:tm: final top-level argv:
> [node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name --num_procs 4 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
> [node16:03113] pls:tm: launching on node node16
> [node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
> [node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.3 --num_procs 4 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
> mpiexec: killing job...
> [node16:03113] pls:tm: final top-level argv:
> [node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name --num_procs 5 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
> [node16:03113] pls:tm: launching on node node16
> [node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
> [node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
> [node16:03113] pls:tm: final top-level argv:
> [node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name --num_procs 6 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
> [node16:03113] pls:tm: launching on node node16
> [node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
> [node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.5 --num_procs 6 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
> --------------------------------------------------------------------------
> WARNING: mpiexec encountered an abnormal exit.
>
> This means that mpiexec exited before it received notification that all
> started processes had terminated. You should double check and ensure
> that there are no runaway processes still executing.
> --------------------------------------------------------------------------
>
>
> I read in the README that the TM pls is working, whereas the LaTeX user's
> guide says that only rsh and bproc are supported. I am confused...
>
> Can anybody shed some light on this?
>
> Regards,
> --
> Martin Schafföner
>
> Cognitive Systems Group, Institute of Electronics, Signal Processing and
> Communication Technologies, Department of Electrical Engineering,
> Otto-von-Guericke University Magdeburg
> Phone: +49 391 6720063
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users