
Open MPI User's Mailing List Archives


From: Martin Schafföner (martin.schaffoener_at_[hidden])
Date: 2006-06-16 09:49:39


On Friday 16 June 2006 15:00, Jeff Squyres (jsquyres) wrote:
> Try two things:
>
> 1. Use the pbsdsh command to try to launch a trivial non-MPI application
> (e.g., hostname):
>
> (inside a PBS job)
> pbsdsh -<N> -v hostname
>
> where <N> is the number of vcpu's in your job.
>
> 2. If that works, try mpirun'ing a trivial non-MPI application (e.g.,
> hostname):
>
> (inside a PBS job)
> mpirun -np <N> -d --mca pls_tm_debug 1 hostname
>
> If #1 fails, then there is something wrong with your Torque installation
> (pbsdsh uses the same PBS API that Open MPI does), and Open MPI's failure
> is a symptom of the underlying problem. If #1 succeeds and #2 fails, send
> back the details and let's go from there.
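For reference, the two tests above could be wrapped in a single PBS job script along these lines (the resource request, job name, and process count are placeholders for illustration, not values taken from this thread):

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2     # placeholder request: 4 vcpus total
#PBS -N ompi-tm-test

# Test 1: launch a trivial non-MPI program through the PBS TM API.
# By default pbsdsh spawns one copy per allocated vcpu.
pbsdsh -v hostname

# Test 2: launch the same program through Open MPI's TM launcher,
# with debugging output from the pls_tm component enabled.
mpirun -np 4 -d --mca pls_tm_debug 1 hostname
```

If test 1 prints the allocated hostnames but test 2 hangs or aborts, that points at the Open MPI side rather than the Torque installation.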

So, #1 works (I know because we're constantly using pbsdsh and OSC's mpiexec for MPICH-type implementations). #2 doesn't work; I'll repeat the session log from my first post, this time (hopefully!) with line breaks (could it be that the mailing list has problems with UTF-8 posts?):

schaffoe_at_node16:~/tmp/mpitest> mpiexec -np 1 --mca pls_tm_debug 1 --mca pls tm
`pwd`/openmpitest
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 2 --vpid_start 0 --nodename  --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: found /opt/openmpi/bin/orted
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.1 --num_procs 2 --vpid_start 0 --nodename node16 --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 3 --vpid_start 0 --nodename  --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.2 --num_procs 3 --vpid_start 0 --nodename node16 --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 4 --vpid_start 0 --nodename  --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.3 --num_procs 4 --vpid_start 0 --nodename node16 --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
mpiexec: killing job...
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 5 --vpid_start 0 --nodename  --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.4 --num_procs 5 --vpid_start 0 --nodename node16 --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 6 --vpid_start 0 --nodename  --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name
0.0.5 --num_procs 6 --vpid_start 0 --nodename node16 --universe
schaffoe_at_node16:default-universe-3113 --nsreplica
"0.0.0;tcp://192.168.1.16:60601" --gprreplica
"0.0.0;tcp://192.168.1.16:60601"
--------------------------------------------------------------------------
WARNING: mpiexec encountered an abnormal exit.

This means that mpiexec exited before it received notification that all
started processes had terminated.  You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------
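The warning above asks you to check for runaway processes. One quick way to do that on the affected node is to look for leftover orted daemons (Open MPI's per-node runtime daemon, visible in the log above) under your own user; the exact commands below are a sketch, not part of the original session:

```shell
# List any leftover orted daemons still running under the current user
# after the abnormal mpiexec exit (-l prints PID and process name).
pgrep -u "$USER" -l orted || echo "no leftover orted processes"

# If any were listed, terminate them; '|| true' keeps the script going
# when there is nothing to kill.
pkill -u "$USER" orted || true
```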

CU,

-- 
Martin Schafföner
Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063