Dear Reuti and Harvey,

I just tried setting control_slaves to TRUE and it works!

Thank you very much,

Vittorio


On Oct 17, 2007, at 7:48 PM, Reuti wrote:

Hi,

On 17.10.2007 at 18:49, Vittorio Zaccaria wrote:

I am trying to run a very simple application using mpirun in an SGE 6
environment. The job is called 'example' and it is submitted to SGE
with the following command:

qsub -pe parallel 2 example

where 'parallel' is a working parallel environment.
'example' is a very simple script which executes the command 'hostname'
on two MPI nodes (I enabled some debug options):

mpirun --debug-daemons --mca pls_gridengine_debug 1 --mca pls_rsh_agent ssh --prefix /home/dei/931277/openmpi/build/image --mca pls_gridengine_verbose 1 -np 2 hostname

The job fails with the following output:

[compute-1-16:11144] pls:gridengine: final template argv:
[compute-1-16:11144] pls:gridengine:     qrsh -inherit -noshell -nostdin -V -verbose <template> orted --no-daemonize --debug-daemons --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe 931277@compute-1-16:default-universe-11144 --nsreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076" --gprreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076"
[compute-1-16:11144] pls:gridengine: reset PATH: /home/dei/931277/openmpi/build/image/bin:/home/dei/931277/openmpi/build/image/bin:/home/dei/931277/gsl/build/image/bin:/home/dei/931277/openmpi/build/image/bin:/home/dei/931277/openmpi/build/image/bin:/home/dei/931277/gsl/build/image/bin:/apps/sge6/bin/lx24-amd64:/usr/kerberos/bin:/scratch/11780.1.all.q:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/apps/local/bin:/opt/intel_fc_80/bin:/apps/pgi/linux86/5.1/bin:/home/dei/931277/bin:/home/dei/931277/bin:/home/dei/931277/bin
[compute-1-16:11144] pls:gridengine: reset LD_LIBRARY_PATH: /home/dei/931277/openmpi/build/image/lib:/home/dei/931277/openmpi/build/image/lib:/home/dei/931277/gsl/build/image/lib:/home/dei/931277/readline/build/image/lib
[compute-1-16:11144] pls:gridengine: launching on node compute-1-16.hpc.polimi.it
[compute-1-16:11144] pls:gridengine: parent
[compute-1-16:11144] pls:gridengine: launching on node compute-1-8.hpc.polimi.it
[compute-1-16:11144] pls:gridengine: parent
[compute-1-16:11144] pls:gridengine: exec_argv[0]=qrsh, exec_path=//apps/sge6/bin/lx24-amd64/qrsh
[compute-1-16:11144] pls:gridengine: exec_argv[0]=qrsh, exec_path=//apps/sge6/bin/lx24-amd64/qrsh
[compute-1-16:11144] pls:gridengine: orted_path=/home/dei/931277/openmpi/build/image/bin/orted
[compute-1-16:11144] pls:gridengine: changing to directory /home/dei/931277
[compute-1-16:11144] pls:gridengine: orted_path=/home/dei/931277/openmpi/build/image/bin/orted
[compute-1-16:11144] pls:gridengine: changing to directory /home/dei/931277
[compute-1-16:11144] pls:gridengine: executing: qrsh -inherit -noshell -nostdin -V -verbose compute-1-16.hpc.polimi.it /home/dei/931277/openmpi/build/image/bin/orted --no-daemonize --debug-daemons --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename compute-1-16.hpc.polimi.it --universe 931277@compute-1-16:default-universe-11144 --nsreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076" --gprreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076"
[compute-1-16:11144] pls:gridengine: executing: qrsh -inherit -noshell -nostdin -V -verbose compute-1-8.hpc.polimi.it /home/dei/931277/openmpi/build/image/bin/orted --no-daemonize --debug-daemons --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename compute-1-8.hpc.polimi.it --universe 931277@compute-1-16:default-universe-11144 --nsreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076" --gprreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076"
Starting server daemon at host "compute-1-8.hpc.polimi.it"
Starting server daemon at host "compute-1-16.hpc.polimi.it"
error: executing task of job 11780 failed:
error: executing task of job 11780 failed:
[compute-1-16:11144] ERROR: A daemon on node compute-1-8.hpc.polimi.it failed to start as expected.
[compute-1-16:11144] ERROR: There may be more information available from
[compute-1-16:11144] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[compute-1-16:11144] ERROR: If the problem persists, please restart the
[compute-1-16:11144] ERROR: Grid Engine PE job
[compute-1-16:11144] ERROR: The daemon exited unexpectedly with status 1.
[compute-1-16:11144] ERROR: A daemon on node compute-1-16.hpc.polimi.it failed to start as expected.
[compute-1-16:11144] ERROR: There may be more information available from
[compute-1-16:11144] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[compute-1-16:11144] ERROR: If the problem persists, please restart the
[compute-1-16:11144] ERROR: Grid Engine PE job
[compute-1-16:11144] ERROR: The daemon exited unexpectedly with status 1.


It seems that the 'orted' daemons fail to start for some reason, but the
only message given has no explanation after the colon:

error: executing task of job 11780 failed:

Executing 'qstat -t' shows two pending jobs, one marked as MASTER, the
other marked as SLAVE.

Please note that if I run mpirun directly from the command line, it
works fine.

Any suggestions?

When running under SGE, mpirun will try to use the qrsh command. For
this to work, control_slaves must be set to TRUE, and no firewall may
be running on the machines, as a random port will be used for
communication.
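
As a rough sketch of how to check this setting: 'qconf -sp <pe_name>' prints a parallel environment definition, and the helper below (the function name pe_allows_qrsh_inherit is made up for illustration, not part of SGE or Open MPI) scans that output for the control_slaves line. The PE name 'parallel' comes from the qsub command quoted above; the exact layout of the qconf output on a given installation is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: reads a PE definition (as printed by
# `qconf -sp <pe_name>`) on standard input and returns 0 only if it
# contains a line "control_slaves TRUE" (case-insensitive value).
pe_allows_qrsh_inherit() {
    awk '
        $1 == "control_slaves" { found = 1; ok = (toupper($2) == "TRUE") }
        END { if (found && ok) exit 0; exit 1 }
    '
}

# Example use on a submit host (assumes the PE is named "parallel"):
#   if qconf -sp parallel | pe_allows_qrsh_inherit; then
#       echo "control_slaves is TRUE"
#   else
#       echo "control_slaves is not TRUE; change it with: qconf -mp parallel"
#   fi
```

If the check fails, 'qconf -mp parallel' opens the PE definition in an editor so that control_slaves can be changed to TRUE.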

-- Reuti
_______________________________________________
users mailing list
users@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




--
Vittorio Zaccaria, Ph. D.
Politecnico di Milano
Dipartimento di Elettronica e Informazione
Via Giuseppe Ponzio 34/5 - 20133 Milano
E-mail: zaccaria@elet.polimi.it