Open MPI User's Mailing List Archives

From: Vittorio Zaccaria (zaccaria_at_[hidden])
Date: 2007-10-17 17:31:06


Dear Reuti and Harvey,

  I just tried setting control_slaves to TRUE, and it works!
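
For reference, the parallel environment configuration now looks roughly
like this (only control_slaves was changed; the other fields and values
below are an illustrative sketch, not my exact site settings):

  $ qconf -sp parallel
  pe_name            parallel
  slots              ...
  allocation_rule    $fill_up          # illustrative
  control_slaves     TRUE              # the setting that fixed it
  job_is_first_task  FALSE             # illustrative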

Thank you very much,

Vittorio

On Oct 17, 2007, at 7:48 PM, Reuti wrote:

> Hi,
>
> Am 17.10.2007 um 18:49 schrieb Vittorio Zaccaria:
>
>> I am just trying to run a very simple application using mpirun in
>> an SGE 6 environment.
>> The job is called 'example' and it is submitted to the SGE
>> environment with the following command:
>>
>>> qsub -pe parallel 2 example
>>
>> where 'parallel' is a working parallel environment.
>> 'example' is a very simple script which executes the command
>> 'hostname' on two MPI nodes (I enabled some debug options):
>>
>> mpirun --debug-daemons --mca pls_gridengine_debug 1 \
>>     --mca pls_rsh_agent ssh \
>>     --prefix /home/dei/931277/openmpi/build/image \
>>     --mca pls_gridengine_verbose 1 \
>>     -np 2 hostname
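>>
>> ('example' itself is little more than a shell script around that
>> mpirun line; a minimal sketch, with illustrative '#$' header lines
>> rather than my actual ones:
>>
>>   #!/bin/sh
>>   #$ -cwd
>>   #$ -S /bin/sh
>>   mpirun ... -np 2 hostname    # the full command shown above
>> )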
>>
>> The job fails with the following output:
>>
>> [compute-1-16:11144] pls:gridengine: final template argv:
>> [compute-1-16:11144] pls:gridengine: qrsh -inherit -noshell -nostdin -V -verbose <template> orted --no-daemonize --debug-daemons --bootproxy 1 --name <template> --num_procs 3 --vpid_start 0 --nodename <template> --universe 931277_at_compute-1-16:default-universe-11144 --nsreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076" --gprreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076"
>> [compute-1-16:11144] pls:gridengine: reset PATH: /home/dei/931277/openmpi/build/image/bin:/home/dei/931277/openmpi/build/image/bin:/home/dei/931277/gsl/build/image/bin:/home/dei/931277/openmpi/build/image/bin:/home/dei/931277/openmpi/build/image/bin:/home/dei/931277/gsl/build/image/bin:/apps/sge6/bin/lx24-amd64:/usr/kerberos/bin:/scratch/11780.1.all.q:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:/apps/local/bin:/opt/intel_fc_80/bin:/apps/pgi/linux86/5.1/bin:/home/dei/931277/bin:/home/dei/931277/bin:/home/dei/931277/bin
>> [compute-1-16:11144] pls:gridengine: reset LD_LIBRARY_PATH: /home/dei/931277/openmpi/build/image/lib:/home/dei/931277/openmpi/build/image/lib:/home/dei/931277/gsl/build/image/lib:/home/dei/931277/readline/build/image/lib
>> [compute-1-16:11144] pls:gridengine: launching on node compute-1-16.hpc.polimi.it
>> [compute-1-16:11144] pls:gridengine: parent
>> [compute-1-16:11144] pls:gridengine: launching on node compute-1-8.hpc.polimi.it
>> [compute-1-16:11144] pls:gridengine: parent
>> [compute-1-16:11144] pls:gridengine: exec_argv[0]=qrsh, exec_path=//apps/sge6/bin/lx24-amd64/qrsh
>> [compute-1-16:11144] pls:gridengine: exec_argv[0]=qrsh, exec_path=//apps/sge6/bin/lx24-amd64/qrsh
>> [compute-1-16:11144] pls:gridengine: orted_path=/home/dei/931277/openmpi/build/image/bin/orted
>> [compute-1-16:11144] pls:gridengine: changing to directory /home/dei/931277
>> [compute-1-16:11144] pls:gridengine: orted_path=/home/dei/931277/openmpi/build/image/bin/orted
>> [compute-1-16:11144] pls:gridengine: changing to directory /home/dei/931277
>> [compute-1-16:11144] pls:gridengine: executing: qrsh -inherit -noshell -nostdin -V -verbose compute-1-16.hpc.polimi.it /home/dei/931277/openmpi/build/image/bin/orted --no-daemonize --debug-daemons --bootproxy 1 --name 0.0.1 --num_procs 3 --vpid_start 0 --nodename compute-1-16.hpc.polimi.it --universe 931277_at_compute-1-16:default-universe-11144 --nsreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076" --gprreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076"
>> [compute-1-16:11144] pls:gridengine: executing: qrsh -inherit -noshell -nostdin -V -verbose compute-1-8.hpc.polimi.it /home/dei/931277/openmpi/build/image/bin/orted --no-daemonize --debug-daemons --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename compute-1-8.hpc.polimi.it --universe 931277_at_compute-1-16:default-universe-11144 --nsreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076" --gprreplica "0.0.0;tcp://192.168.1.116:33076;tcp://172.16.1.116:33076"
>> Starting server daemon at host "compute-1-8.hpc.polimi.it"
>> Starting server daemon at host "compute-1-16.hpc.polimi.it"
>> error: executing task of job 11780 failed:
>> error: executing task of job 11780 failed:
>> [compute-1-16:11144] ERROR: A daemon on node compute-1-8.hpc.polimi.it failed to start as expected.
>> [compute-1-16:11144] ERROR: There may be more information available from
>> [compute-1-16:11144] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>> [compute-1-16:11144] ERROR: If the problem persists, please restart the
>> [compute-1-16:11144] ERROR: Grid Engine PE job
>> [compute-1-16:11144] ERROR: The daemon exited unexpectedly with status 1.
>> [compute-1-16:11144] ERROR: A daemon on node compute-1-16.hpc.polimi.it failed to start as expected.
>> [compute-1-16:11144] ERROR: There may be more information available from
>> [compute-1-16:11144] ERROR: the 'qstat -t' command on the Grid Engine tasks.
>> [compute-1-16:11144] ERROR: If the problem persists, please restart the
>> [compute-1-16:11144] ERROR: Grid Engine PE job
>> [compute-1-16:11144] ERROR: The daemon exited unexpectedly with status 1.
>>
>>
>> It seems that the 'orted' daemons fail to start for some reason,
>> but no message is given:
>>
>> error: executing task of job 11780 failed:
>>
>> Executing 'qstat -t' shows two pending jobs, one marked as MASTER,
>> the other marked as SLAVE.
>>
>> Please note that if I run mpirun directly from the command line, it
>> just works fine.
>>
>> Any suggestions?
>
> When running under SGE, Open MPI will try to use the qrsh command.
> For this to work, control_slaves must be set to TRUE in the parallel
> environment, and no firewall may be running on the machines, as a
> random port is used for communication.
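>
> For example (assuming the PE is really named 'parallel', as in your
> qsub command), the setting can be checked and changed with:
>
>   qconf -sp parallel | grep control_slaves    # show the current value
>   qconf -mp parallel                          # edit the PE; set "control_slaves TRUE"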
>
> -- Reuti
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Vittorio Zaccaria, Ph. D.
Politecnico di Milano
Dipartimento di Elettronica e Informazione
Via Giuseppe Ponzio 34/5 - 20133 Milano
E-mail: zaccaria_at_[hidden]