Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] newbie: Submitting Open MPI jobs to SGE ( `qsh -pe orte 4` fails)
From: Reuti (reuti_at_[hidden])
Date: 2013-02-08 14:15:57


Hi,

On 08.02.2013 at 19:36, Pierre LINDENBAUM wrote:

> ( cross-posted on SO: http://stackoverflow.com/questions/14775451 )
> I'm very new to Open MPI and I'm trying to submit an Open MPI job to SGE:
>
>
> I've installed openmpi, not in
> /usr/...
> but in
> /commun/data/packages/openmpi/
>
> it was compiled with --with-sge.
>
> I've added a new PE in SGE with qconf as described in
> http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html
>
> # /commun/data/packages/openmpi/bin/ompi_info | grep gridengine
> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.3)
>
> # qconf -sq all.q | grep pe_
> pe_list make orte
>
> Without SGE, the program runs without any problem, using several processors.
>
> /commun/data/packages/openmpi/bin/orterun -np 20 ./a.out args
>
> Now I want to submit my program to SGE
>
> In the Open MPI FAQ, I read:
>
> # Allocate a SGE interactive job with 4 slots
> # from a parallel environment (PE) named 'orte'
> shell$ qsh -pe orte 4
>
> but my output is:
>
> qsh -pe orte 4
> Your job 84550 ("INTERACTIVE") has been submitted
> waiting for interactive job to be scheduled ...
> Could not start interactive job.

An INTERACTIVE job behaves like an immediate job, i.e. one submitted with "-now y": it must be scheduled right away or it fails. Do you have an interactive queue configured, and is the cluster empty right now?
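
One way to check the first point is to look at the qtype line of the queue definition. The following is only a sketch: the helper name has_interactive_qtype is made up for illustration, and it assumes the queue in question is all.q as shown earlier in this thread. It reads the output of `qconf -sq` on stdin and succeeds if INTERACTIVE is listed.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if the queue definition on stdin,
# as printed by `qconf -sq <queue>`, lists INTERACTIVE in its qtype line.
has_interactive_qtype() {
    awk '$1 == "qtype" { for (i = 2; i <= NF; i++) if ($i == "INTERACTIVE") found = 1 }
         END { exit (found ? 0 : 1) }'
}

# Only query the cluster when the SGE tools are actually installed.
# The queue name all.q is taken from the original post.
if command -v qconf >/dev/null 2>&1; then
    if qconf -sq all.q | has_interactive_qtype; then
        echo "all.q accepts interactive jobs"
    else
        echo "all.q does not list INTERACTIVE in qtype"
    fi
fi
```

If the queue only lists BATCH, qsh cannot be scheduled there no matter how empty the cluster is.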

> I've also tried the mpirun command embedded in a script:
>
> $ cat ompi.sh
> #!/bin/sh
> /commun/data/packages/openmpi/bin/mpirun \
> /path/to/a.out args
>
> but it fails
>
> $ cat ompi.sh.e84552
> error: executing task of job 84552 failed: execution daemon on host
> "node02" didn't accept task

This is a good sign, as it already tries to use `qrsh -inherit ...`. Can you confirm the following settings:

$ qconf -sp orte
...
control_slaves TRUE

$ qconf -sq all.q
...
shell_start_mode unix_behavior
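
Both settings can also be checked in one go. This is a rough sketch, not an official tool: check_setting is a hypothetical helper that reads `qconf` output on stdin and compares the value of one attribute against an expected value. The PE name orte and the queue name all.q are the ones used above.

```shell
#!/bin/sh
# Hypothetical helper: check_setting NAME EXPECTED
# Reads `qconf -sp ...` or `qconf -sq ...` output on stdin and reports
# whether attribute NAME has the value EXPECTED.
check_setting() {
    name=$1
    expected=$2
    # Take the second field of the first line whose first field is NAME.
    actual=$(awk -v n="$name" '$1 == n { print $2; exit }')
    if [ "$actual" = "$expected" ]; then
        echo "ok: $name is $expected"
    else
        echo "MISMATCH: $name is '${actual:-unset}', expected '$expected'"
        return 1
    fi
}

# Only query the cluster when the SGE tools are actually installed.
if command -v qconf >/dev/null 2>&1; then
    qconf -sp orte  | check_setting control_slaves TRUE
    qconf -sq all.q | check_setting shell_start_mode unix_behavior
fi
```

A MISMATCH line for either attribute points at the setting to change with `qconf -mp orte` or `qconf -mq all.q`.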

-- Reuti

> --------------------------------------------------------------------------
> A daemon (pid 18327) died unexpectedly with status 1 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> error: executing task of job 84552 failed: execution daemon on host
> "node01" didn't accept task
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
>
> How can I fix this?
>
> Many thanks