Open MPI User's Mailing List Archives

Subject: [OMPI users] newbie: Submitting Open MPI jobs to SGE ( `qsh -pe orte 4` fails)
From: Pierre LINDENBAUM (pierre.lindenbaum_at_[hidden])
Date: 2013-02-08 13:36:38


( cross-posted on SO: http://stackoverflow.com/questions/14775451 )

Hi,
I'm very new to Open MPI and I'm trying to submit an Open MPI job to SGE.

I've installed Open MPI, not in
  /usr/...
but in
   /commun/data/packages/openmpi/

It was compiled with --with-sge.

I've added a new PE to SGE with qconf, as described in
http://docs.oracle.com/cd/E19080-01/n1.grid.eng6/817-5677/6ml49n2c0/index.html

  # /commun/data/packages/openmpi/bin/ompi_info | grep gridengine
  MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.3)

  # qconf -sq all.q | grep pe_
  pe_list make orte
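
The PE definition itself may also matter here. Below is a minimal sketch of how its settings could be checked, not a definitive diagnosis. The helper name `pe_setting` and the sample text are hypothetical; on a real cluster the input would come from `qconf -sp orte`. The example PE in the Open MPI FAQ uses `control_slaves TRUE` and `job_is_first_task FALSE`, so those are the two attributes the sketch prints.

```shell
#!/bin/sh
# pe_setting NAME: read `qconf -sp <pe>`-style text on stdin and print
# the value of attribute NAME (hypothetical helper, not part of SGE).
pe_setting() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Sample text standing in for real `qconf -sp orte` output (assumed values,
# not taken from this cluster).
sample_pe='pe_name            orte
slots              99999
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE'

# Print the two attributes of interest from the sample text.
printf '%s\n' "$sample_pe" | pe_setting control_slaves
printf '%s\n' "$sample_pe" | pe_setting job_is_first_task
```

On the cluster, the same helper could be fed real data with `qconf -sp orte | pe_setting control_slaves`.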

Without SGE, the program runs without any problem, using several processors.

       /commun/data/packages/openmpi/bin/orterun -np 20 ./a.out args

Now I want to submit my program to SGE.

In the Open MPI FAQ, I read:

  # Allocate a SGE interactive job with 4 slots
  # from a parallel environment (PE) named 'orte'
  shell$ qsh -pe orte 4

but my output is:

   qsh -pe orte 4
   Your job 84550 ("INTERACTIVE") has been submitted
   waiting for interactive job to be scheduled ...
   Could not start interactive job.

I've also tried the mpirun command embedded in a script:

   $ cat ompi.sh
   #!/bin/sh
   /commun/data/packages/openmpi/bin/mpirun \
         /path/to/a.out args

but it fails:

  $ cat ompi.sh.e84552
  error: executing task of job 84552 failed: execution daemon on host "node02" didn't accept task
   --------------------------------------------------------------------------
  A daemon (pid 18327) died unexpectedly with status 1 while attempting
  to launch so we are aborting.

  There may be more information reported by the environment (see above).

  This may be because the daemon was unable to find all the needed shared
  libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
  location of the shared libraries on the remote nodes and this will
  automatically be forwarded to the remote nodes.
  --------------------------------------------------------------------------
  error: executing task of job 84552 failed: execution daemon on host "node01" didn't accept task
  --------------------------------------------------------------------------
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.

How can I fix this?

Many thanks