Open MPI User's Mailing List Archives

Subject: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed)
From: Derrick LIN (klin938_at_[hidden])
Date: 2011-04-15 00:53:00


Hi all,

I am trying to set up a small SGE cluster with OpenMPI integrated, but I am
totally stuck when trying to submit an OpenMPI job to an SGE PE.

I mainly followed the guide sge-snow.pdf from Revolution Computing and
http://idolinux.blogspot.com/2010/04/quick-install-of-open-mpi-with-grid.html

The cluster is entirely Ubuntu 10.10 based. Both SGE 6.2u5 and OpenMPI 1.3
come directly from apt-get, except that OpenMPI was rebuilt from source with
the --with-sge flag.

Note: OpenMPI has been installed on all execution hosts, but not on the queue
master or the submission host.

I submitted a job with

qsub -pe orte 8 ./ompi_job.sh
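For context, a job script of this kind might look roughly like the sketch below. The actual contents of ompi_job.sh are not reproduced in this post, so every line here is an assumption: the `#$` directives, the binary name ./ompi_job (borrowed from the mpirun test further down), and the fallback that only prints the command when mpirun or the binary is missing. NSLOTS and PE_HOSTFILE are the variables SGE normally sets for a job running under a PE.

```shell
#!/bin/sh
# Hypothetical ompi_job.sh: a sketch, not the script actually used in this post.
#$ -S /bin/sh
#$ -cwd
#$ -j y

# SGE sets NSLOTS to the number of slots granted by the PE (8 for "-pe orte 8").
# Default to 1 so the script can also be tried outside SGE.
NSLOTS="${NSLOTS:-1}"

# Print what the job actually sees on the execution host.
echo "host:        $(hostname)"
echo "NSLOTS:      $NSLOTS"
echo "PE_HOSTFILE: ${PE_HOSTFILE:-<unset>}"
echo "PATH:        $PATH"

# With an SGE-aware OpenMPI build, mpirun is expected to pick up the granted
# hosts from SGE itself, so no host list is passed here (assumption).
CMD="mpirun -np $NSLOTS ./ompi_job"

if command -v mpirun >/dev/null 2>&1 && [ -x ./ompi_job ]; then
    $CMD
else
    # Dry run: mpirun or the binary is missing in this environment.
    echo "would run: $CMD"
fi
```

The echo lines are there only to make the job's output file show which environment the PE job inherited, which can then be compared with an interactive login on the same execution host.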

The error I got looks like this:
=============================================================================================================================

[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c
at line 161
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
file ../../../orte/runtime/orte_init.c
at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in
file ../../../../../orte/tools/orterun/orterun.c
at line 541

==============================================================================================================================

For troubleshooting, I have tried the following:

1) Passwordless SSH has been configured properly for the execution hosts
and the queue master.

pwbcad_at_sgeqmast01:~$ ssh sgeqexec01 uptime
 14:35:54 up 2:47, 1 user, load average: 0.10, 0.08, 0.02

2) I can run an OpenMPI job outside SGE successfully.

mpirun -host n1,n2 -np 8 ./ompi_job

3) I submitted the job directly to a queue instead of a PE, and it ran
and completed successfully.

qsub -q dev.q ./ompi_job.sh

4) Although I don't think PATH and LD_LIBRARY_PATH would cause issues on
Ubuntu, I still added the OpenMPI binaries and libraries to both, but it
didn't help.
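The checks above can be gathered into a small script to run on an execution host, ideally from inside a PE job so that it sees the same environment as the failing mpirun. This is only a sketch: the helper name `report` is made up for illustration, the PE name orte is taken from the qsub line above, and the assumption is that an SGE-enabled build lists a gridengine component in the ompi_info output.

```shell
#!/bin/sh
# report NAME: print where a command resolves, or say it is missing from PATH.
# ("report" is a hypothetical helper, not part of OpenMPI or SGE.)
report() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $(command -v "$1")"
    else
        echo "$1: not found in PATH"
    fi
}

# Which OpenMPI tools does this environment pick up, if any?
report mpirun
report orted
report ompi_info
echo "LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-<unset>}"

# Does the build found first in PATH include SGE support?
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i gridengine || echo "no gridengine component listed"
fi

# Show the PE definition, if the SGE admin tools are available here.
if command -v qconf >/dev/null 2>&1; then
    qconf -sp orte || echo "could not read PE 'orte'"
fi
```

If the paths printed inside the PE job differ from those seen in an interactive shell, the job may be picking up the apt-get build rather than the one rebuilt with --with-sge.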

I would really appreciate it if anyone could share their experience!

Derrick