Hi all,
I am trying to set up a small SGE cluster with OpenMPI integration, but I am completely stuck when submitting an OpenMPI job to an SGE parallel environment (PE).
I mainly followed the guide sge-snow.pdf from Revolutions Computing and http://idolinux.blogspot.com/2010/04/quick-install-of-open-mpi-with-grid.html
The cluster is entirely Ubuntu 10.10 based. Both SGE 6.2u5 and OpenMPI 1.3 come from apt-get, except that OpenMPI was rebuilt from source with the --with-sge flag.
Note: OpenMPI is installed on all execution hosts, but not on the queue master or the submission host.
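In case it helps anyone reproduce this: one quick way to confirm that the rebuild actually picked up SGE support is to look for the gridengine component in the ompi_info output on an execution host. A minimal sketch, assuming ompi_info from the rebuilt install is the one on PATH (the helper name has_gridengine is made up for this post):

```shell
# Sketch: report whether an Open MPI build lists the gridengine component.
# Reads `ompi_info` output on stdin, so it can be fed from any host.
# has_gridengine is a hypothetical helper name, not part of Open MPI.
has_gridengine() {
    if grep -q gridengine; then
        echo "gridengine support found"
    else
        echo "gridengine support missing"
    fi
}

# Run the check only where ompi_info is actually installed.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | has_gridengine
fi
```

If this reports the component as missing, the binaries being picked up are probably not the ones built with --with-sge.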
I submitted a job with:
qsub -pe orte 8 ./ompi_job.sh
The error I got looks like this:
=============================================================================================================================
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c
at line 161
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_plm_base_select failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../orte/runtime/orte_init.c
at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_ess_set_name failed
--> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[sgeqexec01:06612] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ../../../../../orte/tools/orterun/orterun.c
at line 541
==============================================================================================================================
For troubleshooting, I have tried the following:
1) Passwordless SSH is configured properly between the execution hosts and the queue master.
pwbcad@sgeqmast01:~$ ssh sgeqexec01 uptime
14:35:54 up 2:47, 1 user, load average: 0.10, 0.08, 0.02
2) I can run an OpenMPI job outside of SGE successfully.
mpirun -host n1,n2 -np 8 ./ompi_job
3) When I submit the job directly to a queue instead of a PE, it runs and completes successfully.
qsub -q dev.q ./ompi_job.sh
4) Although I don't think PATH and LD_LIBRARY_PATH should cause issues on Ubuntu, I still added the OpenMPI binaries and libraries to both. It didn't help.
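Since a batch job does not necessarily see the same environment as a login shell, a small diagnostic job can show what is really set when SGE runs the script. A rough sketch, assuming it is saved as a script and submitted with the same qsub -pe orte 8 options (report_env is a made-up helper name; NSLOTS and PE_HOSTFILE are the variables SGE sets for PE jobs):

```shell
#!/bin/sh
# Diagnostic sketch: print the environment a batch job actually sees.
# report_env is a hypothetical helper name used only for this post.
report_env() {
    echo "mpirun: $(command -v mpirun || echo 'not found')"
    echo "PATH: $PATH"
    echo "LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-unset}"
    # The next two are set by SGE only when the job runs inside a PE.
    echo "NSLOTS: ${NSLOTS:-unset}"
    echo "PE_HOSTFILE: ${PE_HOSTFILE:-unset}"
}

report_env
```

Comparing this output between the PE submission and the plain dev.q submission should show whether the two runs resolve the same mpirun and libraries.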
I would really appreciate it if anyone could share their experience!
Derrick