Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Try to submit OMPI job to SGE gives ERRORS (orte_plm_base_select failed & orte_ess_set_name failed)
From: Reuti (reuti_at_[hidden])
Date: 2011-04-15 04:49:23


On 15.04.2011 at 06:53, Derrick LIN wrote:

> I am trying to set up a small SGE cluster with Open MPI integrated, but I am totally stuck when trying to submit an Open MPI job to SGE's PE.
> I mainly followed the guide sge-snow.pdf from Revolutions Computing and

- What does your SGE configuration (`qconf -sconf`) look like?

> <snip>
> For troubleshooting I have done several things below:
> 1) Passwordless SSH has been configured properly for the execution hosts and the queue master.
> pwbcad_at_sgeqmast01:~$ ssh sgeqexec01 uptime
> 14:35:54 up 2:47, 1 user, load average: 0.10, 0.08, 0.02

a) You are testing from the master to a node, but parallel jobs make their connections between the nodes.

b) Unless you need X11 forwarding, SGE's -builtin- communication works fine. This way you can have a cluster without `rsh` or `ssh` (or with them limited to admin staff) and still run parallel jobs.
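
If the setup does rely on `ssh` for starting remote tasks, point a) can be checked with something like the sketch below. It only prints one login check per ordered pair of hosts and does not execute anything itself. The function name `node_pair_checks` is made up for illustration, and `sgeqexec02` is an assumed second execution host that does not appear in the original post.

```shell
# Print one "ssh SRC ssh DST uptime" check per ordered pair of hosts.
# Host names are passed as arguments; nothing is executed here.
node_pair_checks() {
  for src in "$@"; do
    for dst in "$@"; do
      if [ "$src" != "$dst" ]; then
        echo "ssh $src ssh $dst uptime"
      fi
    done
  done
}

# sgeqexec02 is a hypothetical second execution host.
node_pair_checks sgeqexec01 sgeqexec02
```

Run from the master, each printed line logs in to one node and from there to another. Piping the output to `sh` would execute the checks. With the -builtin- communication from point b) this test should not be needed.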

> 2) I could run an Open MPI job outside SGE successfully.
> mpirun -host n1,n2 -np 8 ./ompi_job
> 3) I submitted the job to a queue directly instead of a PE; the job ran and completed successfully.
> qsub -q dev.q ./

Then you are bypassing SGE’s slot allocation and will have wrong accounting and no job control of the slave tasks.
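
For comparison, a submission that goes through a PE could look roughly like the sketch below, which only writes a job script to the current directory. The PE name `orte`, the job name `ompi_test`, the slot count and the file name `ompi_job.sh` are placeholders, not taken from the original post; the PE must already exist and be attached to the queue.

```shell
# Write a minimal job script (all names are placeholders).
cat > ompi_job.sh <<'EOF'
#!/bin/sh
#$ -N ompi_test
#$ -cwd
#$ -pe orte 8
# SGE sets NSLOTS to the number of slots granted to the job.
mpirun -np "$NSLOTS" ./ompi_job
EOF

# Submit it on the cluster with:
#   qsub ompi_job.sh
```

Requesting the slots with `-pe` lets SGE do the allocation, so the slave tasks are accounted for and stay under job control.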

-- Reuti

> 4) Although I don't think PATH and LD_LIBRARY_PATH would cause issues on Ubuntu, I still added the Open MPI binaries and libraries to both. But it didn't help.
> It would be much appreciated if anyone could share their experience!
> Derrick