On 15.04.2011 at 06:53, Derrick LIN wrote:
> I am trying to set up a small SGE cluster with OpenMPI integrated, but I am totally stuck when trying to submit an OpenMPI job to SGE's PE.
> I mainly followed the guide sge-snow.pdf from Revolutions Computing and http://idolinux.blogspot.com/2010/04/quick-install-of-open-mpi-with-grid.html
- What is your SGE configuration, i.e. the output of `qconf -sconf`?
> For troubleshooting I have done several things below:
> 1) Passwordless SSH has been configured properly for the execution hosts and the queue master.
> pwbcad@sgeqmast01:~$ ssh sgeqexec01 uptime
> 14:35:54 up 2:47, 1 user, load average: 0.10, 0.08, 0.02
a) You are testing from the master to a node, but the tasks of a parallel job are started between the nodes, so the login has to work from node to node as well.
b) Unless you need X11 forwarding, SGE's -builtin- communication works fine. This way you can have a cluster without `rsh` or `ssh` (or with them limited to admin staff) and can still run parallel jobs.
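
For reference, with the builtin method the remote startup entries in the cluster configuration (shown by `qconf -sconf`, edited with `qconf -mconf`) would look roughly like the fragment below. This is only a sketch and assumes SGE 6.2 or later, where the builtin method is available; please compare it with your own output:

```text
qlogin_command               builtin
qlogin_daemon                builtin
rlogin_command               builtin
rlogin_daemon                builtin
rsh_command                  builtin
rsh_daemon                   builtin
```

If any of these still point to `ssh` or `rsh` wrappers, the slave tasks are started through those programs instead.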
> 2) I could run an OpenMPI job outside of SGE successfully.
> mpirun -host n1, n2 -np 8 ./ompi_job
> 3) I submitted the job to a queue directly instead of a PE; the job ran and completed successfully.
> qsub -q dev.q ./ompi_job.sh
Then you are bypassing SGE's slot allocation, and you will get wrong accounting and no job control of the slave tasks.
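
To get the slots granted by SGE, the job has to request a PE. A PE for a tight Open MPI integration usually looks something like the sketch below (as printed by `qconf -sp <pe_name>`). The PE name `orte`, the slot count and the allocation rule are only assumptions for illustration; yours may differ. The important entries are `control_slaves TRUE` and `job_is_first_task FALSE`:

```text
pe_name            orte
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
```

The PE must also be attached to the queue (`pe_list` in `qconf -mq dev.q`). The job would then be submitted with something like `qsub -pe orte 8 ./ompi_job.sh`, and inside the script a plain `mpirun -np $NSLOTS ./ompi_job` should be sufficient, provided Open MPI was compiled with `--with-sge` (you can check with `ompi_info | grep gridengine`).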
> 4) Although I don't think PATH and LD_LIBRARY_PATH would cause issues on Ubuntu, I still added the OpenMPI binaries and libraries to both. But it didn't help.
> It would be much appreciated if anyone could share their experience!