
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] openmpi 1.3 and gridengine tight integration problem
From: Rene Salmon (salmr0_at_[hidden])
Date: 2009-03-18 11:52:46


>
> At this FAQ, we show an example of a parallel environment setup.
> http://www.open-mpi.org/faq/?category=running#run-n1ge-or-sge
>
> I am wondering if control_slaves needs to be TRUE.
> And double-check that the PE (pavtest) is on the list for the queue
> (also mentioned in the FAQ). And perhaps start by trying to run
> hostname first.
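For reference, a parallel environment along the lines of the FAQ example might look like the following (shown in `qconf -sp` output format; the PE name `pavtest` is the one from this thread, and the slot count and other values are illustrative, not a site-specific recommendation):

```shell
# Hypothetical output of: qconf -sp pavtest
pe_name            pavtest
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
```

With control_slaves set to TRUE, SGE launches the remote Open MPI daemons under qrsh, so they are accounted for and controlled by the scheduler, which is what "tight integration" means here.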

Changing control_slaves to true did not make things work, but it did
provide a bit more info on this -- enough to figure things out.
Now when I run, I get a message about "rcmd: socket: Permission denied":

Starting server daemon at host "hpcp7782"
Server daemon successfully started with task id "1.hpcp7782"
Establishing /hpc/SGE/utilbin/lx24-amd64/rsh session to host
hpcp7782 ...
rcmd: socket: Permission denied
/hpc/SGE/utilbin/lx24-amd64/rsh exited with exit code 1
reading exit code from shepherd ... timeout (60 s) expired while waiting
on socket fd 4
error: error reading returncode of remote command
--------------------------------------------------------------------------
A daemon (pid 31961) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have
the location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
mpirun: clean termination accomplished

[hpcp7781:31960] mca: base: close: component rsh closed
[hpcp7781:31960] mca: base: close: unloading component rsh

So it turns out the NFS mount for SGE on the clients had the "nosuid"
option set, which does not allow the qrsh/rsh SGE binaries to run
because they are setuid. Got rid of the "nosuid" and now things work
just fine.
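The failure mode is worth spelling out: on a filesystem mounted with "nosuid", the kernel ignores the setuid bit on executables, so SGE's rsh wrapper cannot gain the root privilege it needs to bind a reserved port for rcmd(), hence "rcmd: socket: Permission denied". A sketch of how one might check both sides (the grep pattern is illustrative, and the temp file stands in for the real /hpc/SGE/utilbin/.../rsh binary):

```shell
#!/bin/sh
# Look for "nosuid" in the option list of the mount carrying SGE
# (on the cluster in this thread, the NFS mount holding /hpc/SGE):
grep nfs /proc/mounts || true

# The setuid bit shows up as 's' in the owner-execute position.
# Demonstrated on a throwaway file rather than the real rsh binary:
f=$(mktemp)
chmod 4755 "$f"        # leading 4 sets the setuid bit
stat -c '%A' "$f"      # prints -rwsr-xr-x
rm -f "$f"
```

Remounting without "nosuid" (or removing the option from /etc/fstab and remounting) lets the setuid bit take effect again, which is the fix described above.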

Thank you for the help

Rene