Ray Muno wrote:
  
We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
Scheduling is done through SGE.  MPI communication is over InfiniBand.

    

We also have OpenMPI 1.3 installed and receive similar errors.

  
This does sound like a problem with SGE. By default, we use qrsh to start the jobs on all the remote nodes, and I believe that is the command that is failing. Depending on your version of Open MPI, there are two things you can try. With version 1.2, add this to the mpirun command line to get more information:

--mca pls_gridengine_verbose 1
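
For illustration, here is roughly how that flag would sit on an mpirun line inside an SGE job script. This is only a sketch: ./my_mpi_app is a made-up executable name, and the fallback slot count of 4 is an assumption so the snippet also works outside SGE (inside a parallel-environment job, SGE sets NSLOTS itself). It prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: ./my_mpi_app is a hypothetical executable name.
# NSLOTS is set by SGE inside a parallel-environment job; fall back to 4
# (an arbitrary assumption) when running this outside SGE.
NSLOTS="${NSLOTS:-4}"

# Open MPI 1.2: ask the gridengine launcher for verbose output.
CMD="mpirun -np $NSLOTS --mca pls_gridengine_verbose 1 ./my_mpi_app"

# Dry run: print the command. Paste the printed line into the job script to use it.
echo "$CMD"
```

The extra output from the launcher should end up in the job's normal stdout/stderr files.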

With Open MPI 1.3.2 and later, the verbose flag will not help. Instead, you can disable the use of qrsh and have rsh/ssh start the remote jobs:

--mca plm_rsh_disable_qrsh 1
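
As a rough sketch, the same kind of mpirun line with qrsh disabled might look like this. Again, ./my_mpi_app is a made-up executable name and the fallback slot count of 4 is an assumption for running outside SGE. The snippet only prints the command. Note that launching over rsh/ssh presumably requires that the compute nodes accept logins from each other without a password prompt.

```shell
#!/bin/sh
# Sketch only: ./my_mpi_app is a hypothetical executable name.
# NSLOTS is set by SGE inside a parallel-environment job; fall back to 4
# (an arbitrary assumption) when running this outside SGE.
NSLOTS="${NSLOTS:-4}"

# Open MPI 1.3.2 and later: bypass qrsh and launch remote daemons via rsh/ssh.
CMD="mpirun -np $NSLOTS --mca plm_rsh_disable_qrsh 1 ./my_mpi_app"

# Dry run: print the command. Paste the printed line into the job script to use it.
echo "$CMD"
```

If the job starts cleanly this way, that would point at the qrsh step rather than at Open MPI or InfiniBand.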

Trying one or both of these might provide some extra clues.

Rolf




-- 

=========================
rolf.vandevaart@sun.com
781-442-3043
=========================