Ray Muno wrote:
> Ray Muno wrote:
>> We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
>> Scheduling is done through SGE. MPI communication is over InfiniBand.
> We also have OpenMPI 1.3 installed and receive similar errors.
This does sound like a problem with SGE. By default, we use qrsh to
start the jobs on all the remote nodes, and I believe that is the
command that is failing. There are two things you can try, depending
on your version of Open MPI. With version 1.2, you can add this option
to the mpirun command line to get more information:
--mca pls_gridengine_verbose 1
With Open MPI 1.3.2 and later, the verbose flag will not help.
Instead, you can disable the use of qrsh and use rsh/ssh to start
the remote jobs:
--mca plm_rsh_disable_qrsh 1
Trying one or both of these might provide some extra clues.
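
For reference, here is a rough sketch of how the two options might be
placed on an mpirun command line. The helper name launch_cmd, the
process count 16, and the program ./my_mpi_app are made up for
illustration, and mapping everything other than 1.2 to the qrsh option
is an assumption. The helper only prints the command (a dry run), so
nothing is launched.

```shell
#!/bin/sh
# launch_cmd is a hypothetical helper, not part of Open MPI or SGE.
# $1 = Open MPI version, $2 = process count, $3 = program to run
launch_cmd() {
  case "$1" in
    # 1.2 series: ask the gridengine launcher for verbose output
    1.2*) mca="--mca pls_gridengine_verbose 1" ;;
    # assumed for 1.3.x: skip qrsh and start remote jobs via rsh/ssh
    *)    mca="--mca plm_rsh_disable_qrsh 1" ;;
  esac
  # Print the command instead of running it
  echo "mpirun -np $2 $mca $3"
}

launch_cmd 1.2 16 ./my_mpi_app
launch_cmd 1.3.2 16 ./my_mpi_app
```

Inside an SGE job script you would run the printed command directly
rather than echoing it. Note that with qrsh disabled, the compute nodes
need to accept rsh/ssh logins from the node where mpirun starts.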