Ray Muno wrote:
Rolf Vandevaart wrote:
  
Ray Muno wrote:
    
Ray Muno wrote:
  
      
We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
Scheduling is done through SGE.  MPI communication is over InfiniBand.

    
        
We also have OpenMPI 1.3 installed and receive similar errors.

  
      
This does sound like a problem with SGE.  By default, we use qrsh to
start the jobs on all the remote nodes, and I believe that is the
command that is failing.  There are two things you can try, depending
on the version of Open MPI.  With version 1.2, you can pass this
option to mpirun to get more information.

--mca pls_gridengine_verbose 1

    
This did not look like it gave me any more info.

  
With Open MPI 1.3.2 and later, the verbose flag will not help.
Instead, you can disable the use of qrsh and use rsh/ssh to start
the remote jobs.

--mca plm_rsh_disable_qrsh 1
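
As a rough sketch of the version split described above, the shell function below prints whichever of the two options applies to a given Open MPI version string. The function name suggest_mca_flag is hypothetical, and the mapping only encodes what is stated in this thread (1.2 versus 1.3.2 and later); whether either parameter exists in other releases is an assumption to check with ompi_info.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): print the MCA option
# suggested in this thread for a given Open MPI version string.
suggest_mca_flag() {
    # Split "major.minor.patch" on dots; missing fields default to 0.
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    major=${1:-0}
    minor=${2:-0}
    patch=${3:-0}

    # Drop a non-numeric suffix such as "rc1" from the patch field.
    patch=${patch%%[!0-9]*}
    patch=${patch:-0}

    # Refuse version strings whose major/minor parts are not numeric.
    case "$major$minor" in
        *[!0-9]*) return 2 ;;
    esac

    if [ "$major" -eq 1 ] && [ "$minor" -eq 2 ]; then
        # 1.2 series: verbose output from the gridengine launcher.
        echo "--mca pls_gridengine_verbose 1"
    elif [ "$major" -gt 1 ] ||
         { [ "$major" -eq 1 ] && [ "$minor" -gt 3 ]; } ||
         { [ "$major" -eq 1 ] && [ "$minor" -eq 3 ] && [ "$patch" -ge 2 ]; }; then
        # 1.3.2 and later: bypass qrsh and launch over rsh/ssh.
        # Assumption: newer releases still accept this parameter.
        echo "--mca plm_rsh_disable_qrsh 1"
    else
        # The thread gives no advice for other versions (e.g. 1.3.0, 1.3.1).
        return 1
    fi
}
```

For example, suggest_mca_flag 1.3.2 prints the plm_rsh_disable_qrsh option, which can then be added to the mpirun command line inside the SGE job script.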

    

That gives me

PMGR_COLLECTIVE ERROR: unitialized MPI task: Missing required
environment variable: MPIRUN_RANK
PMGR_COLLECTIVE ERROR: PMGR_COLLECTIVE ERROR: unitialized MPI task:
Missing required environment variable: MPIRUN_RANK
  
I do not recognize these errors as part of Open MPI.  A Google search
showed they might be coming from MVAPICH.  Is there a chance we are
using Open MPI to launch the jobs (via Open MPI mpirun) but are
actually launching an application that is linked against MVAPICH?
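
One way to check this is to look at which shared libraries the application binary pulls in. The sketch below classifies the output of ldd by library name. The function name classify_mpi_libs is hypothetical, and the library names are assumptions: Open MPI builds typically link libopen-rte and libopen-pal, while MPICH-derived stacks such as MVAPICH typically link libmpich. A statically linked MPI will not show up in ldd at all.

```shell
#!/bin/sh
# Hypothetical helper: read ldd output on stdin and guess the MPI family.
# The library-name patterns are assumptions, not an exhaustive list.
classify_mpi_libs() {
    libs=$(cat)
    case "$libs" in
        # MPICH-derived libraries, which is what MVAPICH would link.
        *libmpich*|*libmvapich*) echo "mpich-family" ;;
        # Open MPI runtime support libraries.
        *libopen-rte*|*libopen-pal*) echo "openmpi" ;;
        # Nothing recognizable, or MPI was linked statically.
        *) echo "unknown" ;;
    esac
}
```

Running something like ldd ./my_app | classify_mpi_libs (with ./my_app standing in for the real binary) and comparing the result with the mpirun found by which mpirun should show whether the launcher and the application come from the same MPI installation.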

--

=========================
rolf.vandevaart@sun.com
781-442-3043
=========================