Ray Muno wrote:
We are running a cluster using Rocks 5.0 and OpenMPI 1.2 (primarily).
Scheduling is done through SGE. MPI communication is over InfiniBand.
We also have OpenMPI 1.3 installed and receive similar errors.

This does sound like a problem with SGE. By default, we use qrsh to
start the jobs on all the remote nodes. I believe that is the command
that is failing. Depending on your version of Open MPI, there are two
things you can try to get more info. With version 1.2, you can try
this: