Hi there,

 

Recently, I’ve begun some calculations on a cluster where I submit a multi-node job to the Torque batch system, and the job executes multiple single-node parallel tasks.  That is, each task is intended to use OpenMPI parallelism within its own node, with no parallelism across nodes.

 

Some background: The program being executed is Q-Chem 4.0.  I use OpenMPI 1.4.2 because Q-Chem is notoriously difficult to compile, and this is the latest version of OpenMPI that this version of Q-Chem is known to work with.

 

My jobs are failing with the error message below; I do not observe this error when submitting single-node jobs.  From reading the mailing list archives (http://www.open-mpi.org/community/lists/users/2010/03/12348.php), I believe OpenMPI is looking for a PBS node file somewhere.  Since each of my tasks is parallel only over the node it is running on, I believe that a node file of any kind is unnecessary.
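
To make the node-file suspicion concrete, here is a minimal diagnostic sketch that could be run on each node where a task starts. It assumes the node file of interest is the one Torque advertises through the `PBS_NODEFILE` environment variable; whether OpenMPI 1.4.2 reads that exact path is an assumption I have not verified. The function name `pbs_nodefile_status` is hypothetical.

```shell
# Hypothetical helper: report whether the Torque node file named by
# $PBS_NODEFILE is readable on the current host.
# Prints exactly one word: "unset", "missing", or "readable".
pbs_nodefile_status() {
    if [ -z "${PBS_NODEFILE:-}" ]; then
        echo "unset"
    elif [ -r "$PBS_NODEFILE" ]; then
        echo "readable"
    else
        echo "missing"
    fi
}

# Print one summary line for this host (uname -n gives the node name).
echo "$(uname -n): PBS_JOBID=${PBS_JOBID:-unset} nodefile=$(pbs_nodefile_status)"
```

If the first node of the allocation reports "readable" and the others report "missing", that would be consistent with the file open failures appearing only in multi-node jobs.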

 

My questions are: Why is OpenMPI behaving differently when I submit a multi-node job compared to a single-node job?  How does OpenMPI detect that it is running under a multi-node allocation?  Is there a way I can change OpenMPI’s behavior so it always thinks it’s running on a single node, regardless of the type of job I submit to the batch system?
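
As an illustration of the kind of override I have in mind, here is an untested sketch. It assumes that the errors come from OpenMPI's Torque ("tm") components, as the file names `ras_tm_module.c` and `plm_tm_module.c` suggest, and that excluding those components through MCA parameters makes OpenMPI fall back to its non-Torque defaults. The wrapper name `run_without_tm` is hypothetical.

```shell
# Hypothetical wrapper: run a command with the Torque (tm) components
# excluded from OpenMPI's resource allocation (ras) and process launch
# (plm) frameworks. OpenMPI reads MCA parameters from environment
# variables prefixed with OMPI_MCA_; a leading ^ means "exclude".
run_without_tm() {
    env OMPI_MCA_ras='^tm' OMPI_MCA_plm='^tm' "$@"
}

# Usage (placeholder program name and process count):
#   run_without_tm mpirun -np 8 ./my_program
```

The same exclusion can be given on the command line as `mpirun --mca ras ^tm --mca plm ^tm ...`; the environment-variable form is used here because Q-Chem invokes the MPI launcher itself, so I cannot easily edit the `mpirun` arguments.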

 

Thank you,

 

- Lee-Ping Wang (Postdoc in Dept. of Chemistry, Stanford University)

 

[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167