Open MPI User's Mailing List Archives

Subject: [OMPI users] Error launching single-node tasks from multiple-node job.
From: Lee-Ping Wang (leeping_at_[hidden])
Date: 2013-08-10 13:51:00


Hi there,

 

Recently, I've begun some calculations on a cluster where I submit a
multi-node job to the Torque batch system, and the job executes multiple
single-node parallel tasks. That is to say, each task is intended to use
OpenMPI parallelism within its own node, with no parallelism across nodes.

 

Some background: The actual program being executed is Q-Chem 4.0. I use
OpenMPI 1.4.2 for this, because Q-Chem is notoriously difficult to compile
and this is the latest version of OpenMPI that this version of Q-Chem is
known to work with.

 

My jobs are failing with the error messages below; I do not observe this
error when submitting single-node jobs. From reading the mailing list
archives (http://www.open-mpi.org/community/lists/users/2010/03/12348.php),
I believe OpenMPI is looking for a PBS node file somewhere. Since each of my
tasks is parallel only over the node it runs on, I believe that a node file
of any kind is unnecessary.
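
To test the node-file guess, something like the following could be run on each compute node just before the task starts. It is only a sketch: it assumes the file in question is the one Torque names in the PBS_NODEFILE environment variable, and check_nodefile is a made-up helper name, not part of OpenMPI or Torque.

```python
import os


def check_nodefile(env=None):
    """Report whether the node file named by PBS_NODEFILE can be opened.

    Assumption: the "File open failure" refers to the file that Torque
    advertises through the PBS_NODEFILE environment variable.
    """
    env = os.environ if env is None else env
    path = env.get("PBS_NODEFILE")
    if path is None:
        return "PBS_NODEFILE is not set"
    try:
        with open(path) as handle:
            # One host name per line; blank lines are ignored.
            hosts = [line.strip() for line in handle if line.strip()]
    except OSError as exc:
        return f"cannot open {path}: {exc.strerror}"
    return f"{path} lists {len(hosts)} slot(s) on {len(set(hosts))} host(s)"


if __name__ == "__main__":
    print(check_nodefile())
```

If the variable is set but the file cannot be opened on the node where the task runs, that would be consistent with the errors shown at the end of this message.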

 

My questions are: Why is OpenMPI behaving differently when I submit a
multi-node job compared to a single-node job? How does OpenMPI detect that
it is running under a multi-node allocation? Is there a way I can change
OpenMPI's behavior so it always thinks it's running on a single node,
regardless of the type of job I submit to the batch system?
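
One experiment related to the last question is sketched below. It rests on an unconfirmed assumption: that the Torque support decides it is inside a batch job by looking at PBS_* environment variables. The helper names env_without_pbs and run_single_node are hypothetical, and the command passed in stands for whatever launches the single-node task.

```python
import os
import subprocess


def env_without_pbs(env=None):
    """Return a copy of the environment with every PBS_* variable removed.

    Assumption (not confirmed): hiding these variables makes the launcher
    behave as it does outside of a Torque allocation.
    """
    env = os.environ if env is None else env
    return {key: value for key, value in env.items() if not key.startswith("PBS_")}


def run_single_node(command):
    """Run `command` (a list of arguments) with the PBS_* variables hidden."""
    completed = subprocess.run(command, env=env_without_pbs(), check=False)
    return completed.returncode
```

If the task then starts normally inside a multi-node job, that would suggest the batch environment variables are what trigger the different behavior.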

 

Thank you,

 

- Lee-Ping Wang (Postdoc in Dept. of Chemistry, Stanford University)

 

[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
[compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
[compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
[compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
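
The log above repeats the same failure for several processes. A small script along these lines could condense it into one entry per source location. This is only a sketch: the regular expression is inferred from the lines shown here rather than from any documented log format, and summarize is a made-up name.

```python
import re

# Pattern inferred from the messages in this post (an assumption, not a
# documented format): "[host:pid] [[job,x],y] ORTE_ERROR_LOG: <error> in file <file> at line <n>"
PATTERN = re.compile(
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+)\]\s+\[\[\d+,\d+\],\d+\]\s+"
    r"ORTE_ERROR_LOG:\s+(?P<error>.+?)\s+in\s+file\s+(?P<file>\S+)\s+at\s+line\s+(?P<line>\d+)"
)


def summarize(log_text):
    """Group ORTE_ERROR_LOG entries by (file, line), in first-seen order.

    Returns a list of (file, line, sorted_pids) tuples.
    """
    # Collapse all whitespace so entries wrapped across lines still match.
    joined = " ".join(log_text.split())
    sites = {}
    for match in PATTERN.finditer(joined):
        key = (match.group("file"), int(match.group("line")))
        sites.setdefault(key, set()).add(int(match.group("pid")))
    return [(name, line, sorted(pids)) for (name, line), pids in sites.items()]


if __name__ == "__main__":
    import sys

    for name, line, pids in summarize(sys.stdin.read()):
        print(f"{name}:{line} reported by PIDs {pids}")
```

Fed the log on standard input, it lists each failing file and line once, along with the process IDs that reported it.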