Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error launching single-node tasks from multiple-node job.
From: Gustavo Correa (gus_at_[hidden])
Date: 2013-08-10 15:23:33


... from a (probably obsolete) Q-Chem user guide found on the Web:

***
" To run parallel Q-Chem
using a batch scheduler such as PBS, users may have to modify the
mpirun command in $QC/bin/parallel.csh
depending on whether or not the MPI implementation requires the
-machinefi le option to be given.
For further details users should read the $QC/README.Parallel le, and contact
Q-Chem if any problems are encountered (email: support_at_q-
chem.com).
Parallel users should also read the above section on using serial Q-Chem.
Users can also run Q-Chem's coupled-cluster calculations in parallel on multi-core architectures.
Please see section 5.12 for detail"
***

Guesses:
1) Q-Chem is launched by a set of scripts provided by Q-Chem.com or the like,
and the mpiexec command line is buried somewhere in those scripts,
not directly visible to the user. Right?

2) Look for the -machinefile switch in their script $QC/bin/parallel.csh
and replace it with
-hostfile $PBS_NODEFILE (a rough sketch of such an edit is shown below).
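
Just to illustrate (I don't have Q-Chem's parallel.csh at hand, so $MPIRUN,
$MACHFILE, $NCPUS, and $QCEXE below are made-up placeholders, not the script's
real variable names), the edit might look roughly like this:

# Hypothetical original line inside $QC/bin/parallel.csh:
#   $MPIRUN -machinefile $MACHFILE -np $NCPUS $QCEXE ...
# and what it would become, so mpirun reads the Torque node list instead:
#   $MPIRUN -hostfile $PBS_NODEFILE -np $NCPUS $QCEXE ...
#
# If the option really appears literally as "-machinefile <file>", a sed
# one-liner could make the swap (keeps a backup copy; check the result by eye):
sed -i.orig 's/-machinefile  *[^ ][^ ]*/-hostfile $PBS_NODEFILE/' $QC/bin/parallel.csh

The single quotes keep the shell from expanding $PBS_NODEFILE here, so the
literal string is written into the script and Torque's node list is picked up
at run time inside the job.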

My two cents,
Gus Correa

On Aug 10, 2013, at 3:03 PM, Gustavo Correa wrote:

> Hi Lee-Ping
>
> I know nothing about Q-Chem, but I was confused by these sentences:
>
> "That is to say, these tasks are intended to use OpenMPI parallelism on each node, but no parallelism across nodes. "
>
> "I do not observe this error when submitting single-node jobs."
>
> "Since my jobs are only parallel over the node they’re running on, I believe that a node file of any kind is unnecessary. "
>
> Are you trying to run MPI jobs across several nodes or inside a single node?
>
> ***
>
> Anyway, as far as I know,
> if your OpenMPI was compiled with Torque/PBS support, the mpiexec/mpirun command
> will look for the $PBS_NODEFILE to learn in which node(s) it should launch the MPI
> processes, regardless of whether you are using one node or more than one node.
>
> You didn't send your mpiexec command line (which would help), but assuming that
> Q-Chem allows some level of standard mpiexec command options, you could force
> passing the $PBS_NODEFILE to it.
>
> Something like this (for two nodes with 8 cores each):
>
> #PBS -q myqueue
> #PBS -l nodes=2:ppn=8
> #PBS -N myjob
> cd $PBS_O_WORKDIR
> ls -l $PBS_NODEFILE
> cat $PBS_NODEFILE
>
> mpiexec -hostfile $PBS_NODEFILE -np 16 ./my-Q-chem-executable <parameters to Q-chem>
>
> I hope this helps,
> Gus Correa
>
> On Aug 10, 2013, at 1:51 PM, Lee-Ping Wang wrote:
>
>> Hi there,
>>
>> Recently, I’ve begun some calculations on a cluster where I submit a multiple node job to the Torque batch system, and the job executes multiple single-node parallel tasks. That is to say, these tasks are intended to use OpenMPI parallelism on each node, but no parallelism across nodes.
>>
>> Some background: The actual program being executed is Q-Chem 4.0. I use OpenMPI 1.4.2 for this, because Q-Chem is notoriously difficult to compile and 1.4.2 is the last version of OpenMPI known to work with this version of Q-Chem.
>>
>> My jobs are failing with the error message below; I do not observe this error when submitting single-node jobs. From reading the mailing list archives (http://www.open-mpi.org/community/lists/users/2010/03/12348.php), I believe it is looking for a PBS node file somewhere. Since my jobs are only parallel over the node they’re running on, I believe that a node file of any kind is unnecessary.
>>
>> My question is: Why is OpenMPI behaving differently when I submit a multi-node job compared to a single-node job? How does OpenMPI detect that it is running under a multi-node allocation? Is there a way I can change OpenMPI’s behavior so it always thinks it’s running on a single node, regardless of the type of job I submit to the batch system?
>>
>> Thank you,
>>
>> - Lee-Ping Wang (Postdoc in Dept. of Chemistry, Stanford University)
>>
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167