
Subject: Re: [OMPI users] Error launching single-node tasks from multiple-node job.
From: Gustavo Correa (gus_at_[hidden])
Date: 2013-08-10 15:23:33


... from a (probably obsolete) Q-Chem user guide found on the Web:

***
" To run parallel Q-Chem
using a batch scheduler such as PBS, users may have to modify the
mpirun command in $QC/bin/parallel.csh
depending on whether or not the MPI implementation requires the
-machinefi le option to be given.
For further details users should read the $QC/README.Parallel le, and contact
Q-Chem if any problems are encountered (email: support_at_q-
chem.com).
Parallel users should also read the above section on using serial Q-Chem.
Users can also run Q-Chem's coupled-cluster calculations in parallel on multi-core architectures.
Please see section 5.12 for detail"
***

Guesses:
1) Q-Chem is launched by a set of scripts provided by Q-chem.com or the like,
and the mpiexec command line is buried somewhere in those scripts,
not directly visible to the user. Right?

2) Look for the -machinefile switch in their script $QC/bin/parallel.csh
and replace it with -hostfile $PBS_NODEFILE.
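
For example, the change might look like this (a sketch only; I don't have
the Q-Chem scripts at hand, so QCMACHINEFILE, NPROC, and QCEXE below are
hypothetical placeholder names, not actual Q-Chem variables):

  # before: hypothetical launch line in $QC/bin/parallel.csh
  mpirun -machinefile $QCMACHINEFILE -np $NPROC $QCEXE <parameters to Q-chem>
  # after: hand mpirun the node list that Torque provides
  mpirun -hostfile $PBS_NODEFILE -np $NPROC $QCEXE <parameters to Q-chem>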

My two cents,
Gus Correa

On Aug 10, 2013, at 3:03 PM, Gustavo Correa wrote:

> Hi Lee-Ping
>
> I know nothing about Q-Chem, but I was confused by these sentences:
>
> "That is to say, these tasks are intended to use OpenMPI parallelism on each node, but no parallelism across nodes. "
>
> "I do not observe this error when submitting single-node jobs."
>
> "Since my jobs are only parallel over the node they’re running on, I believe that a node file of any kind is unnecessary. "
>
> Are you trying to run MPI jobs across several nodes or inside a single node?
>
> ***
>
> Anyway, as far as I know,
> if your Open MPI was compiled with Torque/PBS support, the mpiexec/mpirun command
> will look for $PBS_NODEFILE to learn on which node(s) it should launch the MPI
> processes, regardless of whether you are using one node or more than one.
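>
> A quick way to check whether your Open MPI build has the Torque support
> compiled in is to grep ompi_info for the tm components (the exact output
> wording below is from my memory of 1.4.x, so treat it as approximate):
>
> ompi_info | grep tm
> # if Torque support is present, you should see lines like:
> #   MCA ras: tm (MCA v2.0, API v2.0, Component v1.4.2)
> #   MCA plm: tm (MCA v2.0, API v2.0, Component v1.4.2)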
>
> You didn't send your mpiexec command line (which would help), but assuming that
> Q-Chem accepts standard mpiexec command-line options, you could pass
> $PBS_NODEFILE to it explicitly.
>
> Something like this (for two nodes with 8 cores each):
>
> #PBS -q myqueue
> #PBS -l nodes=2:ppn=8
> #PBS -N myjob
> cd $PBS_O_WORKDIR
> ls -l $PBS_NODEFILE
> cat $PBS_NODEFILE
>
> mpiexec -hostfile $PBS_NODEFILE -np 16 ./my-Q-chem-executable <parameters to Q-chem>
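>
> If what you actually want is several independent single-node tasks (which is
> what your description suggests), a minimal sketch would be to give each
> mpiexec its own one-node hostfile carved out of $PBS_NODEFILE. Untested, and
> assuming a bash batch script with 8 cores per node:
>
> # one 8-way task per allocated node, each confined to its own node
> for node in $(sort -u $PBS_NODEFILE); do
>   echo "$node slots=8" > hostfile.$node
>   mpiexec -hostfile hostfile.$node -np 8 ./my-Q-chem-executable <parameters to Q-chem> &
> done
> wait  # keep the batch job alive until all per-node tasks finish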
>
> I hope this helps,
> Gus Correa
>
> On Aug 10, 2013, at 1:51 PM, Lee-Ping Wang wrote:
>
>> Hi there,
>>
>> Recently, I’ve begun some calculations on a cluster where I submit a multiple-node job to the Torque batch system, and the job executes multiple single-node parallel tasks. That is to say, these tasks are intended to use Open MPI parallelism on each node, but no parallelism across nodes.
>>
>> Some background: the actual program being executed is Q-Chem 4.0. I use Open MPI 1.4.2 because Q-Chem is notoriously difficult to compile, and 1.4.2 is the last Open MPI version known to work with this version of Q-Chem.
>>
>> My jobs are failing with the error message below; I do not observe this error when submitting single-node jobs. From reading the mailing list archives (http://www.open-mpi.org/community/lists/users/2010/03/12348.php), I believe it is looking for a PBS node file somewhere. Since my jobs are only parallel over the node they’re running on, I believe that a node file of any kind is unnecessary.
>>
>> My question is: why is Open MPI behaving differently when I submit a multi-node job compared to a single-node job? How does Open MPI detect that it is running under a multi-node allocation? Is there a way I can change Open MPI’s behavior so it always thinks it’s running on a single node, regardless of the type of job I submit to the batch system?
>>
>> Thank you,
>>
>> - Lee-Ping Wang (Postdoc in Dept. of Chemistry, Stanford University)
>>
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167