Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] Error launching single-node tasks from multiple-node job.
From: Gustavo Correa (gus_at_[hidden])
Date: 2013-08-10 20:36:07


Hi Lee-Ping

Yes, configuring with --without-tm, as Ralph told you to do,
will make your Open MPI independent of Torque, although, as Ralph said,
even with an Open MPI configured with Torque support you can override it at runtime.

I don't know what Open MPI uses the PBS_JOBID for, maybe some internal check,
but I would guess it will eventually use the PBS_NODEFILE as the list of nodes
that is passed to mpiexec under the hood.

I would just do your steps 3 and 4 below slightly differently.
I don't think you should change the PBS_NODEFILE environment variable,
as Torque may use it for other purposes (say, keep track of the nodes in use, etc).
[You may not be able to change it, but I haven't tried to.]

My suggestion is:

3&4) In the Q-Chem wrapper script, make sure mpirun is called with the command
line argument: -machinefile /scratch/leeping/pbs_nodefile.$HOSTNAME

This will leave the PBS_NODEFILE variable intact, and have the same net effect as
your workflow.
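For concreteness, a rough bash sketch of that setup (the machinefile is written to the current directory here rather than /scratch/leeping, and the mpirun line is only echoed, not executed):

```shell
#!/bin/bash
# Build a per-node machinefile with one hostname line per core,
# leaving PBS_NODEFILE untouched, then hand it to mpirun explicitly.
NCORES=24
MACHINEFILE="pbs_nodefile.$(hostname)"

# one line per core, which is the format -machinefile expects
for _ in $(seq 1 "$NCORES"); do
    hostname
done > "$MACHINEFILE"

# The real wrapper script would execute this; here it is just printed:
echo mpirun -machinefile "$MACHINEFILE" -np "$NCORES" qcprog.exe
```

The net effect is the same as editing PBS_NODEFILE, but Torque's own bookkeeping is left alone.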

Anyway, congratulations on sorting things out and making it work!

Gus Correa

On Aug 10, 2013, at 7:40 PM, Lee-Ping Wang wrote:

> Hi Ralph,
>
> Thank you. I didn't know that "--without-tm" was the correct configure
> option. I built and reinstalled OpenMPI 1.4.2, and now I no longer need to
> set PBS_JOBID for it to recognize the correct machine file. My current
> workflow is:
>
> 1) Submit a multiple-node batch job.
> 2) Launch a separate process on each node with "pbsdsh"; on each node,
> create a file called /scratch/leeping/pbs_nodefile.$HOSTNAME, which
> contains 24 instances of the hostname (since there are 24 cores).
> 3) Set $PBS_NODEFILE=/scratch/leeping/pbs_nodefile.$HOSTNAME.
> 4) In the Q-Chem wrapper script, make sure mpirun is called with the command
> line argument: -machinefile $PBS_NODEFILE
>
> Everything seems to work, thanks to your and Gus's help. I might report
> back if the jobs fail halfway through or if there is no speedup, but for
> now everything seems to be in place.
>
> - Lee-Ping
>
> -----Original Message-----
> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Ralph Castain
> Sent: Saturday, August 10, 2013 4:28 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Error launching single-node tasks from
> multiple-node job.
>
> It helps if you use the correct configure option: --without-tm
>
> Regardless, you can always deselect Torque support at runtime. Just put the
> following in your environment:
>
> OMPI_MCA_ras=^tm
>
> That will tell ORTE to ignore the Torque allocation module and it should
> then look at the machinefile.
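As a minimal bash illustration of Ralph's tip (the mpirun line is only printed here, and the machinefile name "mynodes" is a placeholder):

```shell
#!/bin/bash
# Exclude the Torque (tm) RAS component at runtime so ORTE ignores
# the PBS allocation and honors an explicit machinefile instead.
export OMPI_MCA_ras=^tm
echo "OMPI_MCA_ras=$OMPI_MCA_ras"

# Equivalent one-off form on the mpirun command line (printed, not run):
echo mpirun --mca ras ^tm -machinefile mynodes -np 24 ./a.out
```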
>
>
> On Aug 10, 2013, at 4:18 PM, "Lee-Ping Wang" <leeping_at_[hidden]> wrote:
>
>> Hi Gus,
>>
>> I agree that $PBS_JOBID should not point to a file in normal
>> situations, because it is the job identifier given by the scheduler.
>> However, ras_tm_module.c actually does search for a file named
>> $PBS_JOBID, and that seems to be why it was failing. You can see this
>> in the source code as well (look at ras_tm_module.c, I uploaded it to
>> https://dl.dropboxusercontent.com/u/5381783/ras_tm_module.c ). Once I
>> changed the $PBS_JOBID environment variable to the name of the node
>> file, things seemed to work - though I agree, it's not very logical.
>>
>> I doubt Q-Chem is causing the issue, because I was able to "fix"
>> things by changing $PBS_JOBID before Q-Chem is called. Also, I
>> provided the command line to mpirun in a previous email, where the
>> -machinefile argument correctly points to the custom machine file that
>> I created. The missing environment variables should not matter.
>>
>> The PBS_NODEFILE created by Torque is
>> /opt/torque/aux//272139.certainty.stanford.edu and it never gets
>> touched. I followed the advice in your earlier email and I created my
>> own node file on each node called
>> /scratch/leeping/pbs_nodefile.$HOSTNAME, and I set PBS_NODEFILE to
>> point to this file. However, this file does not get used either, even
>> if I include it on the mpirun command line, unless I set PBS_JOBID to
>> the file name.
>>
>> Finally, I was not able to build OpenMPI 1.4.2 without PBS support. I
>> used the configure flag --without-rte-support, but the build failed
>> halfway through.
>>
>> Thanks,
>>
>> - Lee-Ping
>>
>> leeping_at_certainty-a:~/temp$ qsub -I -q debug -l walltime=1:00:00 -l
>> nodes=1:ppn=12
>> qsub: waiting for job 272139.certainty.stanford.edu to start
>> qsub: job 272139.certainty.stanford.edu ready
>>
>> leeping_at_compute-140-4:~$ echo $PBS_NODEFILE
>> /opt/torque/aux//272139.certainty.stanford.edu
>>
>> leeping_at_compute-140-4:~$ cat $PBS_NODEFILE
>> compute-140-4
>> compute-140-4
>> compute-140-4
>> compute-140-4
>> compute-140-4
>> compute-140-4
>> compute-140-4
>> compute-140-4
>> compute-140-4
>> compute-140-4
>> compute-140-4
>> compute-140-4
>>
>> leeping_at_compute-140-4:~$ echo $PBS_JOBID
>> 272139.certainty.stanford.edu
>>
>> leeping_at_compute-140-4:~$ cat $PBS_JOBID
>> cat: 272139.certainty.stanford.edu: No such file or directory
>>
>> leeping_at_compute-140-4:~$ env | grep PBS
>> PBS_VERSION=TORQUE-2.5.3
>> PBS_JOBNAME=STDIN
>> PBS_ENVIRONMENT=PBS_INTERACTIVE
>> PBS_O_WORKDIR=/home/leeping/temp
>> PBS_TASKNUM=1
>> PBS_O_HOME=/home/leeping
>> PBS_MOMPORT=15003
>> PBS_O_QUEUE=debug
>> PBS_O_LOGNAME=leeping
>> PBS_O_LANG=en_US.iso885915
>> PBS_JOBCOOKIE=A27B00DAF72024CBEBB7CD3752BDBADC
>> PBS_NODENUM=0
>> PBS_NUM_NODES=1
>> PBS_O_SHELL=/bin/bash
>> PBS_SERVER=certainty.stanford.edu
>> PBS_JOBID=272139.certainty.stanford.edu
>> PBS_O_HOST=certainty-a.local
>> PBS_VNODENUM=0
>> PBS_QUEUE=debug
>> PBS_O_MAIL=/var/spool/mail/leeping
>> PBS_NUM_PPN=12
>> PBS_NODEFILE=/opt/torque/aux//272139.certainty.stanford.edu
>> PBS_O_PATH=/opt/intel/Compiler/11.1/064/bin/intel64:/opt/intel/Compiler/11.1/064/bin/intel64:/usr/local/cuda/bin:/home/leeping/opt/psi-4.0b5/bin:/home/leeping/opt/tinker/bin:/home/leeping/opt/cctools/bin:/home/leeping/bin:/home/leeping/local/bin:/home/leeping/opt/bin:/usr/kerberos/bin:/usr/java/latest/bin:/usr/local/bin:/bin:/usr/bin:/opt/ganglia/bin:/opt/ganglia/sbin:/opt/openmpi/bin/:/opt/maui/bin:/opt/torque/bin:/opt/torque/sbin:/opt/rocks/bin:/opt/rocks/sbin:/opt/sun-ct/bin:/home/leeping/bin
>>
>> -----Original Message-----
>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Gustavo
>> Correa
>> Sent: Saturday, August 10, 2013 3:58 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Error launching single-node tasks from
>> multiple-node job.
>>
>> Lee-Ping
>>
>> Something looks amiss.
>> PBS_JOBID contains the job name.
>> PBS_NODEFILE contains a list (with repetitions up to the number of
>> cores) of the nodes that torque assigned to the job.
>>
>> Why things got twisted is hard to tell. It may be something in the
>> Q-Chem scripts (could they be mixing up PBS_JOBID and PBS_NODEFILE?),
>> or it may be something else.
>> A more remote possibility is that the cluster has a Torque qsub wrapper
>> that produces the aforementioned confusion. Unlikely, but possible.
>>
>> To sort out, run any simple job (mpiexec -np 32 hostname), or even
>> your very Q-Chem job, but precede it with a bunch of printouts of the
>> PBS environment
>> variables:
>> echo $PBS_JOBID
>> echo $PBS_NODEFILE
>> ls -l $PBS_NODEFILE
>> cat $PBS_NODEFILE
>> cat $PBS_JOBID [this one should fail, because that is not a file, but
>> may work if the PBS variables were messed up along the way]
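Collected into one preamble, as a bash sketch (the defaults below are placeholders so it also runs outside of Torque):

```shell
#!/bin/bash
# Print the PBS environment before the real job runs, so mixed-up
# PBS variables show up immediately in the job's stdout.
PBS_JOBID=${PBS_JOBID:-272139.example}     # placeholder outside Torque
PBS_NODEFILE=${PBS_NODEFILE:-/dev/null}    # placeholder outside Torque

echo "PBS_JOBID=$PBS_JOBID"
echo "PBS_NODEFILE=$PBS_NODEFILE"
ls -l "$PBS_NODEFILE"
cat "$PBS_NODEFILE"
# Expected to fail: PBS_JOBID names a job, not a file.
cat "$PBS_JOBID" 2>/dev/null || echo "PBS_JOBID is not a file"
```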
>>
>> I hope this helps,
>> Gus Correa
>>
>> On Aug 10, 2013, at 6:39 PM, Lee-Ping Wang wrote:
>>
>>> Hi Gus,
>>>
>>> It seems the calculation is now working, or at least it didn't crash.
>>> I set the PBS_JOBID environment variable to the name of my custom
>>> node file. That is to say, I set
>>> PBS_JOBID=pbs_nodefile.compute-3-3.local.
>>> It appears that ras_tm_module.c is trying to open the file located at
>>> /scratch/leeping/$PBS_JOBID for some reason, and it is disregarding
>>> the machinefile argument on the command line.
>>>
>>> It'll be a few hours before I know for sure whether the job actually
>>> worked.
>>> I still don't know why things are structured this way, however.
>>>
>>> Thanks,
>>>
>>> - Lee-Ping
>>>
>>> -----Original Message-----
>>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Lee-Ping
>>> Wang
>>> Sent: Saturday, August 10, 2013 3:07 PM
>>> To: 'Open MPI Users'
>>> Subject: Re: [OMPI users] Error launching single-node tasks from
>>> multiple-node job.
>>>
>>> Hi Gus,
>>>
>>> I tried your suggestions. Here is the command line that executes
>>> mpirun.
>>> I was puzzled because it still reported a file open failure, so I
>>> inserted a print statement into ras_tm_module.c and recompiled. The
>>> results are below.
>>> As you can see, it tries to open a different file
>>> (/scratch/leeping/272055.certainty.stanford.edu) than the one I
>>> specified (/scratch/leeping/pbs_nodefile.compute-3-3.local).
>>>
>>> - Lee-Ping
>>>
>>> === mpirun command line ===
>>> /home/leeping/opt/openmpi-1.4.2-intel11-dbg/bin/mpirun -machinefile
>>> /scratch/leeping/pbs_nodefile.compute-3-3.local -x HOME -x PWD -x QC
>>> -x QCAUX -x QCCLEAN -x QCFILEPREF -x QCLOCALSCR -x QCPLATFORM -x
>>> QCREF -x QCRSH -x QCRUNNAME -x QCSCRATCH
>>> -np 24 /home/leeping/opt/qchem40/exe/qcprog.exe
>>> .B.in.28642.qcin.1 ./qchem28642/ >>B.out
>>>
>>> === Error message from compute node ===
>>> [compute-3-3.local:28666] Warning: could not find environment variable "QCLOCALSCR"
>>> [compute-3-3.local:28666] Warning: could not find environment variable "QCREF"
>>> [compute-3-3.local:28666] Warning: could not find environment variable "QCRUNNAME"
>>> Attempting to open /scratch/leeping/272055.certainty.stanford.edu
>>> [compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 155
>>> [compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>>> [compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>>> [compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>>> [compute-3-3.local:28666] [[56726,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>>>
>>> -----Original Message-----
>>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Lee-Ping
>>> Wang
>>> Sent: Saturday, August 10, 2013 12:51 PM
>>> To: 'Open MPI Users'
>>> Subject: Re: [OMPI users] Error launching single-node tasks from
>>> multiple-node job.
>>>
>>> Hi Gus,
>>>
>>> Thank you. You gave me many helpful suggestions, which I will try
>>> out and get back to you. I will provide more specifics (e.g. how my
>>> jobs were
>>> submitted) in a future email.
>>>
>>> As for the queue policy, that is a highly political issue because the
>>> cluster is a shared resource. My usual recourse is to use the batch
>>> system as effectively as possible within the confines of their
>>> policies. This is why it makes sense to submit a single
>>> multiple-node batch job, which then executes several independent
>>> single-node tasks.
>>>
>>> - Lee-Ping
>>>
>>> -----Original Message-----
>>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Gustavo
>>> Correa
>>> Sent: Saturday, August 10, 2013 12:39 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Error launching single-node tasks from
>>> multiple-node job.
>>>
>>> Hi Lee-Ping
>>> On Aug 10, 2013, at 3:15 PM, Lee-Ping Wang wrote:
>>>
>>>> Hi Gus,
>>>>
>>>> Thank you for your reply. I want to run MPI jobs inside a single
>>>> node, but due to the resource allocation policies on the clusters, I
>>>> could get many more resources if I submit multiple-node "batch jobs".
>>>> Once I have a multiple-node batch job, then I can use a command like
>>>> "pbsdsh" to run single node MPI jobs on each node that is allocated
>>>> to me. Thus, the MPI jobs on each node are running independently of
>>>> each other and unaware of one another.
>>>
>>> Even if you use pbsdsh to launch separate MPI jobs on individual
>>> nodes, you probably (not 100% sure about that) need to specify
>>> the -hostfile naming the specific node that each job will run on.
>>>
>>> I'm still quite confused because you didn't say what your "qsub"
>>> command looks like, what Torque script (if any) it is launching, etc.
>>>
>>>>
>>>> The actual call to mpirun is nontrivial to get, because Q-Chem has a
>>>> complicated series of wrapper scripts which ultimately calls mpirun.
>>>
>>> Yes, I just found this out on the Web. See my previous email.
>>>
>>>> If the
>>>> jobs are failing immediately, then I only have a small window to
>>>> view the actual command through "ps" or something.
>>>>
>>>
>>> Are you launching the jobs interactively?
>>> I.e., with the -I switch to qsub?
>>>
>>>
>>>> Another option is for me to compile OpenMPI without Torque / PBS
>>>> support.
>>>> If I do that, then it won't look for the node file anymore. Is this
>>>> correct?
>>>
>>> You will need to tell mpiexec where to launch the jobs.
>>> If I understand what you are trying to achieve (and I am not sure I
>>> do), one way to do it would be to programmatically split the
>>> $PBS_NODEFILE into several hostfiles, one per MPI job (so to speak)
>>> that you want to launch.
>>> Then use each of these nodefiles for each of the MPI jobs.
>>> Note that the PBS_NODEFILE has one line per-node-per-core, *not* one
>>> line per node.
>>> I have no idea how the trick above could be reconciled with the
>>> Q-Chem scripts, though.
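A bash sketch of that split (a demo nodefile stands in for $PBS_NODEFILE when run outside of Torque; the hostfile names are illustrative):

```shell
#!/bin/bash
# Split a PBS-style nodefile (one line per node per core) into one
# hostfile per distinct node, preserving the per-core repetition.
NODEFILE=${PBS_NODEFILE:-nodes.txt}

# demo input when run outside of Torque: two nodes, two cores each
[ -f "$NODEFILE" ] || printf 'node-a\nnode-a\nnode-b\nnode-b\n' > "$NODEFILE"

sort -u "$NODEFILE" | while read -r node; do
    grep -Fx "$node" "$NODEFILE" > "hostfile.$node"
done

# Each per-node MPI job would then get: mpiexec -hostfile hostfile.<node> ...
wc -l hostfile.*
```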
>>>
>>> Overall, I don't understand why you would benefit from such a
>>> complicated scheme, rather than launching either a big MPI job across
>>> all the nodes you requested (if the problem is large enough to
>>> benefit from that many cores), or several small single-node
>>> jobs (if the problem is small enough to fit well on a single node).
>>>
>>> You may want to talk to the cluster managers, because there must be a
>>> way to reconcile their queue policies with your needs (if this not
>>> already in place).
>>> We run tons of parallel single-node jobs here, for problems that fit
>>> well on a single node.
>>>
>>>
>>> My two cents
>>> Gus Correa
>>>
>>>>
>>>> I will try your suggestions and get back to you. Thanks!
>>>>
>>>> - Lee-Ping
>>>>
>>>> -----Original Message-----
>>>> From: users [mailto:users-bounces_at_[hidden]] On Behalf Of Gustavo
>>> Correa
>>>> Sent: Saturday, August 10, 2013 12:04 PM
>>>> To: Open MPI Users
>>>> Subject: Re: [OMPI users] Error launching single-node tasks from
>>>> multiple-node job.
>>>>
>>>> Hi Lee-Ping
>>>>
>>>> I know nothing about Q-Chem, but I was confused by these sentences:
>>>>
>>>> "That is to say, these tasks are intended to use OpenMPI parallelism
>>>> on
>>> each
>>>> node, but no parallelism across nodes. "
>>>>
>>>> "I do not observe this error when submitting single-node jobs."
>>>>
>>>> "Since my jobs are only parallel over the node they're running on, I
>>> believe
>>>> that a node file of any kind is unnecessary. "
>>>>
>>>> Are you trying to run MPI jobs across several nodes or inside a
>>>> single node?
>>>>
>>>> ***
>>>>
>>>> Anyway, as far as I know,
>>>> if your OpenMPI was compiled with Torque/PBS support, the
>>>> mpiexec/mpirun command will look for the $PBS_NODEFILE to learn in
>>>> which node(s) it should launch the MPI processes, regardless of
>>>> whether you are using one node or more than one node.
>>>>
>>>> You didn't send your mpiexec command line (which would help), but
>>>> assuming that Q-Chem allows some level of standard mpiexec command
>>>> options, you could force passing the $PBS_NODEFILE to it.
>>>>
>>>> Something like this (for two nodes with 8 cores each):
>>>>
>>>> #PBS -q myqueue
>>>> #PBS -l nodes=2:ppn=8
>>>> #PBS -N myjob
>>>> cd $PBS_O_WORKDIR
>>>> ls -l $PBS_NODEFILE
>>>> cat $PBS_NODEFILE
>>>>
>>>> mpiexec -hostfile $PBS_NODEFILE -np 16 ./my-Q-chem-executable
>>>> <parameters to Q-chem>
>>>>
>>>> I hope this helps,
>>>> Gus Correa
>>>>
>>>> On Aug 10, 2013, at 1:51 PM, Lee-Ping Wang wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> Recently, I've begun some calculations on a cluster where I submit a
>>>>> multiple-node job to the Torque batch system, and the job executes
>>>>> multiple single-node parallel tasks. That is to say, these tasks are
>>>>> intended to use OpenMPI parallelism on each node, but no parallelism
>>>>> across nodes.
>>>>>
>>>>> Some background: The actual program being executed is Q-Chem 4.0. I
>>>>> use OpenMPI 1.4.2 for this, because Q-Chem is notoriously difficult
>>>>> to compile and this is the last version of OpenMPI that this version
>>>>> of Q-Chem is known to work with.
>>>>>
>>>>> My jobs are failing with the error message below; I do not observe
>>>>> this error when submitting single-node jobs. From reading the
>>>>> mailing list archives
>>>>> (http://www.open-mpi.org/community/lists/users/2010/03/12348.php),
>>>>> I believe it is looking for a PBS node file somewhere. Since my jobs
>>>>> are only parallel over the node they're running on, I believe that a
>>>>> node file of any kind is unnecessary.
>>>>>
>>>>> My question is: Why is OpenMPI behaving differently when I submit a
>>>>> multi-node job compared to a single-node job? How does OpenMPI
>>>>> detect that it is running under a multi-node allocation? Is there a
>>>>> way I can change OpenMPI's behavior so it always thinks it's running
>>>>> on a single node, regardless of the type of job I submit to the
>>>>> batch system?
>>>>>
>>>>> Thank you,
>>>>>
>>>>> - Lee-Ping Wang (Postdoc in Dept. of Chemistry, Stanford University)
>>>>>
>>>>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>>>>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>>>>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 153
>>>>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>>>>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>>>>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 87
>>>>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>>>>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>>>>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 133
>>>>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>>>>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>>>>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 72
>>>>> [compute-1-1.local:10910] [[42010,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>>>>> [compute-1-1.local:10909] [[42009,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>>>>> [compute-1-1.local:10911] [[42011,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 167
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> users_at_[hidden]
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users