Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] torque pbs behaviour...
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-08-10 19:28:21


No problem - actually, that default works with any environment, not
just Torque
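
For example (just a sketch; MyProg stands in for whatever executable
you actually run), a script like

   #PBS -l nodes=2:ppn=8
   mpirun MyProg

gets a 16-slot allocation, and mpirun with no -np launches one process
per slot, i.e. 16 processes.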

On Aug 10, 2009, at 4:37 PM, Gus Correa wrote:

> Thank you for the correction, Ralph.
> I didn't know there was a (wise) default for the
> number of processes when using Torque-enabled OpenMPI.
>
> Gus Correa
>
> Ralph Castain wrote:
>> Just to correct something said here.
>>> You need to tell mpirun how many processes to launch,
>>> regardless of whether you are using Torque or not.
>> This is not correct. If you don't tell mpirun how many processes to
>> launch, we will automatically launch one process for every slot in
>> your allocation. In the case described here, there were 16 slots
>> allocated, so we would automatically launch 16 processes.
>> Ralph
>> On Aug 10, 2009, at 3:47 PM, Gus Correa wrote:
>>> Hi Jody, list
>>>
>>> See comments inline.
>>>
>>> Jody Klymak wrote:
>>>> On Aug 10, 2009, at 13:01 PM, Gus Correa wrote:
>>>>> Hi Jody
>>>>>
>>>>> We don't have Mac OS-X, but Linux, not sure if this applies to
>>>>> you.
>>>>>
>>>>> Did you configure your OpenMPI with Torque support,
>>>>> and point it to the same library that provides the Torque you
>>>>> are using (--with-tm=/path/to/torque-library-directory)?
>>>> Not explicitly. I'll check into that....
>>>
>>>
>>> 1) If you don't do it explicitly, configure will use the first
>>> libtorque it finds (assuming that one works),
>>> which may or may not be the one you want if you have more than one.
>>> If you only have one version of Torque installed,
>>> this shouldn't be the problem.
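>>>
>>> For example (just a sketch; /opt/torque below is a placeholder for
>>> whatever prefix your Torque installation actually lives under):
>>>
>>>   ./configure --with-tm=/opt/torque [your other configure options]
>>>   make all install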
>>>
>>> 2) Have you tried something very simple, like the examples/hello_c.c
>>> program, to test the Torque-OpenMPI integration?
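>>>
>>> For example (hello_c.c ships in the examples/ directory of the
>>> Open MPI source tree; this is just a sketch):
>>>
>>>   mpicc examples/hello_c.c -o hello_c
>>>
>>> then submit a one-line "mpirun ./hello_c" script with the same
>>> "#PBS -l nodes=2:ppn=8" request and check that 16 processes report
>>> in.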
>>>
>>> 3) Also, just in case, put a "cat $PBS_NODEFILE" inside your script,
>>> before mpirun, to see what it reports.
>>> For "#PBS -l nodes=2:ppn=8"
>>> it should show 16 lines: each node's name repeated 8 times.
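>>>
>>> For example (just a sketch of such a test script):
>>>
>>>   #PBS -l nodes=2:ppn=8
>>>   cat $PBS_NODEFILE
>>>   mpirun MyProg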
>>>
>>> 4) Finally, just to make sure the syntax is right.
>>> In your message you wrote:
>>>
>>> >>> If I submit openMPI with:
>>> >>> #PBS -l nodes=2:ppn=8
>>> >>> mpirun MyProg
>>>
>>> Is this the real syntax you used?
>>>
>>> Or was it perhaps:
>>>
>>> #PBS -l nodes=2:ppn=8
>>> mpirun -n 16 MyProg
>>>
>>> You need to tell mpirun how many processes to launch,
>>> regardless of whether you are using Torque or not.
>>>
>>> My $0.02
>>>
>>> Gus Correa
>>> ---------------------------------------------------------------------
>>> Gustavo Correa
>>> Lamont-Doherty Earth Observatory - Columbia University
>>> Palisades, NY, 10964-8000 - USA
>>> ---------------------------------------------------------------------
>>>
>>>
>>>>> Are you using the right mpirun? (There are so many out there.)
>>>> yeah - I use the explicit path and moved the OS X one.
>>>> Thanks! Jody
>>>>> Gus Correa
>>>>> ---------------------------------------------------------------------
>>>>> Gustavo Correa
>>>>> Lamont-Doherty Earth Observatory - Columbia University
>>>>> Palisades, NY, 10964-8000 - USA
>>>>> ---------------------------------------------------------------------
>>>>>
>>>>> Jody Klymak wrote:
>>>>>> Hi All,
>>>>>> I've been trying to get Torque/PBS to work on my OS X 10.5.7
>>>>>> cluster with openMPI (after finding that Xgrid was pretty flaky
>>>>>> about connections). I *think* this is an MPI problem (perhaps
>>>>>> via operator error!).
>>>>>> If I submit openMPI with:
>>>>>> #PBS -l nodes=2:ppn=8
>>>>>> mpirun MyProg
>>>>>> PBS locks off the two nodes (checked via "pbsnodes -a" and the
>>>>>> job output), but mpirun runs the whole job on the second of the
>>>>>> two nodes.
>>>>>> If I run the same job w/o qsub (i.e. using ssh)
>>>>>> mpirun -n 16 -host xserve01,xserve02 MyProg
>>>>>> it runs fine on all the nodes....
>>>>>> My /var/spool/torque/server_priv/nodes file looks like:
>>>>>> xserve01.local np=8
>>>>>> xserve02.local np=8
>>>>>> Any idea what could be going wrong or how to debug this
>>>>>> properly? There is nothing suspicious in the server or mom logs.
>>>>>> Thanks for any help,
>>>>>> Jody
>>>>>> --
>>>>>> Jody Klymak
>>>>>> http://web.uvic.ca/~jklymak/
>>>> --
>>>> Jody Klymak
>>>> http://web.uvic.ca/~jklymak/