Open MPI User's Mailing List Archives

Subject: Re: [OMPI users] torque pbs behaviour...
From: Jody Klymak (jklymak_at_[hidden])
Date: 2009-08-10 17:25:44


Hi Ralph,

On Aug 10, 2009, at 13:04, Ralph Castain wrote:

> Umm...are you saying that your $PBS_NODEFILE contains the following:

No, if I put cat $PBS_NODEFILE in the pbs script I get
xserve02.local
...
xserve02.local
xserve01.local
...
xserve01.local

each repeated 8 times. So that seems to be working....
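
(A quick way to summarize that file, assuming a bash job script and
that Torque has exported $PBS_NODEFILE:

   # print one line per host, prefixed with its slot count
   sort $PBS_NODEFILE | uniq -c

which here should print "8 xserve01.local" and "8 xserve02.local".)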

>> xserve01.local np=8
>> xserve02.local np=8
>
> If so, that could be part of the problem - it isn't the standard
> notation we are expecting to see in that file. What Torque normally
> provides is one line for each slot, so we would expect to see
> "xserve01.local" repeated 8 times, followed by "xserve02.local"
> repeated 8 times. Given the different syntax, we may not be parsing
> the file correctly. How was this file created?

The file I am referring to above is the $TORQUEHOME/server_priv/nodes
file, which I created by hand based on my understanding of the docs
at:

http://www.clusterresources.com/torquedocs/nodeconfig.shtml
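
(To confirm the server actually registered 8 slots per node, one
check, sketched here assuming Torque's client tools are on the PATH,
is to look at the np value that pbsnodes reports:

   # show each node's name, slot count, and state
   pbsnodes -a | egrep 'xserve|np =|state ='

Each node should report "np = 8" and, while idle, "state = free".)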

> Also, could you clarify what node mpirun is executing on?

mpirun seems to run only on xserve02.

The job I'm running is just creating a file:

#!/bin/bash

# Record which host this copy of the script landed on.
H=`hostname`
echo $H
sleep 10
# Write the host's load average to a file named after the host.
uptime >& $H.txt

In the stdout, the echo $H returns
"xserve02.local" 16 times, and only xserve02.local.txt gets created...

Again, if I run with "ssh" outside of pbs I get the expected response.
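
(One way to see what Open MPI thinks it was given, a sketch assuming
an Open MPI 1.3-era mpirun, is to add the display options inside the
PBS job:

   # print the parsed allocation and the process map before launch
   mpirun --display-allocation --display-map MyProg

If the displayed allocation lists only xserve02, then mpirun is not
reading the Torque allocation correctly.)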

Thanks, Jody

> Ralph
>
> On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:
>
>>
>> Hi All,
>>
>> I've been trying to get torque pbs to work on my OS X 10.5.7
>> cluster with openMPI (after finding that Xgrid was pretty flaky
>> about connections). I *think* this is an MPI problem (perhaps via
>> operator error!)
>>
>> If I submit openMPI with:
>>
>>
>> #PBS -l nodes=2:ppn=8
>>
>> mpirun MyProg
>>
>>
>> pbs locks off the two nodes (checked via "pbsnodes -a" and the
>> job output), but mpirun runs the whole job on the second of the
>> two nodes.
>>
>> If I run the same job w/o qsub (i.e. using ssh)
>> mpirun -n 16 -host xserve01,xserve02 MyProg
>> it runs fine on all the nodes....
>>
>> My /var/spool/torque/server_priv/nodes file looks like:
>>
>> xserve01.local np=8
>> xserve02.local np=8
>>
>>
>> Any idea what could be going wrong or how to debug this properly?
>> There is nothing suspicious in the server or mom logs.
>>
>> Thanks for any help,
>>
>> Jody
>>
>> --
>> Jody Klymak
>> http://web.uvic.ca/~jklymak/

--
Jody Klymak
http://web.uvic.ca/~jklymak/