Hi Ralph,
On Aug 10, 2009, at 13:04 PM, Ralph Castain wrote:
> Umm...are you saying that your $PBS_NODEFILE contains the following:
No, if I put "cat $PBS_NODEFILE" in the pbs script I get:
xserve02.local
...
xserve02.local
xserve01.local
...
xserve01.local
each repeated 8 times. So that seems to be working....
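A quicker way to confirm the slot counts than reading the raw listing is to tally duplicate lines. This is a minimal sketch, assuming the node file has one hostname per slot as shown above; the function name count_slots is made up for illustration.

```shell
# Summarize a Torque-style node file. Each line is one slot, so counting
# duplicate lines gives the number of slots per host.
count_slots() {
    # $1: path to the node file (inside a job this would be "$PBS_NODEFILE")
    sort "$1" | uniq -c | awk '{print $2, $1}'
}
```

Inside the job script this would be called as count_slots "$PBS_NODEFILE", and for the listing above it should report 8 slots for each of the two hosts.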
>> xserve01.local np=8
>> xserve02.local np=8
>
> If so, that could be part of the problem - it isn't the standard
> notation we are expecting to see in that file. What Torque normally
> provides is one line for each slot, so we would expect to see
> "xserve01.local" repeated 8 times, followed by "xserve02.local"
> repeated 8 times. Given the different syntax, we may not be parsing
> the file correctly. How was this file created?
The file I am referring to above is the $TORQUEHOME/server_priv/nodes
file, which I created by hand based on my understanding of the docs
at:
http://www.clusterresources.com/torquedocs/nodeconfig.shtml
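To make the two notations concrete: the server_priv/nodes file uses one "host np=N" line per node, while the per-job $PBS_NODEFILE has one line per slot. The sketch below expands the first form into the second for comparison. It assumes only the simple "name np=N" layout shown in this thread; the function name expand_nodes is hypothetical, and this is not how Torque itself generates the file.

```shell
# Expand "host np=N" lines (server_priv/nodes style) into one line per
# slot (PBS_NODEFILE style). Hosts without an np= field count as 1 slot.
expand_nodes() {
    # $1: path to a nodes file in "name np=N" form
    awk '{
        n = 1
        for (i = 2; i <= NF; i++)
            if ($i ~ /^np=[0-9]+$/) { split($i, kv, "="); n = kv[2] + 0 }
        for (j = 0; j < n; j++) print $1
    }' "$1"
}
```

For the two-line nodes file in this thread, the output would be 16 lines, 8 per host, which can be compared against what the job actually sees in $PBS_NODEFILE.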
> Also, could you clarify what node mpirun is executing on?
mpirun seems to only run on xserve02
The job I'm running is just creating a file:
#!/bin/bash
H=`hostname`
echo $H
sleep 10
uptime >& $H.txt
In the stdout, the echo $H returns
"xserve02.local" 16 times and only xserve02.local.txt gets created...
Again, if I run with "ssh" outside of pbs I get the expected response.
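Since the job behaves under ssh but not under pbs, one check that might help narrow this down is whether the Open MPI build includes its Torque components. This is only a sketch: it assumes those components appear under the name "tm" in ompi_info output, and it is not a confirmed diagnosis of this problem. The helper name has_tm_support is hypothetical.

```shell
# Succeeds if ompi_info-style text on stdin lists a component named "tm".
has_tm_support() {
    grep -Eq 'MCA [a-z]+: tm( |$)'
}

# Hypothetical usage on a machine with Open MPI installed:
#   ompi_info | has_tm_support && echo "tm components found"
```

If no tm component is listed, mpirun may not be talking to Torque directly, which would be worth mentioning when reporting the problem.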
Thanks, Jody
> Ralph
>
> On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:
>
>>
>> Hi All,
>>
>> I've been trying to get torque pbs to work on my OS X 10.5.7
>> cluster with openMPI (after finding that Xgrid was pretty flaky
>> about connections). I *think* this is an MPI problem (perhaps via
>> operator error!)
>>
>> If I submit openMPI with:
>>
>>
>> #PBS -l nodes=2:ppn=8
>>
>> mpirun MyProg
>>
>>
>> pbs locks off two of the nodes, checked via "pbsnodes -a" and
>> the job output. But mpirun runs the whole job on the second of the
>> two nodes.
>>
>> If I run the same job w/o qsub (i.e. using ssh)
>> mpirun -n 16 -host xserve01,xserve02 MyProg
>> it runs fine on all the nodes....
>>
>> My /var/spool/torque/server_priv/nodes file looks like:
>>
>> xserve01.local np=8
>> xserve02.local np=8
>>
>>
>> Any idea what could be going wrong or how to debug this properly?
>> There is nothing suspicious in the server or mom logs.
>>
>> Thanks for any help,
>>
>> Jody
>>
>>
>>
>>
>>
>> --
>> Jody Klymak
>> http://web.uvic.ca/~jklymak/
>>
>>
>>
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
--
Jody Klymak
http://web.uvic.ca/~jklymak/