
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] torque pbs behaviour...
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-08-10 17:39:40


On Aug 10, 2009, at 3:25 PM, Jody Klymak wrote:

> Hi Ralph,
>
> On Aug 10, 2009, at 1:04 PM, Ralph Castain wrote:
>
>> Umm...are you saying that your $PBS_NODEFILE contains the following:
>
> No, if I put cat $PBS_NODEFILE in the pbs script I get
> xserve02.local
> ...
> xserve02.local
> xserve01.local
> ...
> xserve01.local
>
> each repeated 8 times. So that seems to be working....

Good!
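
If you want to double-check the slot counts rather than eyeball the
list, a quick sanity check in the PBS script (just a sketch using
standard sort/uniq) is:

   sort $PBS_NODEFILE | uniq -c

which should report each hostname with a count of 8 for your
nodes=2:ppn=8 request.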

>
>
>>> xserve01.local np=8
>>> xserve02.local np=8
>>
>> If so, that could be part of the problem - it isn't the standard
>> notation we are expecting to see in that file. What Torque normally
>> provides is one line for each slot, so we would expect to see
>> "xserve01.local" repeated 8 times, followed by "xserve02.local"
>> repeated 8 times. Given the different syntax, we may not be parsing
>> the file correctly. How was this file created?
>
> The file I am referring to above is the $TORQUEHOME/server_priv/
> nodes file, which I created by hand based on my understanding of
> the docs at:
>
> http://www.clusterresources.com/torquedocs/nodeconfig.shtml

OMPI doesn't care about that file - only Torque looks at it.
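
To make the distinction concrete (a sketch based on your
nodes=2:ppn=8 request): the server_priv/nodes file uses Torque's
np= syntax, while the per-job $PBS_NODEFILE that OMPI actually
parses lists one line per slot:

   # $TORQUEHOME/server_priv/nodes -- read only by Torque
   xserve01.local np=8
   xserve02.local np=8

   # $PBS_NODEFILE for the job -- what OMPI reads
   xserve01.local
   ...            (repeated 8 times)
   xserve02.local
   ...            (repeated 8 times)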

>
>
>> Also, could you clarify what node mpirun is executing on?
>
> mpirun seems to only run on xserve02
>
> The job I'm running is just creating a file:
>
> #!/bin/bash
>
> H=`hostname`
> echo $H
> sleep 10
> uptime >& $H.txt
>
> In the stdout, the echo $H returns
> "xserve02.local" 16 times and only xsever02.local.txt gets created...
>
> Again, if I run with "ssh" outside of pbs I get the expected response.

Try running:

mpirun --display-allocation -pernode --display-map hostname

This will tell us what OMPI is seeing in terms of the nodes available
to it. Based on what you show above, it should see both of your nodes.
By forcing OMPI to put one proc/node, you'll be directing it to use
both nodes for the job. You should see this in the reported map.
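
For reference, a minimal submission script for that test (reusing
the same resource request from your original script) might look
like:

   #!/bin/bash
   #PBS -l nodes=2:ppn=8

   # what Torque handed the job
   cat $PBS_NODEFILE

   # what OMPI thinks it was allocated, and where it maps the procs
   mpirun --display-allocation -pernode --display-map hostname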

If we then see both procs run on the same node, I would suggest
reconfiguring OMPI with --enable-debug, and then rerunning the above
command with an additional flag:

-mca plm_base_verbose 5

which will show us all the ugly details of what OMPI is telling Torque
to do. Since OMPI works fine with Torque on Linux, my guess is that
there is something about the Torque build for Mac that is a little
different and is thus causing problems.
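
Roughly, that sequence would be (the install prefix here is just a
placeholder -- use whatever you configured with originally):

   # rebuild OMPI with debugging support
   ./configure --prefix=/opt/openmpi --enable-debug
   make all install

   # rerun the test with the launcher's verbose output turned on
   mpirun --display-allocation -pernode --display-map \
          -mca plm_base_verbose 5 hostname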

Ralph

>
>
> Thanks, Jody
>
>
>
>
>> Ralph
>>
>> On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:
>>
>>>
>>> Hi All,
>>>
>>> I've been trying to get Torque/PBS to work on my OS X 10.5.7
>>> cluster with Open MPI (after finding that Xgrid was pretty flaky
>>> about connections). I *think* this is an MPI problem (perhaps via
>>> operator error!)
>>>
>>> If I submit an Open MPI job with:
>>>
>>>
>>> #PBS -l nodes=2:ppn=8
>>>
>>> mpirun MyProg
>>>
>>>
>>> PBS locks off the two nodes (checked via "pbsnodes -a" and the
>>> job output), but mpirun runs the whole job on the second of the
>>> two nodes.
>>>
>>> If I run the same job w/o qsub (i.e. using ssh)
>>> mpirun -n 16 -host xserve01,xserve02 MyProg
>>> it runs fine on all the nodes....
>>>
>>> My /var/spool/torque/server_priv/nodes file looks like:
>>>
>>> xserve01.local np=8
>>> xserve02.local np=8
>>>
>>>
>>> Any idea what could be going wrong or how to debug this properly?
>>> There is nothing suspicious in the server or mom logs.
>>>
>>> Thanks for any help,
>>>
>>> Jody
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Jody Klymak
>>> http://web.uvic.ca/~jklymak/
>>>
>>>
>>>
>>>
>
> --
> Jody Klymak
> http://web.uvic.ca/~jklymak/
>
>
>
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users