
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] torque pbs behaviour...
From: Ralph Castain (rhc_at_[hidden])
Date: 2009-08-10 16:04:58


Umm...are you saying that your $PBS_NODEFILE contains the following:

> xserve01.local np=8
> xserve02.local np=8

If so, that could be part of the problem - it isn't the standard
notation we are expecting to see in that file. What Torque normally
provides is one line for each slot, so we would expect to see
"xserve01.local" repeated 8 times, followed by "xserve02.local"
repeated 8 times. Given the different syntax, we may not be parsing
the file correctly. How was this file created?
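
For illustration, here is a small sketch of the difference: it expands "host np=N" lines (the server_priv/nodes style quoted above) into the one-hostname-per-slot layout that Torque normally writes to $PBS_NODEFILE. The input file name "nodes" is assumed, and this is not something Open MPI does for you.

```shell
# Hypothetical sketch: turn "xserve01.local np=8" style lines into
# the slot-per-line layout expected in $PBS_NODEFILE.
# $2 is e.g. "np=8"; split on "=" and repeat the hostname that many times.
awk '{ split($2, a, "="); for (i = 1; i <= a[2]; i++) print $1 }' nodes
```

With the two quoted lines as input, this prints xserve01.local eight times followed by xserve02.local eight times, which is the layout Open MPI's Torque support parses.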

Also, could you clarify what node mpirun is executing on?

Ralph

On Aug 10, 2009, at 1:43 PM, Jody Klymak wrote:

>
> Hi All,
>
> I've been trying to get torque pbs to work on my OS X 10.5.7 cluster
> with openMPI (after finding that Xgrid was pretty flaky about
> connections). I *think* this is an MPI problem (perhaps via
> operator error!)
>
> If I submit openMPI with:
>
>
> #PBS -l nodes=2:ppn=8
>
> mpirun MyProg
>
>
> pbs locks off the two nodes, checked via "pbsnodes -a" and the
> job output. But mpirun runs the whole job on the second of the
> two nodes.
>
> If I run the same job w/o qsub (i.e. using ssh)
> mpirun -n 16 -host xserve01,xserve02 MyProg
> it runs fine on all the nodes....
>
> My /var/spool/torque/server_priv/nodes file looks like:
>
> xserve01.local np=8
> xserve02.local np=8
>
>
> Any idea what could be going wrong or how to debug this properly?
> There is nothing suspicious in the server or mom logs.
>
> Thanks for any help,
>
> Jody
>
> --
> Jody Klymak
> http://web.uvic.ca/~jklymak/
>
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users