Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] mpirun oddity w/ PBS on an SGI UV
From: Paul Hargrove (phhargrove_at_[hidden])
Date: 2014-01-31 18:13:39


Ralph,

As I said, this is NOT a cluster - it is a 4k-core shared-memory machine.
TORQUE is allocating cpus (time-shared mode, IIRC), not nodes.
So, there is always exactly one line in $PBS_NODEFILE.

The system runs as 2 partitions of 2k cores each.
So, the contents of $PBS_NODEFILE have exactly 2 possible values, each 1
line.

The values of PBS_PPN and PBS_NCPUS both reflect the size of the allocation.

At a minimum, shouldn't Open MPI be multiplying the number of lines in
$PBS_NODEFILE by the value of $PBS_PPN?
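
To make the proposal concrete, here is a minimal sketch of the arithmetic
being suggested. It is not Open MPI's actual allocation code; the function
name slots_from_pbs, the placeholder host name, and the fallback of one slot
per line when PBS_PPN is unset are all assumptions made for illustration.

```python
from collections import Counter


def slots_from_pbs(nodefile_lines, pbs_ppn=None):
    """Return {hostname: slots} for a PBS allocation.

    Hypothetical helper, not part of Open MPI. Each non-empty line of the
    node file counts once; if PBS_PPN is given, each line is worth that
    many slots.
    """
    ppn = int(pbs_ppn) if pbs_ppn else 1  # assumption: 1 slot per line if unset
    hosts = Counter(line.strip() for line in nodefile_lines if line.strip())
    return {host: count * ppn for host, count in hosts.items()}


# One line in the node file and PBS_PPN=16, as in the job described here.
# "uv-partition" is a placeholder, not the real contents of the node file.
print(slots_from_pbs(["uv-partition\n"], "16"))
```

On clusters where TORQUE already repeats each host once per requested
processor, multiplying by PBS_PPN would presumably overcount, so this rule
only fits the single-line case described here.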

Additionally, when I try "mpirun -npernode 16 ./ring_c" I am still told
there are not enough slots.
Shouldn't that work with 1 line in $PBS_NODEFILE?
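
One possible stopgap, untested on this system, is to write an explicit
hostfile that states the slot count and pass it to mpirun with --hostfile.
The sketch below builds such a file from the PBS environment. Whether mpirun
honors it inside a TM-launched job is an assumption, and the helper name
hostfile_text and the output file name my_hostfile are hypothetical.

```python
import os


def hostfile_text(nodefile_lines, ncpus):
    """Build hostfile lines of the form '<host> slots=<n>'.

    Hypothetical helper: gives every distinct host in the PBS node file
    `ncpus` slots, which only makes sense for a single-host allocation.
    """
    hosts = []
    for line in nodefile_lines:
        host = line.strip()
        if host and host not in hosts:  # keep first-seen order, drop repeats
            hosts.append(host)
    return "".join(f"{host} slots={int(ncpus)}\n" for host in hosts)


if __name__ == "__main__":
    # Inside a job, read $PBS_NODEFILE and $PBS_NCPUS, then try e.g.
    #   mpirun --hostfile my_hostfile -np 16 ./ring_c
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile:
        with open(nodefile) as f:
            text = hostfile_text(f, os.environ.get("PBS_NCPUS", "1"))
        with open("my_hostfile", "w") as out:
            out.write(text)
```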

-Paul

On Fri, Jan 31, 2014 at 2:47 PM, Ralph Castain <rhc_at_[hidden]> wrote:

> We read the nodes from the PBS_NODEFILE, Paul - can you pass that along?
>
> On Jan 31, 2014, at 2:33 PM, Paul Hargrove <phhargrove_at_[hidden]> wrote:
>
> I am trying to test the trunk on an SGI UV (to validate Nathan's port of
> btl:vader to SGI's variant of xpmem).
>
> At configure time, PBS's TM support was correctly located.
>
> My PBS batch script includes
> #PBS -l ncpus=16
> because that is what this installation requires (not nodes, mppnodes, or
> anything like that).
> One is allocating cpus on a large shared-memory machine, not a set of
> nodes in a cluster.
>
> However, this appears to be causing mpirun to think I have just 1 slot:
>
> + mpirun -np 2 ./ring_c
> --------------------------------------------------------------------------
> There are not enough slots available in the system to satisfy the 2 slots
> that were requested by the application:
> ./ring_c
>
> Either request fewer slots for your application, or make more slots available
> for use.
> --------------------------------------------------------------------------
>
> In case they contain useful info, here are the PBS env vars in the job:
>
> PBS_HT_NCPUS=32
> PBS_VERSION=TORQUE-2.3.13
> PBS_JOBNAME=qs
> PBS_ENVIRONMENT=PBS_BATCH
> PBS_HOME=/var/spool/torque
> PBS_O_WORKDIR=/usr/users/6/hargrove/SCRATCH/OMPI/openmpi-trunk-linux-x86_64-uv-trunk/BLD/examples
> PBS_PPN=16
> PBS_TASKNUM=1
> PBS_O_HOME=/usr/users/6/hargrove
> PBS_MOMPORT=15003
> PBS_O_QUEUE=debug
> PBS_O_LOGNAME=hargrove
> PBS_O_LANG=en_US.UTF-8
> PBS_JOBCOOKIE=9EEF5DF75FA705A241FEF66EDFE01C5B
> PBS_NODENUM=0
> PBS_O_SHELL=/usr/psc/shells/bash
> PBS_SERVER=tg-login1.blacklight.psc.teragrid.org
> PBS_JOBID=314827.tg-login1.blacklight.psc.teragrid.org
> PBS_NCPUS=16
> PBS_O_HOST=tg-login1.blacklight.psc.teragrid.org
> PBS_VNODENUM=0
> PBS_QUEUE=debug_r1
> PBS_O_MAIL=/var/mail/hargrove
> PBS_NODEFILE=/var/spool/torque/aux//314827.tg-login1.blacklight.psc.teragrid.org
> PBS_O_PATH=[...removed...]
>
> If any additional info is needed to help make mpirun "just work", please
> let me know.
>
> However, at this point I am mostly interested in any work-arounds that
> will let me run something other than a singleton on this system.
>
> -Paul
>
> --
> Paul H. Hargrove PHHargrove_at_[hidden]
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> _______________________________________________
> devel mailing list
> devel_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

-- 
Paul H. Hargrove                          PHHargrove_at_[hidden]
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900