Another possibility to check: it is entirely possible that Moab is miscommunicating the values to Slurm, so that is worth verifying on your end. I'll install a copy of Slurm 2.6.5 on my machines and see whether I get similar issues when Slurm does the allocation itself.
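
If you get the chance, comparing the two allocation paths side by side should show which layer is mangling the values (a rough sketch only; it assumes you are allowed to call salloc directly on that cluster):

# Moab path (what you already did):
$ msub -I -l nodes=3:ppn=8
$ env | grep -E '^SLURM_(NNODES|NPROCS|NTASKS|TASKS_PER_NODE|JOB_CPUS_PER_NODE)='

# Slurm doing the allocation itself, for comparison:
$ salloc -N 3 --ntasks-per-node=8
$ env | grep -E '^SLURM_(NNODES|NPROCS|NTASKS|TASKS_PER_NODE|JOB_CPUS_PER_NODE)='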

On Feb 12, 2014, at 7:47 AM, Ralph Castain <rhc@open-mpi.org> wrote:


On Feb 12, 2014, at 7:32 AM, Adrian Reber <adrian@lisas.de> wrote:


$ msub -I -l nodes=3:ppn=8
salloc: Job is in held state, pending scheduler release
salloc: Pending job allocation 131828
salloc: job 131828 queued and waiting for resources
salloc: job 131828 has been allocated resources
salloc: Granted job allocation 131828
sh-4.1$ echo $SLURM_TASKS_PER_NODE 
1
sh-4.1$ rpm -q slurm
slurm-2.6.5-1.el6.x86_64
sh-4.1$ echo $SLURM_NNODES 
1
sh-4.1$ echo $SLURM_JOB_NODELIST 
xxxx[107-108,176]
sh-4.1$ echo $SLURM_JOB_CPUS_PER_NODE 
8(x3)
sh-4.1$ echo $SLURM_NODELIST 
xxxx[107-108,176]
sh-4.1$ echo $SLURM_NPROCS  
1
sh-4.1$ echo $SLURM_NTASKS 
1
sh-4.1$ echo $SLURM_TASKS_PER_NODE 
1

The information in *_NODELIST seems to make sense, but all the other
variables (PROCS, TASKS, NODES) report '1', which seems wrong.
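
It might also be worth cross-checking against Slurm's own view of the job, to see whether only the envars are off (a sketch; the exact field names vary a little between Slurm versions):

# What Slurm's accounting says about this job, independent of the envars:
$ scontrol show job "$SLURM_JOB_ID" | tr ' ' '\n' | grep -E '^(NumNodes|NumCPUs|NodeList)='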

Indeed - and that's the problem. Slurm 2.6.5 is the most recent release, and my guess is that SchedMD once again has changed the @$!#%#@ meaning of their envars. Frankly, it is nearly impossible to track all the variants they have created over the years.

Please check whether someone did a little customizing of Slurm on your end, as sites sometimes do. It could also be that something in the Slurm config file is causing the changed behavior.

Meantime, I'll try to ponder a potential solution in case this really is the "latest" Slurm screwup.
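
For the record, the compressed format in SLURM_JOB_CPUS_PER_NODE and SLURM_TASKS_PER_NODE is a comma-separated list of count(xreps) entries, so one fallback I'm considering is rebuilding the per-node slot counts from SLURM_JOB_CPUS_PER_NODE whenever the task variables look bogus. Roughly this (a bash sketch of the idea only, not the actual ORTE code; the function name is made up):

# Expand Slurm's "count(xreps)" syntax: "8(x3)" -> "8 8 8", "4,2(x2)" -> "4 2 2"
expand_slurm_counts() {
    local entry out=""
    IFS=',' read -ra entries <<< "$1"
    for entry in "${entries[@]}"; do
        if [[ $entry =~ ^([0-9]+)\(x([0-9]+)\)$ ]]; then
            for ((i = 0; i < BASH_REMATCH[2]; i++)); do
                out+="${BASH_REMATCH[1]} "
            done
        else
            out+="$entry "
        fi
    done
    echo "$out"
}

expand_slurm_counts "$SLURM_JOB_CPUS_PER_NODE"   # -> 8 8 8 in your allocation
expand_slurm_counts "$SLURM_TASKS_PER_NODE"      # -> 1, the suspect value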




On Wed, Feb 12, 2014 at 07:19:54AM -0800, Ralph Castain wrote:
...and your version of Slurm?

On Feb 12, 2014, at 7:19 AM, Ralph Castain <rhc@open-mpi.org> wrote:

What is your SLURM_TASKS_PER_NODE?

On Feb 12, 2014, at 6:58 AM, Adrian Reber <adrian@lisas.de> wrote:

No, the system has only a few MOAB_* variables and many SLURM_*
variables:

$BASH                         $IFS                          $SECONDS                      $SLURM_PTY_PORT
$BASHOPTS                     $LINENO                       $SHELL                        $SLURM_PTY_WIN_COL
$BASHPID                      $LINES                        $SHELLOPTS                    $SLURM_PTY_WIN_ROW
$BASH_ALIASES                 $MACHTYPE                     $SHLVL                        $SLURM_SRUN_COMM_HOST
$BASH_ARGC                    $MAILCHECK                    $SLURMD_NODENAME              $SLURM_SRUN_COMM_PORT
$BASH_ARGV                    $MOAB_CLASS                   $SLURM_CHECKPOINT_IMAGE_DIR   $SLURM_STEPID
$BASH_CMDS                    $MOAB_GROUP                   $SLURM_CONF                   $SLURM_STEP_ID
$BASH_COMMAND                 $MOAB_JOBID                   $SLURM_CPUS_ON_NODE           $SLURM_STEP_LAUNCHER_PORT
$BASH_LINENO                  $MOAB_NODECOUNT               $SLURM_DISTRIBUTION           $SLURM_STEP_NODELIST
$BASH_SOURCE                  $MOAB_PARTITION               $SLURM_GTIDS                  $SLURM_STEP_NUM_NODES
$BASH_SUBSHELL                $MOAB_PROCCOUNT               $SLURM_JOBID                  $SLURM_STEP_NUM_TASKS
$BASH_VERSINFO                $MOAB_SUBMITDIR               $SLURM_JOB_CPUS_PER_NODE      $SLURM_STEP_TASKS_PER_NODE
$BASH_VERSION                 $MOAB_USER                    $SLURM_JOB_ID                 $SLURM_SUBMIT_DIR
$COLUMNS                      $OPTERR                       $SLURM_JOB_NODELIST           $SLURM_SUBMIT_HOST
$COMP_WORDBREAKS              $OPTIND                       $SLURM_JOB_NUM_NODES          $SLURM_TASKS_PER_NODE
$DIRSTACK                     $OSTYPE                       $SLURM_LAUNCH_NODE_IPADDR     $SLURM_TASK_PID
$EUID                         $PATH                         $SLURM_LOCALID                $SLURM_TOPOLOGY_ADDR
$GROUPS                       $POSIXLY_CORRECT              $SLURM_NNODES                 $SLURM_TOPOLOGY_ADDR_PATTERN
$HISTCMD                      $PPID                         $SLURM_NODEID                 $SRUN_DEBUG
$HISTFILE                     $PS1                          $SLURM_NODELIST               $TERM
$HISTFILESIZE                 $PS2                          $SLURM_NPROCS                 $TMPDIR
$HISTSIZE                     $PS4                          $SLURM_NTASKS                 $UID
$HOSTNAME                     $PWD                          $SLURM_PRIO_PROCESS           $_
$HOSTTYPE                     $RANDOM                       $SLURM_PROCID                 



On Wed, Feb 12, 2014 at 06:12:45AM -0800, Ralph Castain wrote:
Seems rather odd - since this is managed by Moab, you shouldn't be seeing SLURM envars at all. What you should see are PBS_* envars, including a PBS_NODEFILE that actually contains the allocation.
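
A quick way to see which integration is actually active in that interactive shell (nothing site-specific assumed):

$ env | grep -E '^(PBS|SLURM|MOAB)_' | sort
$ [ -n "$PBS_NODEFILE" ] && cat "$PBS_NODEFILE"

The second line prints the allocation if a PBS/Torque-style nodefile is present.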


On Feb 12, 2014, at 4:42 AM, Adrian Reber <adrian@lisas.de> wrote:

I tried the nightly snapshot (openmpi-1.7.5a1r30692.tar.gz) on a system
with Slurm and Moab. I requested an interactive session using:

msub -I -l nodes=3:ppn=8

and started a simple test case which fails:

$ mpirun -np 2 ./mpi-test 1
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2 slots 
that were requested by the application:
./mpi-test

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
srun: error: xxxx108: task 1: Exited with exit code 1
srun: Terminating job step 131823.4
srun: error: xxxx107: task 0: Exited with exit code 1
srun: Job step aborted
slurmd[xxxx108]: *** STEP 131823.4 KILLED AT 2014-02-12T13:30:32 WITH SIGNAL 9 ***


requesting only one core works:

$ mpirun  ./mpi-test 1
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 1: 0.000000


using openmpi-1.6.5 works with multiple cores:

$ mpirun -np 24 ./mpi-test 2
4.4.7 20120313 (Red Hat 4.4.7-4):Process 0 on xxxx106 out of 24: 0.000000
4.4.7 20120313 (Red Hat 4.4.7-4):Process 12 on xxxx106 out of 24: 12.000000
4.4.7 20120313 (Red Hat 4.4.7-4):Process 11 on xxxx108 out of 24: 11.000000
4.4.7 20120313 (Red Hat 4.4.7-4):Process 18 on xxxx106 out of 24: 18.000000

$ echo $SLURM_JOB_CPUS_PER_NODE 
8(x3)
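
In case it helps with the debugging, I can rerun the failing case with mpirun's allocation display turned on, and also try an explicit hostfile built from the nodelist. Just a sketch: the slots=8 is taken from the ppn=8 request, and the hostfile name is arbitrary.

$ mpirun --display-allocation -np 2 ./mpi-test 1
$ scontrol show hostnames "$SLURM_JOB_NODELIST" | awk '{ print $0 " slots=8" }' > my_hostfile
$ mpirun -hostfile my_hostfile -np 24 ./mpi-test 2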

I have never used Slurm before, so this could also be a user error on my
side. But since 1.6.5 works, it seems something has changed, and I wanted
to let you know in case it was not intentional.

Adrian

Adrian

-- 
Adrian Reber <adrian@lisas.de>            http://lisas.de/~adrian/
"Let us all bask in television's warm glowing warming glow." -- Homer Simpson

Adrian

-- 
Adrian Reber <adrian@lisas.de>            http://lisas.de/~adrian/
There's got to be more to life than compile-and-go.
_______________________________________________
devel mailing list
devel@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel