
Open MPI User's Mailing List Archives


Subject: Re: [OMPI users] SLURM and OpenMPI
From: Werner Augustin (Werner.Augustin_at_[hidden])
Date: 2008-03-27 08:42:02

On Thu, 20 Mar 2008 16:40:41 -0600
Ralph Castain <rhc_at_[hidden]> wrote:

> I am no slurm expert. However, it is our understanding that
> SLURM_TASKS_PER_NODE means the number of slots allocated to the job,
> not the number of tasks to be executed on each node. So the 4(x2)
> tells us that we have 4 slots on each of two nodes to work with. You
> got 4 slots on each node because you used the -N option, which told
> slurm to assign all slots on that node to this job - I assume you
> have 4 processors on your nodes. OpenMPI parses that string to get
> the allocation, then maps the number of specified processes against
> it.
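For illustration, the compressed "count(xreps)" notation described above can be expanded into one slot count per node. This is only a sketch: the function name is made up, and it assumes the comma-separated "4(x2)"-style format that slurm uses for SLURM_TASKS_PER_NODE.

```shell
# Hypothetical helper: expand a SLURM_TASKS_PER_NODE-style string,
# e.g. "4(x2)" -> "4 4" and "2,4(x3)" -> "2 4 4 4".
expand_tasks_per_node() {
  local out="" part count reps i
  local IFS=','                 # split the input on commas
  for part in $1; do
    case "$part" in
      *'(x'*)
        count="${part%%(*}"     # slot count before "("
        reps="${part#*x}"       # strip up to and including "x"
        reps="${reps%)}"        # strip trailing ")"
        i=0
        while [ "$i" -lt "$reps" ]; do
          out="$out $count"     # repeat the count once per node
          i=$((i + 1))
        done
        ;;
      *) out="$out $part" ;;    # plain count, no repetition suffix
    esac
  done
  echo "${out# }"
}

expand_tasks_per_node "4(x2)"   # prints "4 4": 4 slots on each of 2 nodes
```

So under Ralph's reading, "4(x2)" means two nodes with 4 allocated slots each, which mpirun then maps the requested processes onto.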

That was also my interpretation, and I was absolutely sure I had read
it in the srun man page a couple of days ago. In the meantime I have
changed my opinion, because it now says "Number of tasks to be initiated
on each node", as Tim quoted. I've no idea how Tim managed to change
the man page on my computer ;-)

and there is another variable documented:

    SLURM_CPUS_ON_NODE
        Count of processors available to the job on this node. Note
        the select/linear plugin allocates entire nodes to jobs, so
        the value indicates the total count of CPUs on the node. The
        select/cons_res plugin allocates individual processors to
        jobs, so this number indicates the number of processors on
        this node allocated to the job.

Anyway, back to reality: I made some further tests, and the only way to
change the value of SLURM_TASKS_PER_NODE was to tell slurm in
slurm.conf that node x has only y cpus. The variable documented as
SLURM_CPUS_ON_NODE (in 1.0.15 and 1.2.22) doesn't seem to exist in
either version. In 1.2.22 there seems to be SLURM_JOB_CPUS_PER_NODE,
which has the same value as SLURM_TASKS_PER_NODE. In a couple of days
I'll try the other allocator plugin, which allocates on a per-cpu basis
instead of a per-node basis. After that it would probably be a good
idea for somebody (me?) to sum up our thread and ask the slurm guys
for their opinion.

> It is possible that the interpretation of SLURM_TASKS_PER_NODE is
> different when used to allocate as opposed to directly launch
> processes. Our typical usage is for someone to do:
> srun -N 2 -A
> mpirun -np 2 helloworld
> In other words, we use srun to create an allocation, and then run
> mpirun separately within it.
> I am therefore unsure what the "-n 2" will do here. If I believe the
> documentation, it would seem to imply that srun will attempt to
> launch two copies of "mpirun -np 2 helloworld", yet your output
> doesn't seem to support that interpretation. It would appear that the
> "-n 2" is being ignored and only one copy of mpirun is being
> launched. I'm no slurm expert, so perhaps that interpretation is
> incorrect.

That is indeed what happens when you call "srun -N 2 mpirun -np 2
helloworld", but "srun -N 2 -b mpirun -np 2 helloworld" submits it as a
batch job, i.e. "mpirun -np 2 helloworld" is executed only once, on one
of the allocated nodes, and the environment variables are set
appropriately -- or at least should be set appropriately -- so that a
subsequent srun or an mpirun inside the command does the right thing.
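To make the contrast explicit, here are the two invocation patterns side by side. This is a command sketch only (it requires a working SLURM + Open MPI installation, and "helloworld" is a placeholder program); it cannot be run outside such an environment.

```shell
# Direct launch: srun itself starts the command on the allocated
# nodes, so mpirun would be started once per task -- not what we want.
srun -N 2 mpirun -np 2 helloworld

# Batch submission (-b): the command runs exactly once, on one of the
# two allocated nodes, with SLURM_* variables set so that the inner
# mpirun can discover the allocation and place its 2 processes itself.
srun -N 2 -b mpirun -np 2 helloworld
```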