Open MPI User's Mailing List Archives

Subject: [OMPI users] SLURM and OpenMPI
From: Werner Augustin (Werner.Augustin_at_[hidden])
Date: 2008-03-20 10:24:12


Hi,

At our site here at the University of Karlsruhe we are running two
large clusters with SLURM and HP-MPI. For our new cluster we want to
keep SLURM and switch to OpenMPI. While testing I ran into the
following problem:

With HP-MPI I do something like

srun -N 2 -n 2 -b mpirun -srun helloworld

and get

Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.

When I try the same with OpenMPI (version 1.2.4)

srun -N 2 -n 2 -b mpirun helloworld

I get

Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.

and with the following command I get

srun -N 2 -n 2 -b mpirun -np 2 helloworld

Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.

which is still wrong, because it uses only one of the two allocated
nodes.

OpenMPI reads the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
variables, uses SLURM to start one orted per node, and then starts
tasks up to the maximum number of slots on every node. So basically it
also does some 'resource management' of its own and interferes with
SLURM. OK, I can fix that with an mpirun wrapper script which calls
mpirun with the right -np and the right rmaps_base_n_pernode setting,
but it gets worse. We want to allocate computing power on a per-CPU
basis instead of per node, i.e. different users might share a node. In
addition, SLURM can schedule according to memory usage. Therefore it is
important that every node runs exactly the number of tasks that SLURM
wants. The only solution I came up with is to generate a detailed
hostfile for every job and call mpirun --hostfile. Any suggestions for
improvement?
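
To make the hostfile idea concrete, here is a rough sketch of what such
a generator might look like. It is only an illustration, not something
I have running in production. It assumes that SLURM_TASKS_PER_NODE uses
the compact form "2(x3),1" (a count, optionally followed by a repeat
factor), and that the node list has already been expanded into plain
hostnames, for example with "scontrol show hostnames". The function
names expand_tasks_per_node and make_hostfile are made up for this
example.

```python
import re


def expand_tasks_per_node(spec):
    """Expand a string such as '2(x3),1' into [2, 2, 2, 1].

    Assumption: each comma-separated entry is either 'N' or 'N(xR)',
    meaning N tasks on each of R consecutive nodes.
    """
    counts = []
    for part in spec.split(","):
        match = re.fullmatch(r"(\d+)(?:\(x(\d+)\))?", part.strip())
        if match is None:
            raise ValueError(f"unrecognized entry: {part!r}")
        tasks = int(match.group(1))
        repeat = int(match.group(2)) if match.group(2) else 1
        counts.extend([tasks] * repeat)
    return counts


def make_hostfile(hosts, tasks_spec):
    """Return hostfile text with one 'host slots=N' line per node.

    'hosts' is a list of already expanded hostnames, in the same order
    as the entries described by tasks_spec.
    """
    counts = expand_tasks_per_node(tasks_spec)
    if len(counts) != len(hosts):
        raise ValueError(
            f"{len(hosts)} hosts but {len(counts)} task counts"
        )
    lines = [f"{host} slots={n}" for host, n in zip(hosts, counts)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example matching the two-node allocation from the runs above.
    print(make_hostfile(["xc3n13", "xc3n14"], "1(x2)"), end="")
```

The output would be written to a per-job file and passed to
mpirun --hostfile together with a matching -np, so that every node gets
only the number of tasks SLURM assigned to it.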

I've found a discussion thread "slurm and all-srun orterun" in the
mailing list archive concerning the same problem, where Ralph Castain
announced that he was working on two new launch methods which would
fix my problems. Unfortunately his email address has been removed from
the archive, so it would be really nice if the friendly elf mentioned
there is still around and could forward my mail to him.

Thanks in advance,
Werner Augustin