
Subject: Re: [OMPI users] SLURM and OpenMPI
From: Tim Prins (tprins_at_[hidden])
Date: 2008-03-20 17:48:20


Hi Werner,

Open MPI does things a little bit differently than other MPIs when it
comes to supporting SLURM. See
http://www.open-mpi.org/faq/?category=slurm
for general information about running with Open MPI on SLURM.

After trying the commands you sent, I am actually a bit surprised by the
results. I would have expected this mode of operation to work. But
looking at the environment variables that SLURM is setting for us, I can
see why it doesn't.

On a cluster with 4 cores/node, I ran:
[tprins_at_odin ~]$ cat mprun.sh
#!/bin/sh
printenv
[tprins_at_odin ~]$ srun -N 2 -n 2 -b mprun.sh
srun: jobid 55641 submitted
[tprins_at_odin ~]$ cat slurm-55641.out |grep SLURM_TASKS_PER_NODE
SLURM_TASKS_PER_NODE=4(x2)
[tprins_at_odin ~]$

This seems to be wrong, since the srun man page describes
SLURM_TASKS_PER_NODE as the "Number of tasks to be initiated on each
node", which implies the value should be "1(x2)". So maybe this is a
SLURM problem? If this value were reported correctly, Open MPI should
work fine for what you want to do.
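
For reference, SLURM compresses this variable: "4(x2)" means 4 tasks on
each of 2 nodes, so for your request the expected value would indeed be
"1(x2)". Just to illustrate the syntax, here is a rough, untested sh
sketch that expands the compressed form (not something Open MPI itself
runs):

# Expand SLURM_TASKS_PER_NODE, e.g. "4(x2)" -> "4 4", "2(x3),1" -> "2 2 2 1"
expand_tasks_per_node() {
    echo "$1" | tr ',' '\n' | while read spec; do
        count=${spec%%\(*}                      # tasks per node in this group
        case "$spec" in
            *\(x*\)) reps=${spec#*\(x}; reps=${reps%\)} ;;   # repeat count
            *)       reps=1 ;;
        esac
        i=0
        while [ "$i" -lt "$reps" ]; do          # print the count once per node
            printf '%s ' "$count"
            i=$((i + 1))
        done
    done
    echo
}

expand_tasks_per_node "$SLURM_TASKS_PER_NODE"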

Two other things:
1. You should probably use the mpirun command line option '--npernode'
instead of setting rmaps_base_n_pernode directly (example below).
2. Regarding your second example below: Open MPI by default maps 'by
slot', that is, it fills all available slots on the first node before
moving to the second. You can change this; see:
http://www.open-mpi.org/faq/?category=running#mpirun-scheduling
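
For instance, on your two-node allocation (untested, but both options
should be accepted by mpirun in Open MPI 1.2.x):

# one process per allocated node, 2 processes total here:
mpirun --npernode 1 ./helloworld

# or keep -np but map round-robin by node instead of by slot:
mpirun -np 2 --bynode ./helloworld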

I have copied Ralph on this mail to see if he has a better response.

Tim

Werner Augustin wrote:
> Hi,
>
> At our site here at the University of Karlsruhe we are running two
> large clusters with SLURM and HP-MPI. For our new cluster we want to
> keep SLURM and switch to OpenMPI. While testing I ran into the
> following problem:
>
> with HP-MPI I do something like
>
> srun -N 2 -n 2 -b mpirun -srun helloworld
>
> and get
>
> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n14.
>
> when I try the same with OpenMPI (version 1.2.4)
>
> srun -N 2 -n 2 -b mpirun helloworld
>
> I get
>
> Hi, here is process 1 of 8, running MPI version 2.0 on xc3n13.
> Hi, here is process 0 of 8, running MPI version 2.0 on xc3n13.
> Hi, here is process 5 of 8, running MPI version 2.0 on xc3n14.
> Hi, here is process 2 of 8, running MPI version 2.0 on xc3n13.
> Hi, here is process 4 of 8, running MPI version 2.0 on xc3n14.
> Hi, here is process 3 of 8, running MPI version 2.0 on xc3n13.
> Hi, here is process 6 of 8, running MPI version 2.0 on xc3n14.
> Hi, here is process 7 of 8, running MPI version 2.0 on xc3n14.
>
> and with
>
> srun -N 2 -n 2 -b mpirun -np 2 helloworld
>
> Hi, here is process 0 of 2, running MPI version 2.0 on xc3n13.
> Hi, here is process 1 of 2, running MPI version 2.0 on xc3n13.
>
> which is still wrong, because it uses only one of the two allocated
> nodes.
>
> OpenMPI uses the SLURM_NODELIST and SLURM_TASKS_PER_NODE environment
> variables, uses SLURM to start one orted per node, and then launches
> tasks up to the maximum number of slots on every node. So basically it
> also does some 'resource management' of its own and interferes with
> SLURM. OK, I can fix that with an mpirun wrapper script that calls
> mpirun with the right -np and the right rmaps_base_n_pernode setting,
> but it gets worse. We want to allocate computing power on a per-CPU
> basis instead of per node, i.e. different users might share a node. In
> addition, SLURM allows scheduling according to memory usage. It is
> therefore important that every node runs exactly the number of tasks
> that SLURM wants. The only solution I came up with is to generate a
> detailed hostfile for every job and call mpirun --hostfile. Any
> suggestions for improvement?
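
A rough, untested sketch of such a per-job wrapper, assuming it runs
inside the allocation as the batch script, that SLURM sets SLURM_JOBID
and SLURM_NPROCS, and that Open MPI counts repeated hostfile lines as
extra slots on that node:

#!/bin/sh
hostfile=hostfile.$SLURM_JOBID
# one line per SLURM task, on the node SLURM assigned to that task
srun hostname | sort > "$hostfile"
# launch exactly one MPI process per SLURM task
mpirun -np "$SLURM_NPROCS" --hostfile "$hostfile" ./helloworld
rm -f "$hostfile"
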
>
> I've found a discussion thread, "slurm and all-srun orterun", in the
> mailing list archive concerning the same problem, where Ralph Castain
> announced that he was working on two new launch methods that would fix
> my problems. Unfortunately, his email address has been removed from the
> archive, so it would be really nice if the friendly elf mentioned there
> is still around and could forward my mail to him.
>
> Thanks in advance,
> Werner Augustin
> _______________________________________________
> users mailing list
> users_at_[hidden]
> http://www.open-mpi.org/mailman/listinfo.cgi/users