
Open MPI Development Mailing List Archives


Subject: [OMPI devel] SLURM affinity accounting in Open MPI
From: Artem Polyakov (artpol84_at_[hidden])
Date: 2014-02-12 05:42:25


Hello

I found that SLURM installations that use the cgroup plugin and have
TaskAffinity=yes in cgroup.conf have a problem with Open MPI: all
processes on non-launch nodes are bound to a single core. This leads to
quite poor performance.
The problem shows up only when mpirun is used to start the parallel
application from a batch script, for example: mpirun ./mympi
When srun with PMI is used instead, affinity is set properly: srun ./mympi
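
To make the symptom easy to observe, here is a minimal test program (my
own illustration, not part of the original report) that prints the
affinity mask of every rank; under an affected configuration, mpirun
should show all ranks on non-launch nodes confined to the same single
core, while srun shows a proper spread. It assumes Linux/glibc.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[64];
    gethostname(host, sizeof(host));

    /* Ask the kernel which cores this process may run on. */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);   /* 0 = calling process */

    /* Collect the list of allowed cores for printing. */
    char cores[1024] = "";
    int  ncores = 0;
    for (int c = 0; c < CPU_SETSIZE; c++) {
        if (CPU_ISSET(c, &mask)) {
            char buf[16];
            snprintf(buf, sizeof(buf), " %d", c);
            strncat(cores, buf, sizeof(cores) - strlen(cores) - 1);
            ncores++;
        }
    }
    printf("rank %d on %s: %d allowed core(s):%s\n", rank, host, ncores, cores);

    MPI_Finalize();
    return 0;
}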

A closer look shows that the reason lies in the way Open MPI uses srun to
launch the ORTE daemons. Here is an example of the command line:
srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=node02
orted -mca ess slurm -mca orte_ess_jobid 3799121920 -mca
orte_ess_vpid

Passing --nodes=1 --ntasks=1 to SLURM means that you want to start one
task, and (with TaskAffinity=yes) that task will be bound to one core. The
orted then uses this affinity as the base for all the child processes it
spawns. If I understand correctly, the problem with using srun here is
that saying srun --nodes=1 --ntasks=4 would make SLURM spawn 4 independent
orted processes bound to different cores, which is not what we really need.
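
To illustrate the inheritance step, here is a small standalone sketch (my
illustration, assuming Linux/glibc; this is not Open MPI code) that binds
itself to one core, the way TaskAffinity=yes binds the orted task, and
then forks: the child reports the same one-core mask, just like the MPI
processes spawned by orted.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static int count_allowed_cpus(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    sched_getaffinity(0, sizeof(mask), &mask);
    return CPU_COUNT(&mask);
}

int main(void)
{
    /* Simulate what TaskAffinity=yes does to orted: bind to core 0 only. */
    cpu_set_t one_core;
    CPU_ZERO(&one_core);
    CPU_SET(0, &one_core);
    sched_setaffinity(0, sizeof(one_core), &one_core);

    printf("parent: %d allowed core(s)\n", count_allowed_cpus());

    pid_t pid = fork();
    if (pid == 0) {
        /* The child inherits the parent's mask: still one core. */
        printf("child:  %d allowed core(s)\n", count_allowed_cpus());
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return 0;
}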

I found that disabling CPU binding works well as a quick hack for the
cgroup plugin. Since the job runs inside a cgroup that restricts core
access, the spawned child processes are scheduled by the node's scheduler
across all allocated cores. The command line looks like this:
srun --cpu_bind=none --nodes=1 --ntasks=1 --kill-on-bad-exit
--nodelist=node02 orted -mca ess slurm -mca orte_ess_jobid 3799121920 -mca
orte_ess_vpid
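
As a quick way to check that the cgroup still confines the processes even
with --cpu_bind=none, here is a standalone sketch (my illustration, not
part of the attached patch) that prints the kernel's view of the allowed
cores. /proc/self/status reflects both sched_setaffinity() bindings and
cgroup cpuset restrictions, so inside the allocation it should list only
the allocated cores rather than the whole node.

#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Cpus_allowed_list:", 18) == 0) {
            /* e.g. "Cpus_allowed_list:  0-7" for an 8-core allocation */
            fputs(line, stdout);
            break;
        }
    }
    fclose(f);
    return 0;
}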

This solution probably won't work with the SLURM task/affinity plugin,
and it may also be a bad idea when strict affinity is desirable.

My patch against the stable Open MPI version (1.6.5) is attached to this
e-mail. I will try to come up with a more reliable solution, but that
needs more time, and I would first like to hear the opinion of the Open
MPI developers.

-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov