I found that SLURM installations that use the cgroup plugin and
have TaskAffinity=yes in cgroup.conf have a problem with Open MPI: all
processes on non-launch nodes are bound to a single core. This leads to quite
poor performance.
The problem shows up only when mpirun is used to start the parallel application
in a batch script, for example: *mpirun ./mympi*
When srun with PMI is used instead, affinity is set properly: *srun ./mympi*
A closer look shows that the reason lies in the way Open MPI uses srun to
launch the ORTE daemons. Here is an example of the command line:
*srun* *--nodes=1* *--ntasks=1* --kill-on-bad-exit --nodelist=node02
*orted* -mca ess slurm -mca orte_ess_jobid 3799121920 -mca
Passing *--nodes=1* *--ntasks=1* to SLURM means that you want to start one
task, and (with TaskAffinity=yes) it will be bound to one core. The orted then
uses this affinity as the base for all spawned child processes. If I understand
correctly, the problem with using srun here is that if you say *srun*
*--nodes=1* *--ntasks=4*, then SLURM will spawn 4 independent orted
processes bound to different cores, which is not what we really need.
I found that disabling CPU binding works well as a quick hack for the cgroup
plugin. Since the job runs inside a cgroup that restricts core access, the
spawned child processes are scheduled by the node's scheduler across all
allocated cores. The command line looks like this:
srun *--cpu_bind=none* --nodes=1 --ntasks=1 --kill-on-bad-exit
--nodelist=node02 orted -mca ess slurm -mca orte_ess_jobid 3799121920 -mca
This solution probably won't work with the SLURM task/affinity plugin.
It may also be a bad idea when strong affinity is desirable.
My patch against the stable Open MPI version (1.6.5) is attached to this
e-mail. I will try to come up with a more robust solution, but I need more
time and beforehand would like to hear the opinion of the Open MPI developers.
Best regards, Artem Y. Polyakov