I think I am ready to return to the mpirun affinity handling discussion. I now
have a more general solution. It is only beta-tested (on a single cluster
running SLURM with the cgroup plugin), but it shows my main idea, and if it is
worth including in the mainstream I am ready to polish and improve it.
The code respects the SLURM cpu allocation exported through
SLURM_JOB_CPUS_PER_NODE and handles uneven allocations correctly:
it splits the node list into groups of nodes having an equal number of cpus.
For example, for an allocation where three nodes have 12 cpus and one has 7,
we get 2 groups:
1) node0, node1, node2 with 12 cpus;
2) node3 with 7 cpus.
It then uses a separate srun for each group.
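The grouping step can be sketched roughly as follows (a simplified illustration of the idea, not the actual patch code; the `12(x3),7` compressed syntax is the format SLURM uses for SLURM_JOB_CPUS_PER_NODE):

```python
import re

def split_cpus_per_node(spec):
    """Expand a compressed SLURM_JOB_CPUS_PER_NODE value such as
    "12(x3),7" into a flat per-node cpu count list: [12, 12, 12, 7]."""
    counts = []
    for item in spec.split(','):
        m = re.fullmatch(r'(\d+)(?:\(x(\d+)\))?', item)
        cpus = int(m.group(1))
        repeat = int(m.group(2) or 1)
        counts.extend([cpus] * repeat)
    return counts

def group_nodes(nodes, counts):
    """Group consecutive nodes sharing the same cpu count, so that each
    group can be launched with a single srun invocation."""
    groups = []
    for node, cpus in zip(nodes, counts):
        if groups and groups[-1][0] == cpus:
            groups[-1][1].append(node)
        else:
            groups.append((cpus, [node]))
    return groups

nodes = ['node0', 'node1', 'node2', 'node3']
groups = group_nodes(nodes, split_cpus_per_node('12(x3),7'))
# groups -> [(12, ['node0', 'node1', 'node2']), (7, ['node3'])]
```

Each resulting group would then get its own srun with the matching --nodelist.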
The weakness of this patch is that we need to deal with several sruns, and
I am not sure that cleanup will be performed correctly. I plan to test this case.
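As an aside, a simple way to check the resulting binding of each rank (a diagnostic of my own, not part of the patch) is to print the affinity mask, for example in Python on Linux:

```python
import os

# Print the set of cpus this process is allowed to run on
# (os.sched_getaffinity is a Linux-only API).
# Under a correct launch the mask covers all allocated cores; with the
# binding problem described below, every rank reports a single core.
allowed = sorted(os.sched_getaffinity(0))
print("pid %d bound to cpus: %s" % (os.getpid(), allowed))
```

Running this under mpirun or srun on each node makes the difference between the two launch paths easy to see.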
2014-02-12 17:42 GMT+07:00 Artem Polyakov <artpol84_at_[hidden]>:
> I found that SLURM installations that use the cgroup plugin and
> have TaskAffinity=yes in cgroup.conf have a problem with Open MPI: all
> processes on non-launch nodes are assigned to one core. This leads to quite
> poor performance.
> The problem can be seen only when using mpirun to start a parallel application
> in a batch script, for example: *mpirun ./mympi*
> When using srun with PMI, affinity is set properly: *srun ./mympi*
> A closer look shows that the reason lies in the way Open MPI uses srun to
> launch the ORTE daemons. Here is an example of the command line:
> *srun* *--nodes=1* *--ntasks=1* --kill-on-bad-exit --nodelist=node02
> *orted* -mca ess slurm -mca orte_ess_jobid 3799121920 -mca orte_ess_vpid
> Passing *--nodes=1* *--ntasks=1* to SLURM means that you want to start one
> task, and (with TaskAffinity=yes) it will be bound to one core. The orted then
> uses this affinity as the base for all spawned child processes. If I understand
> correctly, the problem with using srun this way is that if you say *srun*
> *--nodes=1* *--ntasks=4*, then SLURM will spawn 4 independent orted
> processes bound to different cores, which is not what we really need.
> I found that disabling cpu binding works well as a quick hack for the cgroup
> plugin. Since the job runs inside a cgroup with core access restrictions, the
> spawned child processes are scheduled by the node's scheduler across all
> allocated cores. The command line looks like this:
> srun *--cpu_bind=none* --nodes=1 --ntasks=1 --kill-on-bad-exit
> --nodelist=node02 orted -mca ess slurm -mca orte_ess_jobid 3799121920 -mca
> This solution will probably not work with the SLURM task/affinity plugin.
> It may also be a bad idea when strong affinity is desirable.
> My patch against the stable Open MPI version (1.6.5) is attached to this
> e-mail. I will try to make a more reliable solution, but I need more time and
> beforehand would like to know the opinion of the Open MPI developers.
> Best regards, Artem Y. Polyakov
Best regards, Artem Y. Polyakov