Open MPI Development Mailing List Archives

Subject: Re: [OMPI devel] SLURM affinity accounting in Open MPI
From: Ralph Castain (rhc_at_[hidden])
Date: 2014-02-12 09:16:27


I'm not entirely comfortable with the solution, as the problem truly is that we are doing what you asked - i.e., if you tell Slurm to bind tasks to a single core, then we live within it. The problem with your proposed fix is that we override whatever the user may have actually wanted - e.g., if the user told Slurm to bind us to 4 cores, then we override that constraint.

If you can come up with a way that we can launch the orteds in a manner that respects whatever directive was given, while still providing added flexibility, then great. Otherwise, I would say the right solution is for users not to set TaskAffinity when using mpirun.

On Feb 12, 2014, at 2:42 AM, Artem Polyakov <artpol84_at_[hidden]> wrote:

> Hello
>
> I found that SLURM installations that use the cgroup plugin and have TaskAffinity=yes in cgroup.conf have a problem with Open MPI: all processes on non-launch nodes are assigned to a single core, which leads to quite poor performance.
> The problem appears only when mpirun is used to start the parallel application from a batch script, for example: mpirun ./mympi
> When srun with PMI is used instead, affinity is set properly: srun ./mympi
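> For illustration, the kind of setup I am talking about looks roughly like the following (the ConstrainCores setting and the batch script are only an example of such a configuration, not copied from a real cluster):
>
>   # slurm.conf: TaskPlugin=task/cgroup
>   # cgroup.conf:
>   ConstrainCores=yes
>   TaskAffinity=yes
>
>   #!/bin/sh
>   # job.sh, submitted with: sbatch job.sh
>   #SBATCH --nodes=2
>   #SBATCH --ntasks-per-node=8
>   mpirun ./mympi     # ranks on the non-launch node all end up on one core
>   # srun ./mympi     # with PMI, affinity is set properly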
>
> A closer look shows that the reason lies in the way Open MPI uses srun to launch the ORTE daemons. Here is an example of the command line:
> srun --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=node02 orted -mca ess slurm -mca orte_ess_jobid 3799121920 -mca orte_ess_vpid
>
> Passing --nodes=1 --ntasks=1 to SLURM means that you want to start one task, and (with TaskAffinity=yes) it will be bound to one core. The orted then uses this affinity as the base for all the branch processes it spawns. If I understand correctly, the problem with using srun here is that saying srun --nodes=1 --ntasks=4 would make SLURM spawn 4 independent orted processes bound to different cores, which is not what we really need.
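> One quick way to see this on the non-launch node (just a diagnostic sketch; the PID below is made up) is to check the affinity mask that orted and its children inherit:
>
>   pgrep -f orted                                # find the orted PID, say 12345
>   grep Cpus_allowed_list /proc/12345/status     # shows a single core
>   # the branch processes forked by orted inherit the same mask:
>   for p in $(pgrep -P 12345); do grep Cpus_allowed_list /proc/$p/status; done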
>
> I found that disabling CPU binding works well as a quick hack for the cgroup plugin. Since the job runs inside a cgroup that restricts core access, the spawned branch processes are scheduled by the node's scheduler across all allocated cores. The command line looks like this:
> srun --cpu_bind=none --nodes=1 --ntasks=1 --kill-on-bad-exit --nodelist=node02 orted -mca ess slurm -mca orte_ess_jobid 3799121920 -mca orte_ess_vpid
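> To verify what binding SLURM actually applies to orted, I simply add "verbose" to the binding option (not part of the fix, just a check):
>
>   srun --cpu_bind=verbose,none --nodes=1 --ntasks=1 --kill-on-bad-exit \
>        --nodelist=node02 orted ...    # same -mca arguments as above
>   # with "verbose", slurmstepd prints the cpu binding it applies to each task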
>
> This solution probably won't work with the SLURM task/affinity plugin. It may also be a bad idea when strong affinity is desirable.
>
> My patch against the stable Open MPI version (1.6.5) is attached to this e-mail. I will try to come up with a more robust solution, but that needs more time, and beforehand I would like to know the opinion of the Open MPI developers.
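> In case someone wants to try it, this is roughly how I build and test it (the prefix and the -p level are of course just examples):
>
>   cd openmpi-1.6.5
>   patch -p0 < affinity.patch      # or -p1, depending on how the patch was taken
>   ./configure --prefix=$HOME/ompi-1.6.5-test --with-slurm
>   make -j8 && make install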
>
> --
> Best regards, Artem Y. Polyakov
> <affinity.patch>