Hi all,
When slurm is configured with the following parameters:
TaskPlugin=task/affinity
TaskPluginParam=Cpusets
srun binds the processes by placing them into different
cpusets, each containing a single core.
For example, "srun -N 2 -n 4" creates 2 cpusets on each of the two allocated
nodes and places the four ranks in them, so each rank is constrained to a
cpuset containing a single core.
The issue in that case is in the macro OPAL_PAFFINITY_PROCESS_IS_BOUND (in
opal/mca/paffinity/paffinity.h):
. opal_paffinity_base_get_processor_info() fills in num_processors with 1
(this is the size of each cpuset)
. num_bound is set to 1 too
so the test (0 < num_bound && num_bound < num_processors) fails, because
1 < 1 is false, and *bound ends up false.
So the binding is done correctly by slurm but is not detected by MPI.
To support the cpuset binding done by slurm, I propose the following patch:
hg diff opal/mca/paffinity/paffinity.h
diff -r 4d8c8a39b06f opal/mca/paffinity/paffinity.h
--- a/opal/mca/paffinity/paffinity.h Thu Apr 21 17:38:00 2011 +0200
+++ b/opal/mca/paffinity/paffinity.h Tue Jul 12 15:44:59 2011 +0200
@@ -218,7 +218,8 @@
num_bound++; \
} \
} \
- if (0 < num_bound && num_bound < num_processors) { \
+ if (0 < num_bound && ((num_processors == 1) || \
+ (num_bound < num_processors))) { \
*(bound) = true; \
} \
} \
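
To make the effect of the change easier to see, here is a minimal standalone sketch of the two conditions. It is not the actual macro: the function names (is_bound_original, is_bound_patched, check_singleton_cpuset) are hypothetical, and the values of num_bound and num_processors are passed in directly rather than obtained from the paffinity framework.

```c
#include <assert.h>
#include <stdbool.h>

/* Original test: bound only if some, but not all, of the visible
 * processors are set in the affinity mask. */
bool is_bound_original(int num_bound, int num_processors)
{
    return 0 < num_bound && num_bound < num_processors;
}

/* Patched test: additionally treat "only one processor visible" as bound. */
bool is_bound_patched(int num_bound, int num_processors)
{
    return 0 < num_bound &&
           (num_processors == 1 || num_bound < num_processors);
}

/* Hypothetical self-check, not part of Open MPI. */
void check_singleton_cpuset(void)
{
    /* Singleton cpuset created by slurm: 1 visible processor, 1 bit set. */
    assert(!is_bound_original(1, 1));
    assert(is_bound_patched(1, 1));

    /* Unbound process on an 8-core node: result unchanged by the patch. */
    assert(!is_bound_original(8, 8));
    assert(!is_bound_patched(8, 8));

    /* Process bound to 2 of 8 cores: result unchanged by the patch. */
    assert(is_bound_original(2, 8));
    assert(is_bound_patched(2, 8));
}
```

Only the num_processors == 1 case changes. Note that under the patched test a process on a machine that really has a single processor would also be reported as bound, since this check cannot tell that case apart from a singleton cpuset.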