Is there any plan to support NUMA memory binding for tasks?
Even with bind-to-core and memory affinity in 1.4.3 we were seeing 15-20%
variation in run times on a Nehalem cluster. This turned out to be mostly due
to bad page placement. Residual pagecache pages from the last job on a node (or
the memory of a suspended job in the case of preemption) could occasionally cause
a lot of non-local page placement. We hacked the libnuma module to MPOL_BIND
tasks to their local memory and eliminated the majority of this variability.
We are currently running with this as the default behaviour since it's "the
right thing" for 99% of jobs (we have an environment variable to back off to
affinity for the rest).
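As a rough illustration of what MPOL_BIND binding amounts to, here is a minimal sketch that asks the Linux kernel to restrict the calling thread's page allocations to a given set of NUMA nodes. It is not the libnuma module change described above. The helper names (node_mask, bind_memory_to_nodes), the 64-node limit, and the hard-coded x86_64 syscall number are assumptions made for the example; it calls set_mempolicy(2) directly through ctypes rather than through libnuma or hwloc.

```python
import ctypes
import platform
import sys

MPOL_BIND = 2               # value from <linux/mempolicy.h>
SYS_SET_MEMPOLICY = 238     # x86_64 syscall number only (assumption)
MASK_BITS = 64              # assumption: at most 64 NUMA nodes


def node_mask(nodes):
    """Return an integer bitmask with one bit set per NUMA node id."""
    mask = 0
    for node in nodes:
        if not 0 <= node < MASK_BITS:
            raise ValueError(f"node id {node} out of range")
        mask |= 1 << node
    return mask


def bind_memory_to_nodes(nodes):
    """Try to MPOL_BIND the calling thread; return True on success."""
    # Only attempt the raw syscall where the number above is known to apply.
    if not sys.platform.startswith("linux") or platform.machine() != "x86_64":
        return False
    libc = ctypes.CDLL(None, use_errno=True)
    mask = ctypes.c_ulong(node_mask(nodes))
    rc = libc.syscall(
        ctypes.c_long(SYS_SET_MEMPOLICY),
        ctypes.c_int(MPOL_BIND),
        ctypes.byref(mask),
        ctypes.c_ulong(MASK_BITS + 1),  # kernel expects bit count plus one
    )
    return rc == 0
```

Calling bind_memory_to_nodes([0]) before a task allocates its working set would request strict node 0 placement. It returns False when the kernel refuses, for example if the node does not exist or the syscall is blocked in a container. Unlike a preferred or affinity policy, MPOL_BIND does not fall back to other nodes when the local node is full.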
I'm guessing/hoping doing the above based on hwloc will be easier/more
maintainable. As a first pass, when is that likely to be an option?