Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] Memory affinity
From: David Singleton (David.Singleton_at_[hidden])
Date: 2011-02-28 15:47:13

On 03/01/2011 05:51 AM, Jeff Squyres wrote:
> On Feb 28, 2011, at 12:24 PM, Bernd Kallies wrote:
>>> 1. I have no reason to doubt this person, but was wondering if someone could confirm this (for Linux).
>> set_mempolicy(2) of recent 2.6 kernels says:
>> Process policy is not remembered if the page is swapped out. When such a
>> page is paged back in, it will use the policy of the process or memory
>> range that is in effect at the time the page is allocated.
> Ah, interesting. That implies two different scenarios:
> 1. I set a policy, malloc some memory, that memory gets swapped out, I change the policy, then the memory gets swapped back in. And it now obeys the new policy.
> 2. I malloc some memory and set an explicit policy with hwloc_set_area_membind*(). That memory then gets swapped out, and later gets swapped back in. Since the memory will be the same memory range, it'll keep the same policy as I set with hwloc_set_area_membind(), right?
> That would seem to imply that I should always hwloc_set_area_membind() if I want it to persist beyond any potential future swapping.
> Does that sound right?

I don't think you can avoid the problem. Unless it has changed very recently, Linux swapin_readahead is the main culprit in messing with NUMA locality on that
platform. Faulting a single page causes 8 or 16 or whatever contiguous pages to be read from swap. An arbitrary contiguous range of pages in swap may not even
come from the same process, much less the same NUMA node. My understanding is that since there is no NUMA info with the swap entry, the only policy that can be
applied is that of the faulting vma in the faulting process. The faulted page will have the desired NUMA placement, but the other pages read in with it may not.
So swapping mixes different processes' NUMA policies, leading to a "NUMA diffusion process". Here's a contrived example on a 2.6.27 kernel.

# Grab 3 lots of 10000MB on a 24GB Nehalem node:

v1100:~ > numactl --membind=0 ./memory_grabber 10000 &
[1] 434
v1100:~ > numactl --membind=1 ./memory_grabber 10000 &
[2] 435
v1100:~ > ./memory_grabber 10000 &
[3] 436

# Time sequence of NUMA page locality for the 3 processes:

v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
7ffd861da000 bind:0 anon=2184075 dirty=2184075 active=1104219 N0=2184075
7ffd861da000 bind:1 anon=1709350 dirty=1709350 active=918142 N1=1709350
7ffd861da000 default anon=2086028 dirty=2086028 active=1194354 N0=774151 N1=1311877

v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
7ffd861da000 bind:0 anon=1777593 dirty=1678821 swapcache=98772 active=744021 N0=1777524 N1=69
7ffd861da000 bind:1 anon=1649256 dirty=1649256 active=797862 N1=1649256
7ffd861da000 default anon=2313532 dirty=2143102 swapcache=170430 active=1928372 N0=982483 N1=1331049

v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
7ffd861da000 bind:0 anon=1619803 dirty=1521031 swapcache=98772 active=652729 N0=1617878 N1=1925
7ffd861da000 bind:1 anon=1616983 dirty=1616983 active=771814 N1=1616983
7ffd861da000 default anon=2393655 dirty=2223225 swapcache=170430 active=2147908 N0=1052167 N1=1341488

v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
7ffd861da000 bind:0 anon=1490293 dirty=1391521 swapcache=98772 active=679807 N0=1482914 N1=7379
7ffd861da000 bind:1 anon=1850875 dirty=1850873 swapcache=2 active=996836 N0=256407 N1=1594468
7ffd861da000 default anon=2484496 dirty=2314066 swapcache=170430 active=2396456 N0=1083215 N1=1401281

I suspect hwloc_set_area_membind() will do no more than set MPOL_BIND policy for the vma as has happened here.
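
The drift in the samples above can be counted mechanically. What follows is a rough Python sketch, not something used in the original experiment: it parses one numa_maps line and counts the pages sitting outside the nodes named by a bind policy. The function names are hypothetical, and the parser assumes the "bind:<nodelist>" policy format shown in the 2.6.27 output above; other kernels may print the policy differently.

```python
import re

def parse_nodelist(text):
    """Expand a node list such as "0", "0,2" or "0-3" into a set of ints."""
    nodes = set()
    for chunk in text.split(","):
        lo, _, hi = chunk.partition("-")
        nodes.update(range(int(lo), int(hi or lo) + 1))
    return nodes

def parse_numa_maps_line(line):
    """Return (address, policy, pages_per_node) for one numa_maps line."""
    parts = line.split()
    address, policy = parts[0], parts[1]
    pages_per_node = {}
    for token in parts[2:]:
        # Per-node counts look like "N0=256407"; other fields are ignored.
        match = re.fullmatch(r"N(\d+)=(\d+)", token)
        if match:
            pages_per_node[int(match.group(1))] = int(match.group(2))
    return address, policy, pages_per_node

def misplaced_pages(policy, pages_per_node):
    """Count pages outside the bound nodes; None if the policy is not bind."""
    # Assumption: policy is printed as "bind:<nodelist>", as in the output above.
    if not policy.startswith("bind:"):
        return None
    allowed = parse_nodelist(policy.split(":", 1)[1])
    return sum(n for node, n in pages_per_node.items() if node not in allowed)

if __name__ == "__main__":
    # Last sample of the bind:1 process from the transcript above.
    sample = ("7ffd861da000 bind:1 anon=1850875 dirty=1850873 swapcache=2 "
              "active=996836 N0=256407 N1=1594468")
    _, policy, per_node = parse_numa_maps_line(sample)
    print(policy, misplaced_pages(policy, per_node))
```

Run on the final bind:1 line, this reports the N0 count as misplaced, i.e. the pages that swap-in put on the wrong node despite the vma policy.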

One way around this problem is to switch off swapin_readahead but this has a large impact on swap performance and, AFAIK, there's not even a kernel tunable to
do so. As an alternative, we have toyed with running an "anti-entropy" daemon that occasionally runs numa_migrate_pages() on jobs to scoop pages back to where
they belong - not pretty.
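
To make the "anti-entropy" idea concrete, here is a minimal Python sketch of a single sweep. It is an illustration only, not the daemon described above: instead of calling numa_migrate_pages() directly it shells out to the migratepages utility from the numactl package (usage: migratepages pid from-nodes to-nodes), and the function names and the shape of the job list are hypothetical. Note that migratepages moves all of a process's pages on the source nodes, so this only makes sense for jobs bound as a whole to their target nodes.

```python
import subprocess

def build_migrate_command(pid, bound_nodes, pages_per_node):
    """Return a migratepages argv that moves stray pages home, or None."""
    stray = sorted(node for node, pages in pages_per_node.items()
                   if pages > 0 and node not in bound_nodes)
    if not stray:
        return None  # everything is already where the policy wants it
    from_nodes = ",".join(str(n) for n in stray)
    to_nodes = ",".join(str(n) for n in sorted(bound_nodes))
    return ["migratepages", str(pid), from_nodes, to_nodes]

def sweep_once(jobs, run=subprocess.run):
    """Issue one migration per job that has stray pages.

    jobs: iterable of (pid, bound_nodes, pages_per_node) tuples, where
    pages_per_node maps node id -> page count (e.g. from numa_maps).
    run: injectable so the sweep can be dry-run without touching the system.
    """
    issued = []
    for pid, bound_nodes, pages_per_node in jobs:
        command = build_migrate_command(pid, bound_nodes, pages_per_node)
        if command is not None:
            run(command, check=False)
            issued.append(command)
    return issued

if __name__ == "__main__":
    # Dry run using the final sample of the --membind=1 process (pid 435).
    jobs = [(435, {1}, {0: 256407, 1: 1594468})]
    for command in sweep_once(jobs, run=lambda cmd, check: None):
        print(" ".join(command))
```

A real daemon would repeat such a sweep on a timer and would need privileges to migrate pages of other users' processes; both are left out here.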