
Hardware Locality Development Mailing List Archives


Subject: Re: [hwloc-devel] Memory affinity
From: David Singleton (David.Singleton_at_[hidden])
Date: 2011-02-28 16:37:52


On 03/01/2011 08:01 AM, Jeff Squyres wrote:
> On Feb 28, 2011, at 3:47 PM, David Singleton wrote:
>
>> I don't think you can avoid the problem. Unless it has changed very recently, Linux swapin_readahead is the main culprit in messing with NUMA locality on that platform. Faulting a single page causes 8 or 16 or however many contiguous pages to be read from swap. An arbitrary contiguous range of pages in swap may not even come from the same process, let alone the same NUMA node. My understanding is that since there is no NUMA information stored with a swap entry, the only policy that can be applied is that of the faulting vma in the faulting process. The faulted page will have the desired NUMA placement, but the rest possibly will not. So swapping mixes different processes' NUMA policies, leading to a "NUMA diffusion" process.
>
> That is terrible!
>
> Is the only way to avoid this to pin the memory so that it doesn't get swapped out? (which is evil in its own way)

AFAIK, yes. I understand various heuristics have been added to the swap code to improve the probability of good NUMA placement, but there are still no guarantees.
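
For what it's worth, the pinning itself needs only standard Linux calls. Nothing in this thread prescribes a particular recipe, but a minimal sketch (assuming libnuma's <numaif.h> for the mbind() wrapper, with node 0 and a 512 MiB region chosen arbitrarily) would be:

/* Sketch only: bind a region to one NUMA node and pin it so that
 * swapin_readahead can never undo the placement.  Needs a sufficient
 * RLIMIT_MEMLOCK (or CAP_IPC_LOCK) and -lnuma for mbind(). */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <numaif.h>

int main(void)
{
    size_t len = 512UL << 20;                 /* 512 MiB test region */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    unsigned long nodemask = 1UL << 0;        /* NUMA node 0 */
    if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
        perror("mbind");                      /* placement policy for the region */

    if (mlock(buf, len))                      /* pin: pages can no longer swap out */
        perror("mlock");

    memset(buf, 1, len);                      /* fault everything in on node 0 */
    return 0;
}

The obvious cost is that the locked memory is simply unavailable for anything else, which is the "evil in its own way" part.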

>
>> Here's a contrived example on a 2.6.27 kernel.
>>
>> # Grab 3 lots of 10000MB on a 24GB Nehalem node:
>>
>> v1100:~> numactl --membind=0 ./memory_grabber 10000&
>> [1] 434
>> v1100:~> numactl --membind=1 ./memory_grabber 10000&
>> [2] 435
>> v1100:~> ./memory_grabber 10000&
>> [3] 436
>>
>> # Time sequence of NUMA page locality for the 3 processes:
>>
>> v1100:~> cat /proc/43?/numa_maps | grep 7ffd861da000
>> 7ffd861da000 bind:0 anon=2184075 dirty=2184075 active=1104219 N0=2184075
>> 7ffd861da000 bind:1 anon=1709350 dirty=1709350 active=918142 N1=1709350
>> 7ffd861da000 default anon=2086028 dirty=2086028 active=1194354 N0=774151 N1=1311877
>>
>> v1100:~> cat /proc/43?/numa_maps | grep 7ffd861da000
>> 7ffd861da000 bind:0 anon=1777593 dirty=1678821 swapcache=98772 active=744021 N0=1777524 N1=69
>> 7ffd861da000 bind:1 anon=1649256 dirty=1649256 active=797862 N1=1649256
>> 7ffd861da000 default anon=2313532 dirty=2143102 swapcache=170430 active=1928372 N0=982483 N1=1331049
>>
>> v1100:~> cat /proc/43?/numa_maps | grep 7ffd861da000
>> 7ffd861da000 bind:0 anon=1619803 dirty=1521031 swapcache=98772 active=652729 N0=1617878 N1=1925
>> 7ffd861da000 bind:1 anon=1616983 dirty=1616983 active=771814 N1=1616983
>> 7ffd861da000 default anon=2393655 dirty=2223225 swapcache=170430 active=2147908 N0=1052167 N1=1341488
>>
>> v1100:~> cat /proc/43?/numa_maps | grep 7ffd861da000
>> 7ffd861da000 bind:0 anon=1490293 dirty=1391521 swapcache=98772 active=679807 N0=1482914 N1=7379
>> 7ffd861da000 bind:1 anon=1850875 dirty=1850873 swapcache=2 active=996836 N0=256407 N1=1594468
>> 7ffd861da000 default anon=2484496 dirty=2314066 swapcache=170430 active=2396456 N0=1083215 N1=1401281
>
> I'm sorry; I'm not too familiar with the output of /proc/*/numa_maps -- what is this showing? I see some entries switching from active=X to swapcache=X, presumably meaning that they have been swapped out...?
>

Apologies. It's the N0 and N1 counts that are of interest - the number of pages on the two NUMA nodes. We start with the distribution in each process aligned with its memory policy. But as swapping occurs, we see pages allocated on the "other" NUMA node in the two processes that are supposedly bound to one NUMA node. (The third process is just there to cause paging.)
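
For reference, memory_grabber itself never appears in the thread, so the following is only a guess at what such a test does (the size argument matches the commands above; everything else is assumption): allocate the requested amount and keep touching it, so its placement can be watched in /proc/<pid>/numa_maps while the machine is pushed into swap.

/* Hypothetical "memory_grabber"-style test - not the program used above. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <MiB>\n", argv[0]); return 1; }

    size_t len  = strtoul(argv[1], NULL, 10) << 20;   /* requested size in MiB */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    char *buf = malloc(len);
    if (!buf) { perror("malloc"); return 1; }

    for (;;) {                                 /* touch every page, forever */
        for (size_t off = 0; off < len; off += page)
            buf[off]++;
        sleep(1);                              /* give the pager room to work */
    }
}

Run one copy under numactl --membind=0, one under --membind=1 and one unbound, as above, and watch the N0/N1 counts drift once swapping starts.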