Hardware Locality Development Mailing List Archives

Subject: Re: [hwloc-devel] Memory affinity
From: Jeff Squyres (jsquyres_at_[hidden])
Date: 2011-02-28 16:01:39


On Feb 28, 2011, at 3:47 PM, David Singleton wrote:

> I don't think you can avoid the problem. Unless it has changed very recently, Linux swapin_readahead is the main culprit in messing with NUMA locality on that platform. Faulting a single page causes 8 or 16 or whatever contiguous pages to be read from swap. An arbitrary contiguous range of pages in swap may not even come from the same process, let alone the same NUMA node. My understanding is that since there is no NUMA info with the swap entry, the only policy that can be applied is that of the faulting vma in the faulting process. The faulted page will have the desired NUMA placement but possibly not the rest. So swapping mixes different processes' NUMA policies, leading to a "NUMA diffusion process".

That is terrible!

Is the only way to avoid this to pin the memory so that it doesn't get swapped out? (which is evil in its own way)

> Here's a contrived example on a 2.6.27 kernel.
>
> # Grab 3 lots of 10000MB on a 24GB Nehalem node:
>
> v1100:~ > numactl --membind=0 ./memory_grabber 10000 &
> [1] 434
> v1100:~ > numactl --membind=1 ./memory_grabber 10000 &
> [2] 435
> v1100:~ > ./memory_grabber 10000 &
> [3] 436
>
> # Time sequence of NUMA page locality for the 3 processes:
>
> v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
> 7ffd861da000 bind:0 anon=2184075 dirty=2184075 active=1104219 N0=2184075
> 7ffd861da000 bind:1 anon=1709350 dirty=1709350 active=918142 N1=1709350
> 7ffd861da000 default anon=2086028 dirty=2086028 active=1194354 N0=774151 N1=1311877
>
> v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
> 7ffd861da000 bind:0 anon=1777593 dirty=1678821 swapcache=98772 active=744021 N0=1777524 N1=69
> 7ffd861da000 bind:1 anon=1649256 dirty=1649256 active=797862 N1=1649256
> 7ffd861da000 default anon=2313532 dirty=2143102 swapcache=170430 active=1928372 N0=982483 N1=1331049
>
> v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
> 7ffd861da000 bind:0 anon=1619803 dirty=1521031 swapcache=98772 active=652729 N0=1617878 N1=1925
> 7ffd861da000 bind:1 anon=1616983 dirty=1616983 active=771814 N1=1616983
> 7ffd861da000 default anon=2393655 dirty=2223225 swapcache=170430 active=2147908 N0=1052167 N1=1341488
>
> v1100:~ > cat /proc/43?/numa_maps | grep 7ffd861da000
> 7ffd861da000 bind:0 anon=1490293 dirty=1391521 swapcache=98772 active=679807 N0=1482914 N1=7379
> 7ffd861da000 bind:1 anon=1850875 dirty=1850873 swapcache=2 active=996836 N0=256407 N1=1594468
> 7ffd861da000 default anon=2484496 dirty=2314066 swapcache=170430 active=2396456 N0=1083215 N1=1401281

I'm sorry; I'm not too familiar with the output of /proc/*/numa_maps -- what is this showing? I see some entries switching from active=X to swapcache=X, presumably meaning that they have been swapped out...?
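
For anyone else reading the quoted output, here is a minimal Python sketch (not part of the original exchange) that splits one numa_maps line into its fields and reports how many pages sit outside the node a "bind:<n>" policy asked for. The function names parse_numa_maps_line and off_node_fraction are hypothetical, and the sketch assumes only the key=value layout visible in the lines above, where each N<node>=<pages> field is a per-node page count.

```python
import re


def parse_numa_maps_line(line):
    """Split one /proc/<pid>/numa_maps line into address, policy,
    integer counters (anon, dirty, ...) and per-node page counts."""
    fields = line.split()
    address, policy = fields[0], fields[1]
    counters = {}
    node_pages = {}
    for field in fields[2:]:
        if "=" not in field:
            continue  # bare flags such as "heap" or "stack" carry no count
        key, value = field.split("=", 1)
        node = re.fullmatch(r"N(\d+)", key)
        if node:
            node_pages[int(node.group(1))] = int(value)
        elif value.isdigit():
            counters[key] = int(value)
    return address, policy, counters, node_pages


def off_node_fraction(policy, node_pages):
    """Fraction of pages outside the node named in a 'bind:<n>' policy.
    Returns None for other policies or when no pages are listed."""
    match = re.fullmatch(r"bind:(\d+)", policy)
    total = sum(node_pages.values())
    if not match or total == 0:
        return None
    bound = int(match.group(1))
    return 1.0 - node_pages.get(bound, 0) / total


if __name__ == "__main__":
    # Last sample for the bind:1 process in the quoted output.
    sample = ("7ffd861da000 bind:1 anon=1850875 dirty=1850873 swapcache=2 "
              "active=996836 N0=256407 N1=1594468")
    _, policy, counters, node_pages = parse_numa_maps_line(sample)
    print(policy, node_pages, off_node_fraction(policy, node_pages))
```

Run against the last bind:1 line above, this reports roughly 14% of that process's pages on node 0, even though the process was started with --membind=1.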

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/