
Open MPI Development Mailing List Archives


Subject: [OMPI devel] Fwd: [hwloc-devel] possible membind changes coming in the Linux kernel
From: Jeffrey Squyres (jsquyres_at_[hidden])
Date: 2012-03-16 17:16:26


This isn't strictly related to Open MPI, but all of us here care about NUMA, locality, and performance, so I thought I'd pass along something that Brice forwarded to the hwloc-devel list.

See Brice's note below, and the original mail to the LKML below that.

Begin forwarded message:

> From: Brice Goglin <Brice.Goglin_at_[hidden]>
> Subject: [hwloc-devel] possible membind changes coming in the Linux kernel
> Date: March 16, 2012 11:11:35 AM EDT
> To: hwloc development <hwloc-devel_at_[hidden]>
> Reply-To: Hardware locality development list <hwloc-devel_at_[hidden]>
>
> We'll have to check the compatibility of this thing with the hwloc membind API if/when it gets merged in the Linux kernel.
> Lee Schermerhorn's Migrate-on-Fault is supposed to match hwloc_membind_nexttouch; that would be very good news.
>
> Brice
>
>
> -------- Original Message --------
> Subject: [RFC][PATCH 00/26] sched/numa
> Date: Fri, 16 Mar 2012 15:40:28 +0100
> From: Peter Zijlstra <a.p.zijlstra_at_[hidden]>
> To: Linus Torvalds <torvalds_at_[hidden]>, Andrew Morton <akpm_at_[hidden]>, Thomas Gleixner <tglx_at_[hidden]>, Ingo Molnar <mingo_at_[hidden]>, Paul Turner <pjt_at_[hidden]>, Suresh Siddha <suresh.b.siddha_at_[hidden]>, Mike Galbraith <efault_at_[hidden]>, "Paul E. McKenney" <paulmck_at_[hidden]>, Lai Jiangshan <laijs_at_[hidden]>, Dan Smith <danms_at_[hidden]>, Bharata B Rao <bharata.rao_at_[hidden]>, Lee Schermerhorn <Lee.Schermerhorn_at_[hidden]>, Andrea Arcangeli <aarcange_at_[hidden]>, Rik van Riel <riel_at_[hidden]>, Johannes Weiner <hannes_at_[hidden]>
> Cc: linux-kernel_at_[hidden], linux-mm_at_[hidden]
>
> Hi All,
>
> While the current scheduler has knowledge of the machine topology, including
> NUMA (although there's room for improvement there as well [1]), it is
> completely insensitive to which nodes a task's memory actually is on.
>
> Current upstream task memory allocation prefers to use the node the task is
> currently running on (unless explicitly told otherwise, see
> mbind()/set_mempolicy()), and with the scheduler free to move the task about at
> will, the task's memory can end up being spread all over the machine's nodes.
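>
> For illustration, a minimal userspace sketch of that default and its
> explicit override, using the set_mempolicy() wrapper from libnuma (link
> with -lnuma); this is not code from the patch set:
>
>   /* Restrict this task's future page allocations to node 0, no matter
>    * which node the scheduler runs it on. */
>   #include <numaif.h>                 /* set_mempolicy(), MPOL_BIND */
>   #include <stdlib.h>
>
>   int main(void)
>   {
>       unsigned long nodemask = 1UL << 0;       /* bit 0 = NUMA node 0 */
>       if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8))
>           return 1;
>       /* Pages backing this buffer now come from node 0 when touched. */
>       char *buf = malloc(1 << 20);
>       return buf ? 0 : 1;
>   }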
>
> While the scheduler does a reasonable job of keeping short running tasks on a
> single node (by means of simply not doing the cross-node migration very often),
> it completely blows for long-running processes with a large memory footprint.
>
> This patch-set aims at improving this situation. It does so by assigning a
> preferred, or home, node to every process/thread_group. Memory allocation is
> then directed by this preference instead of the node the task might actually be
> running on momentarily. The load-balancer is also modified to prefer running
> the task on its home-node, although not at the cost of letting CPUs go idle or
> at the cost of execution fairness.
>
> On top of this a new NUMA balancer is introduced, which can change a process'
> home-node the hard way. This heavy process migration is driven by two factors:
> either tasks are running away from their home-node, or memory is being
> allocated away from the home-node. In either case, it tries to move processes
> around to make the 'problem' go away.
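>
> In toy form, the allocation side of that scheme looks something like
> this (a sketch of the idea as described above, not the patch's code):
>
>   struct proc { int home_node; };     /* assigned once per process */
>
>   /* Allocations follow the process-wide home node rather than the node
>    * the task happens to be running on at this instant. */
>   static int alloc_node(const struct proc *p, int running_node)
>   {
>       /* No home node assigned: fall back to the current local policy. */
>       return p->home_node >= 0 ? p->home_node : running_node;
>   }
>
>   int main(void)
>   {
>       struct proc p = { .home_node = 1 };
>       return alloc_node(&p, 0);       /* 1: the home node wins */
>   }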
>
> The home-node migration handles both CPU and memory (anonymous memory only for
> now) in an integrated fashion. The memory migration uses migrate-on-fault to
> avoid doing a lot of work from the actual numa balancer kernel thread and only
> migrates the active memory.
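>
> A rough userspace analogue of that next-touch behaviour (the patches do
> this in-kernel; this sketch combines mprotect() and a SIGSEGV handler
> with the real move_pages() syscall, with error handling and the getcpu()
> lookup for the destination node elided):
>
>   #define _GNU_SOURCE
>   #include <numaif.h>                 /* move_pages(), MPOL_MF_MOVE */
>   #include <signal.h>
>   #include <sys/mman.h>
>   #include <unistd.h>
>
>   static long pagesz;
>
>   static void on_touch(int sig, siginfo_t *si, void *uc)
>   {
>       void *page = (void *)((unsigned long)si->si_addr & ~(pagesz - 1));
>       int dest = 0;                   /* real code: node of faulting CPU */
>       int status;
>
>       /* Move only the touched page, then let the access retry. */
>       move_pages(0, 1, &page, &dest, &status, MPOL_MF_MOVE);
>       mprotect(page, pagesz, PROT_READ | PROT_WRITE);
>   }
>
>   int main(void)
>   {
>       struct sigaction sa = { .sa_sigaction = on_touch,
>                               .sa_flags = SA_SIGINFO };
>
>       pagesz = sysconf(_SC_PAGESIZE);
>       sigaction(SIGSEGV, &sa, 0);
>
>       char *buf = mmap(0, 64 * pagesz, PROT_READ | PROT_WRITE,
>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>       buf[0] = 1;                             /* populate somewhere     */
>       mprotect(buf, 64 * pagesz, PROT_NONE);  /* arm "next touch"       */
>       buf[0] = 2;                             /* fault: page 0 migrates */
>       return 0;
>   }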
>
> For processes that have more tasks than would fit on a node and which want to
> split their activity in a useful fashion, the patch-set introduces two new
> syscalls: sys_numa_tbind()/sys_numa_mbind(). These syscalls can be used to
> create {thread}x{vma} groups which are then scheduled as a unit instead of the
> entire process.
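>
> A mainline-API approximation of one such {thread}x{vma} group pins the
> group's threads to one node's CPUs and mbind()s their buffers to the
> same node; a sketch (link with -lnuma -lpthread; the assumption that
> CPUs 0-5 sit on node 0 matches a machine like the WSM-EP below, check
> your own topology):
>
>   #define _GNU_SOURCE
>   #include <numaif.h>                 /* mbind(), MPOL_BIND */
>   #include <pthread.h>
>   #include <sched.h>
>   #include <sys/mman.h>
>
>   static void *worker(void *buf)
>   {
>       return buf;                     /* work on node-0 memory here */
>   }
>
>   int main(void)
>   {
>       size_t len = 1 << 20;
>       unsigned long node0 = 1UL << 0;
>       void *buf = mmap(0, len, PROT_READ | PROT_WRITE,
>                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>       cpu_set_t cpus;
>       pthread_attr_t attr;
>       pthread_t t;
>
>       /* vma side: this mapping's pages must come from node 0 */
>       mbind(buf, len, MPOL_BIND, &node0, sizeof(node0) * 8, 0);
>
>       /* thread side: run the worker on node 0's CPUs only */
>       CPU_ZERO(&cpus);
>       for (int c = 0; c < 6; c++)     /* assumed: CPUs 0-5 on node 0 */
>           CPU_SET(c, &cpus);
>       pthread_attr_init(&attr);
>       pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
>
>       pthread_create(&t, &attr, worker, buf);
>       pthread_join(t, 0);
>       return 0;
>   }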
>
> That said, it's still early days and there are lots of improvements to make.
>
> On to the actual patches...
>
> The first two are generic cleanups:
>
> [01/26] mm, mpol: Re-implement check_*_range() using walk_page_range()
> [02/26] mm, mpol: Remove NUMA_INTERLEAVE_HIT
>
> The second set is a rework of Lee Schermerhorn's Migrate-on-Fault patches [2]:
>
> [03/26] mm, mpol: add MPOL_MF_LAZY ...
> [04/26] mm, mpol: add MPOL_MF_NOOP
> [05/26] mm, mpol: Check for misplaced page
> [06/26] mm: Migrate misplaced page
> [07/26] mm: Handle misplaced anon pages
> [08/26] mm, mpol: Simplify do_mbind()
>
> The third set implements the basic numa balancing:
>
> [09/26] sched, mm: Introduce tsk_home_node()
> [10/26] mm, mpol: Make mempolicy home-node aware
> [11/26] mm, mpol: Lazy migrate a process/vma
> [12/26] sched, mm: sched_{fork,exec} node assignment
> [13/26] sched: Implement home-node awareness
> [14/26] sched, numa: Numa balancer
> [15/26] sched, numa: Implement hotplug hooks
> [16/26] sched, numa: Abstract the numa_entity
>
> The next three patches are a band-aid; Lai Jiangshan (and Paul McKenney) are
> doing a proper implementation. The reverts are me being lazy about forward
> porting my call_srcu() implementation.
>
> [17/26] srcu: revert1
> [18/26] srcu: revert2
> [19/26] srcu: Implement call_srcu()
>
> The last bits implement the new syscalls:
>
> [20/26] mm, mpol: Introduce vma_dup_policy()
> [21/26] mm, mpol: Introduce vma_put_policy()
> [22/26] mm, mpol: Split and expose some mempolicy functions
> [23/26] sched, numa: Introduce sys_numa_{t,m}bind()
> [24/26] mm, mpol: Implement numa_group RSS accounting
> [25/26] sched, numa: Only migrate long-running entities
> [26/26] sched, numa: A few debug bits
>
>
> And a few numbers...
>
> On my WSM-EP (2 nodes, 6 cores/node, 2 threads/core), I ran 48 stream
> benchmarks [3] (modified to use ~230MB and to run long).
>
> Without these patches it degrades into 50-50 local/remote memory accesses:
>
> Performance counter stats for 'sleep 2':
>
> 259,668,750 r01b7@500b:u [100.00%]
> 262,170,142 r01b7@200b:u
>
> 2.010446121 seconds time elapsed
>
> With the patches there's a significant improvement in locality:
>
> Performance counter stats for 'sleep 2':
>
> 496,860,345 r01b7@500b:u [100.00%]
> 78,292,565 r01b7@200b:u
>
> 2.010707488 seconds time elapsed
>
> (the perf events are a bit magical and not supported in an actual perf
> release -- but the first one counts L3 misses served from local DRAM,
> the second L3 misses served from remote DRAM)
>
> If you look at those numbers you can also see that the sum is greater in
> the second case; this means we can service L3 misses at a higher rate,
> which translates into a performance gain.
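>
> Concretely: locality goes from 259.7M / (259.7M + 262.2M) ~= 50% local
> to 496.9M / (496.9M + 78.3M) ~= 86% local, and the total serviced L3
> misses grow from ~521.8M to ~575.2M, roughly 10% more in the same two
> seconds.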
>
> These numbers also show that while there's a marked improvement, there's still
> some gain to be had. The current numa balancer is still somewhat fickle.
>
> ~ Peter
>
>
> [1] http://marc.info/?l=linux-kernel&m=130218515520540
>     (now that we have SD_OVERLAP it should be fairly easy to do)
>
> [2] http://markmail.org/message/mdwbcitql5ka4uws
>
> [3] https://asc.llnl.gov/computing_resources/purple/archive/benchmarks/memory/stream.tar

-- 
Jeff Squyres
jsquyres_at_[hidden]
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/