Re: [PATCH v16 2/3] mm: Improve RSS counter approximation accuracy for proc interfaces
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date: 2026-01-14 19:21:39
Also in: linux-mm, lkml
On 2026-01-14 11:48, Michal Hocko wrote:
On Wed 14-01-26 09:59:14, Mathieu Desnoyers wrote:
Use hierarchical per-cpu counters for RSS tracking to improve the
accuracy of per-mm RSS sum approximation on large many-core systems [1].
This improves the accuracy of the RSS values returned by proc
interfaces.

This is also a preparation step to introduce a 2-pass OOM killer task
selection which leverages the approximation and accuracy ranges to
quickly eliminate tasks which are outside of the range of the current
selection, and thus reduce the latency introduced by execution of the
OOM killer.

Here is a (possibly incomplete) list of the prior approaches that were
used or proposed, along with their downsides:

1) Per-thread rss tracking: large error on many-thread processes.

2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
   increased system time in make test workloads [1]. Moreover, the
   inaccuracy grows as O(n^2) with the number of CPUs.

3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
   error is high with systems that have lots of NUMA nodes (32 times
   the number of NUMA nodes).

4) Use a precise per-cpu counter sum for each counter value query:
   Requires iteration on each possible CPU for each sum, which adds
   overhead (and thus increases OOM killer latency) on large many-core
   systems running many processes.

The approach proposed here is to replace the per-cpu counters by the
hierarchical per-cpu counters, which bounds the inaccuracy based on the
system topology with O(N*logN).

* Testing results:

Test hardware: 2 sockets AMD EPYC 9654 96-Core Processor (384 logical
CPUs total)

Methodology:

Comparing the current upstream implementation with the hierarchical
counters is done by keeping both implementations wired up in parallel,
and running a single-process, single-threaded program which hops
randomly across CPUs in the system, calling mmap(2) and munmap(2) on
random CPUs, keeping track of an array of allocated mappings, randomly
choosing entries to either map or unmap.
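The test program described in the methodology is not included in this
message. A minimal user-space sketch of that kind of program could look
like the following. Everything in it is an assumption made for
illustration: the names (stress_init, stress_step, count_mapped), the
slot count, the mapping size, and the use of MAP_POPULATE so that each
mapping actually contributes to RSS.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NR_SLOTS 64               /* hypothetical: size of the mapping array */
#define MAP_LEN  (64UL * 4096UL)  /* hypothetical: bytes per mapping */

static void *slots[NR_SLOTS];     /* NULL means the slot is not mapped */
static cpu_set_t initial_cpus;    /* CPUs we were allowed on at startup */

/* Seed the PRNG and remember the CPUs this process may run on. */
static int stress_init(unsigned int seed)
{
	srand(seed);
	return sched_getaffinity(0, sizeof(initial_cpus), &initial_cpus);
}

/* Pin the calling thread to one randomly chosen CPU of the initial set. */
static int hop_to_random_cpu(void)
{
	cpu_set_t target;
	int nr = CPU_COUNT(&initial_cpus);
	int pick, cpu;

	if (nr <= 0)
		return -1;
	pick = rand() % nr;
	for (cpu = 0; cpu < CPU_SETSIZE; cpu++) {
		if (CPU_ISSET(cpu, &initial_cpus) && pick-- == 0)
			break;
	}
	CPU_ZERO(&target);
	CPU_SET(cpu, &target);
	return sched_setaffinity(0, sizeof(target), &target);
}

/* Number of slots that currently hold a mapping. */
static int count_mapped(void)
{
	int i, n = 0;

	for (i = 0; i < NR_SLOTS; i++)
		n += slots[i] != NULL;
	return n;
}

/*
 * One iteration: hop to a random CPU, pick a random slot, then unmap it
 * if it is mapped or map it if it is empty. Returns the change in mapped
 * bytes (positive or negative), or 0 if nothing was changed.
 */
static long stress_step(void)
{
	void *p;
	int i;

	if (hop_to_random_cpu())
		return 0;
	i = rand() % NR_SLOTS;
	if (slots[i]) {
		munmap(slots[i], MAP_LEN);
		slots[i] = NULL;
		return -(long)MAP_LEN;
	}
	/* MAP_POPULATE (an assumption here) faults the pages in right away. */
	p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED)
		return 0;
	slots[i] = p;
	return (long)MAP_LEN;
}
```

Calling stress_init() once and then stress_step() in a loop gives the
random map/unmap pattern; summing the return values yields the precise
number of bytes the program believes it has mapped, which can be
compared against what the proc interfaces report.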
get_mm_counter() is instrumented to compare the upstream counter
approximation to the precise value, and print the delta when going over
a given threshold. The delta of the hierarchical counter approximation
to the precise value is also printed for comparison.

After a few minutes running this test, the upstream implementation
counter approximation reaches a 1GB delta from the precise value,
compared to 80MB delta with the hierarchical counter. The hierarchical
counter provides a guaranteed maximum approximation inaccuracy of 192MB
on that hardware topology.

* Fast path implementation comparison

The new inline percpu_counter_tree_add() uses a this_cpu_add_return()
for the fast path (under a certain allocation size threshold). Above
that, it calls a slow path which "trickles up" the carry to upper level
counters with atomic_add_return.

In comparison, the upstream counters implementation calls
percpu_counter_add_batch which uses this_cpu_try_cmpxchg() on the fast
path, and does a raw_spin_lock_irqsave above a certain threshold.

The hierarchical implementation is therefore expected to have less
contention on mid-sized allocations than the upstream counters because
the atomic counters tracking those bits are only shared across nearby
CPUs. In comparison, the upstream counters immediately use a global
spinlock when reaching the threshold.

* Benchmarks

Using will-it-scale page_fault1 benchmarks to compare the upstream
counters to the hierarchical counters. This is done with hyperthreading
disabled. The speedup is within the standard deviation of the upstream
runs, so the overhead is not significant.
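As a rough illustration of the fast path and the "trickle up" slow path
described under "Fast path implementation comparison", here is a small
user-space model of a three-level counter. It is not the kernel's
percpu_counter_tree code: the struct, the function names, the group
size and the batch thresholds are all hypothetical, a plain array
indexed by a caller-supplied CPU number stands in for per-CPU data, and
the carry rule (move the whole value up once it crosses a threshold) is
a simplification.

```c
#include <stdatomic.h>
#include <stdlib.h>

#define NR_CPUS        8    /* hypothetical: simulated CPU count */
#define CPUS_PER_GROUP 4    /* hypothetical: CPUs sharing a mid-level counter */
#define NR_GROUPS      (NR_CPUS / CPUS_PER_GROUP)
#define LEAF_BATCH     32   /* hypothetical: carry threshold of a leaf */
#define GROUP_BATCH    128  /* hypothetical: carry threshold of a group */

/* Worst-case distance between the root value and the precise sum. */
#define MAX_ERROR \
	(NR_CPUS * (LEAF_BATCH - 1) + NR_GROUPS * (GROUP_BATCH - 1))

struct tree_counter {
	long leaf[NR_CPUS];            /* stand-in for per-CPU counters */
	atomic_long group[NR_GROUPS];  /* shared by nearby CPUs only */
	atomic_long root;              /* approximate sum */
};

static void tree_add(struct tree_counter *c, int cpu, long v)
{
	atomic_long *group = &c->group[cpu / CPUS_PER_GROUP];
	long carry;

	/* Fast path: CPU-local update, no shared counter is touched. */
	c->leaf[cpu] += v;
	if (labs(c->leaf[cpu]) < LEAF_BATCH)
		return;

	/* Slow path: move the leaf value up to its group counter. */
	carry = c->leaf[cpu];
	c->leaf[cpu] = 0;
	carry += atomic_fetch_add(group, carry);  /* carry = new group value */
	if (labs(carry) < GROUP_BATCH)
		return;

	/* The group crossed its threshold: move its value up to the root. */
	atomic_fetch_sub(group, carry);
	atomic_fetch_add(&c->root, carry);
}

/* Cheap query: read only the root. */
static long tree_approx(struct tree_counter *c)
{
	return atomic_load(&c->root);
}

/* Expensive query: walk every level, as a precise sum must. */
static long tree_precise(struct tree_counter *c)
{
	long sum = atomic_load(&c->root);
	int i;

	for (i = 0; i < NR_GROUPS; i++)
		sum += atomic_load(&c->group[i]);
	for (i = 0; i < NR_CPUS; i++)
		sum += c->leaf[i];
	return sum;
}
```

In this model every update either stays in a leaf or moves a value from
one level to the next, so the sum over all levels always equals the
precise total, and when used from a single thread the approximate value
read from the root never differs from it by more than MAX_ERROR. Only
the group counters are contended, and only by the CPUs of that group.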
                                        upstream   hierarchical   speedup
page_fault1_processes -s 100 -t 1         614783         615558     +0.1%
page_fault1_threads -s 100 -t 1           612788         612447     -0.1%
page_fault1_processes -s 100 -t 96      37994977       37932035     -0.2%
page_fault1_threads -s 100 -t 96         2484130        2504860     +0.8%
page_fault1_processes -s 100 -t 192     71262917       71118830     -0.2%
page_fault1_threads -s 100 -t 192        2446437        2469296     +0.1%

This change depends on the following patch:
"mm: Fix OOM killer inaccuracy on large many-core systems" [2]

As mentioned in the previous patch, it would be great to explicitly
mention what is the memory price for the new tracking data structure.
Yes, I can add the explanation here as well.
Other than that, this seems like a generally useful improvement for
larger systems, and it is my understanding that it adds almost no
overhead on small systems, correct?
Indeed, the impact is mostly on large many-core systems, not so much on
smaller systems.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com