Re: [PATCH v16 2/3] mm: Improve RSS counter approximation accuracy for proc interfaces | linux-trace-kernel

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date: 2026-01-14 19:21:39
Also in: linux-mm, lkml

On 2026-01-14 11:48, Michal Hocko wrote:

On Wed 14-01-26 09:59:14, Mathieu Desnoyers wrote:

Use hierarchical per-cpu counters for RSS tracking to improve the
accuracy of per-mm RSS sum approximation on large many-core systems [1].
This improves the accuracy of the RSS values returned by proc
interfaces.

This is also a preparation step to introduce a 2-pass OOM killer task
selection which leverages the approximation and accuracy ranges to
quickly eliminate tasks which are outside of the range of the current
selection, and thus reduce the latency introduced by execution of the
OOM killer.

Here is a (possibly incomplete) list of the prior approaches that were
used or proposed, along with their downside:

1) Per-thread rss tracking: large error on many-thread processes.

2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
    increased system time in make test workloads [1]. Moreover, the
    inaccuracy grows as O(n^2) with the number of CPUs.

3) Per-NUMA-node counters: requires atomics on fast-path (overhead),
    error is high with systems that have lots of NUMA nodes (32 times
    the number of NUMA nodes).

4) Use a precise per-cpu counter sum for each counter value query:
    Requires iterating over every possible CPU for each sum, which
    adds overhead (and thus increases OOM killer latency) on large
    many-core systems running many processes.
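
To make approaches 2) and 4) concrete, here is a minimal userspace
sketch of a batched per-CPU counter. It is not the kernel's
percpu_counter code: the struct, the function names and SIM_NR_CPUS are
hypothetical, and the max(32, 2 * nr_cpus) batch formula is an
assumption about the upstream default, not something stated in this
thread. The approximate read is O(1) but may lag by up to
nr_cpus * (batch - 1); the precise read removes that error at the cost
of walking every CPU.

```c
#include <stdint.h>

#define SIM_NR_CPUS 8	/* hypothetical CPU count for the simulation */

struct batched_counter {
	int64_t global;			/* approximate value, cheap to read */
	int32_t local[SIM_NR_CPUS];	/* per-CPU deltas not yet flushed */
	int32_t batch;			/* flush threshold */
};

/* Add v on behalf of cpu; flush the local delta once it reaches the batch. */
static void batched_add(struct batched_counter *c, int cpu, int32_t v)
{
	int32_t n = c->local[cpu] + v;

	if (n >= c->batch || n <= -c->batch) {
		c->global += n;
		c->local[cpu] = 0;
	} else {
		c->local[cpu] = n;
	}
}

/* Approach 2: O(1) read that ignores the unflushed per-CPU deltas. */
static int64_t batched_read_approx(const struct batched_counter *c)
{
	return c->global;
}

/* Approach 4: exact read that iterates over every CPU. */
static int64_t batched_read_precise(const struct batched_counter *c)
{
	int64_t sum = c->global;

	for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
		sum += c->local[cpu];
	return sum;
}

/* Assumed default batch size: max(32, 2 * nr_cpus). */
static int64_t assumed_batch(int64_t nr_cpus)
{
	int64_t b = 2 * nr_cpus;

	return b > 32 ? b : 32;
}

/* Worst-case gap between the approximate and the precise read. */
static int64_t batched_max_error(int64_t nr_cpus, int64_t batch)
{
	return nr_cpus * (batch - 1);
}
```

Under that assumed formula, 384 CPUs give a batch of 768 and a worst
case of 384 * 767 = 294528 pages, roughly 1.1GB with 4KiB pages. The
quadratic growth comes from the batch itself scaling with the CPU count.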

The approach proposed here is to replace the per-cpu counters with
hierarchical per-cpu counters, which bound the inaccuracy to
O(N*logN) based on the system topology.

* Testing results:

Test hardware: 2 sockets AMD EPYC 9654 96-Core Processor (384 logical CPUs total)

Methodology:

Comparing the current upstream implementation with the hierarchical
counters is done by keeping both implementations wired up in parallel,
and running a single-process, single-threaded program which hops
randomly across CPUs in the system, calling mmap(2) and munmap(2) on
random CPUs, keeping track of an array of allocated mappings, randomly
choosing entries to either map or unmap.

get_mm_counter() is instrumented to compare the upstream counter
approximation to the precise value, and print the delta when going over
a given threshold. The delta of the hierarchical counter approximation
to the precise value is also printed for comparison.
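
The comparison described above can be pictured with the following
sketch. It is a userspace stand-in, not the instrumentation used for the
patch: counter_delta(), check_delta(), their parameters and the message
format are all hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Absolute distance between an approximation and the precise value. */
static int64_t counter_delta(int64_t approx, int64_t precise)
{
	return approx > precise ? approx - precise : precise - approx;
}

/*
 * Print both deltas when the upstream approximation drifts past
 * threshold. Returns 1 if a report was printed, 0 otherwise.
 */
static int check_delta(int64_t precise, int64_t upstream_approx,
		       int64_t tree_approx, int64_t threshold)
{
	int64_t upstream_delta = counter_delta(upstream_approx, precise);
	int64_t tree_delta = counter_delta(tree_approx, precise);

	if (upstream_delta <= threshold)
		return 0;

	printf("precise=%lld upstream_delta=%lld tree_delta=%lld\n",
	       (long long)precise, (long long)upstream_delta,
	       (long long)tree_delta);
	return 1;
}
```

Reporting the hierarchical delta only when the upstream delta crosses
the threshold keeps the output focused on the cases where the two
approximations can be compared side by side.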

After a few minutes running this test, the upstream implementation
counter approximation reaches a 1GB delta from the precise value,
compared to an 80MB delta with the hierarchical counter.
The hierarchical counter provides a guaranteed maximum approximation
inaccuracy of 192MB on that hardware topology.

* Fast path implementation comparison

The new inline percpu_counter_tree_add() uses a this_cpu_add_return()
for the fast path (under a certain allocation size threshold).  Above
that, it calls a slow path which "trickles up" the carry to upper level
counters with atomic_add_return().

In comparison, the upstream counters implementation calls
percpu_counter_add_batch(), which uses this_cpu_try_cmpxchg() on the fast
path, and takes raw_spin_lock_irqsave() above a certain threshold.

The hierarchical implementation is therefore expected to have less
contention on mid-sized allocations than the upstream counters because
the atomic counters tracking those bits are only shared across nearby
CPUs. In comparison, the upstream counters immediately use a global
spinlock when reaching the threshold.
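
The "trickle up" idea can be sketched in userspace as a two-level tree.
This is a simplified illustration under stated assumptions, not the
patch's percpu_counter_tree code: the SIM_* constants, the struct layout
and the fixed batch sizes are hypothetical, and plain integers stand in
for this_cpu_add_return() and atomic_add_return().

```c
#include <stdint.h>

#define SIM_NR_CPUS	8	/* hypothetical CPU count */
#define SIM_GROUP	4	/* hypothetical CPUs per intermediate node */
#define SIM_NR_GROUPS	(SIM_NR_CPUS / SIM_GROUP)
#define SIM_LEAF_BATCH	4	/* carry leaves the leaf at this magnitude */
#define SIM_MID_BATCH	16	/* carry leaves the middle node at this one */

struct tree_counter {
	int64_t root;			/* approximate value, read in O(1) */
	int64_t mid[SIM_NR_GROUPS];	/* shared by nearby CPUs only */
	int64_t leaf[SIM_NR_CPUS];	/* private to one CPU */
};

static int64_t abs64(int64_t v)
{
	return v < 0 ? -v : v;
}

static void tree_add(struct tree_counter *c, int cpu, int64_t v)
{
	int64_t *leaf = &c->leaf[cpu];
	int64_t *mid = &c->mid[cpu / SIM_GROUP];

	*leaf += v;
	if (abs64(*leaf) < SIM_LEAF_BATCH)
		return;			/* fast path: only the leaf is touched */

	*mid += *leaf;			/* slow path: carry to the group node */
	*leaf = 0;
	if (abs64(*mid) < SIM_MID_BATCH)
		return;

	c->root += *mid;		/* rarest case: carry reaches the root */
	*mid = 0;
}

static int64_t tree_read_approx(const struct tree_counter *c)
{
	return c->root;
}

static int64_t tree_read_precise(const struct tree_counter *c)
{
	int64_t sum = c->root;

	for (int i = 0; i < SIM_NR_GROUPS; i++)
		sum += c->mid[i];
	for (int i = 0; i < SIM_NR_CPUS; i++)
		sum += c->leaf[i];
	return sum;
}
```

In this sketch, small updates stay in the caller's leaf, mid-sized
carries land on a node shared only within the group, and the root is
written least often, which is where the reduced contention is expected
to come from.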

* Benchmarks

The will-it-scale page_fault1 benchmarks are used to compare the upstream
counters to the hierarchical counters, with hyperthreading disabled. The
speedup is within the standard deviation of the upstream runs, so the
overhead is not significant.

                                           upstream   hierarchical    speedup
page_fault1_processes -s 100 -t 1           614783         615558      +0.1%
page_fault1_threads -s 100 -t 1             612788         612447      -0.1%
page_fault1_processes -s 100 -t 96        37994977       37932035      -0.2%
page_fault1_threads -s 100 -t 96           2484130        2504860      +0.8%
page_fault1_processes -s 100 -t 192       71262917       71118830      -0.2%
page_fault1_threads -s 100 -t 192          2446437        2469296      +0.9%
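
For reference, the speedup column appears to be the relative change of
the hierarchical result against the upstream one. A small sketch of that
arithmetic follows; speedup_percent() is a hypothetical helper name and
the formula is an assumption inferred from the table, not taken from the
benchmark scripts.

```c
/* Relative change of the hierarchical run versus upstream, in percent. */
static double speedup_percent(double upstream, double hierarchical)
{
	return (hierarchical - upstream) / upstream * 100.0;
}
```

For the first row, speedup_percent(614783, 615558) is about 0.13, which
rounds to the +0.1% shown.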

This change depends on the following patch:
"mm: Fix OOM killer inaccuracy on large many-core systems" [2]

As mentioned in the previous patch, it would be great to explicitly
mention the memory cost of the new tracking data structure.

Yes, I can add the explanation here as well.

Other than that this seems like a generally useful improvement for
larger systems and it is my understanding that it adds almost no
overhead on small systems, correct?

Indeed, the impact is mostly on large many-core systems, not so much on
smaller systems.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
