Thread (160 messages) 160 messages, 20 authors, 2023-05-08

Re: [PATCH 00/40] Memory allocation profiling

From: Tejun Heo <tj@kernel.org>
Date: 2023-05-03 18:19:34
Also in: cgroups, linux-arch, linux-doc, linux-fsdevel, linux-iommu, linux-mm, lkml

Hello,

On Wed, May 03, 2023 at 02:07:26PM -0400, Johannes Weiner wrote:
...
quoted
* Because tracking starts when the script starts running, it doesn't know
  anything which has happened upto that point, so you gotta pay attention to
  handling e.g. handling frees which don't match allocs. It's kinda annoying
  but not a huge problem usually. There are ways to build in BPF progs into
  the kernel and load it early but I haven't experiemnted with it yet
  personally.
Yeah, early loading is definitely important, especially before module
loading etc.

One common usecase is that we see a machine in the wild with a high
amount of kernel memory disappearing somewhere that isn't voluntarily
reported in vmstat/meminfo. Reproducing it isn't always
practical. Something that records early and always (with acceptable
runtime overhead) would be the holy grail.

Matching allocs to frees is doable using the pfn as the key for pages,
and virtual addresses for slab objects.

The biggest issue I had when I tried with bpf was losing updates to
the map. IIRC there is some trylocking going on to avoid deadlocks
from nested contexts (alloc interrupted, interrupt frees). It doesn't
sound like an unsolvable problem, though.
(cc'ing Alexei and Andrii)

This is the same thing that I hit with sched_ext. BPF plugged it for
struct_ops but I wonder whether it can be done for specific maps / progs -
ie. just declare that a given map or prog is not to be accessed from NMI and
bypass the trylock deadlock avoidance mechanism. But, yeah, this should be
addressed from BPF side.
Another minor thing was the stack trace map exploding on a basically
infinite number of unique interrupt stacks. This could probably also
be solved by extending the trace extraction API to cut the frames off
at the context switch boundary.

Taking a step back though, given the multitude of allocation sites in
the kernel, it's a bit odd that the only accounting we do is the tiny
fraction of voluntary vmstat/meminfo reporting. We try to cover the
biggest consumers with this of course, but it's always going to be
incomplete and is maintenance overhead too. There are on average
several gigabytes in unknown memory (total - known vmstats) on our
machines. It's difficult to detect regressions easily. And it's per
definition the unexpected cornercases that are the trickiest to track
down. So it might be doable with BPF, but it does feel like the kernel
should do a better job of tracking out of the box and without
requiring too much plumbing and somewhat fragile kernel allocation API
tracking and probing from userspace.
Yeah, easy / default visibility argument does make sense to me.

Thanks.

-- 
tejun
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help