Thread (160 messages) 160 messages, 20 authors, 2023-05-08

Re: [PATCH 00/40] Memory allocation profiling

From: Suren Baghdasaryan <surenb@google.com>
Date: 2023-05-03 15:09:46
Also in: cgroups, linux-arch, linux-doc, linux-fsdevel, linux-iommu, linux-mm, lkml

On Wed, May 3, 2023 at 12:25 AM Michal Hocko [off-list ref] wrote:
On Mon 01-05-23 09:54:10, Suren Baghdasaryan wrote:
quoted
Memory allocation profiling infrastructure provides a low overhead
mechanism to make all kernel allocations in the system visible. It can be
used to monitor memory usage, track memory hotspots, detect memory leaks,
identify memory regressions.

To keep the overhead to the minimum, we record only allocation sizes for
every allocation in the codebase. With that information, if users are
interested in more detailed context for a specific allocation, they can
enable in-depth context tracking, which includes capturing the pid, tgid,
task name, allocation size, timestamp and call stack for every allocation
at the specified code location.
[...]
quoted
Implementation utilizes a more generic concept of code tagging, introduced
as part of this patchset. Code tag is a structure identifying a specific
location in the source code which is generated at compile time and can be
embedded in an application-specific structure. A number of applications
for code tagging have been presented in the original RFC [1].
Code tagging uses the old trick of "define a special elf section for
objects of a given type so that we can iterate over them at runtime" and
creates a proper library for it.

To profile memory allocations, we instrument page, slab and percpu
allocators to record total memory allocated in the associated code tag at
every allocation in the codebase. Every time an allocation is performed by
an instrumented allocator, the code tag at that location increments its
counter by allocation size. Every time the memory is freed the counter is
decremented. To decrement the counter upon freeing, allocated object needs
a reference to its code tag. Page allocators use page_ext to record this
reference while slab allocators use memcg_data (renamed into more generic
slabobj_ext) of the slab page.
[...]
quoted
[1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/ (local)
[...]
quoted
 70 files changed, 2765 insertions(+), 554 deletions(-)
Sorry for cutting the cover considerably but I believe I have quoted the
most important/interesting parts here. The approach is not fundamentally
different from the previous version [1] and there was a significant
discussion around this approach. The cover letter doesn't summarize nor
deal with concerns expressed previous AFAICS. So let me bring those up
back.
Thanks for summarizing!
At least those I find the most important:
- This is a big change and it adds a significant maintenance burden
  because each allocation entry point needs to be handled specifically.
  The cost will grow with the intended coverage especially there when
  allocation is hidden in a library code.
Do you mean with more allocations in the codebase more codetags will
be generated? Is that the concern? Or maybe as you commented in
another patch that context capturing feature does not limit how many
stacks will be captured?
- It has been brought up that this is duplicating functionality already
  available via existing tracing infrastructure. You should make it very
  clear why that is not suitable for the job
I experimented with using tracing with _RET_IP_ to implement this
accounting. The major issue is the _RET_IP_ to codetag lookup runtime
overhead which is orders of magnitude higher than proposed code
tagging approach. With code tagging proposal, that link is resolved at
compile time. Since we want this mechanism deployed in production, we
want to keep the overhead to the absolute minimum.
You asked me before how much overhead would be tolerable and the
answer will always be "as small as possible". This is especially true
for slab allocators which are ridiculously fast and regressing them
would be very noticable (due to the frequent use).

There is another issue, which I think can be solved in a smart way but
will either affect performance or would require more memory. With the
tracing approach we don't know beforehand how many individual
allocation sites exist, so we have to allocate code tags (or similar
structures for counting) at runtime vs compile time. We can be smart
about it and allocate in batches or even preallocate more than we need
beforehand but, as I said, it will require some kind of compromise.

I understand that code tagging creates additional maintenance burdens
but I hope it also produces enough benefits that people will want
this. The cost is also hopefully amortized when additional
applications like the ones we presented in RFC [1] are built using the
same framework.
- We already have page_owner infrastructure that provides allocation
  tracking data. Why it cannot be used/extended?
1. The overhead.
2. Covers only page allocators.

I didn't think about extending the page_owner approach to slab
allocators but I suspect it would not be trivial. I don't see
attaching an owner to every slab object to be a scalable solution. The
overhead would again be of concern here.

I should point out that there was one important technical concern
about lack of a kill switch for this feature, which was an issue for
distributions that can't disable the CONFIG flag. In this series we
addressed that concern.

[1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/ (local)

Thanks,
Suren.
Thanks!
--
Michal Hocko
SUSE Labs
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help