Thread: 138 messages, 17 authors, 2022-09-08

Re: [RFC PATCH 00/30] Code tagging framework and applications

From: Suren Baghdasaryan <surenb@google.com>
Date: 2022-09-05 18:07:46
Also in: io-uring, linux-arch, linux-bcache, linux-iommu, linux-mm, lkml, xen-devel

On Mon, Sep 5, 2022 at 1:58 AM Marco Elver [off-list ref] wrote:
On Mon, 5 Sept 2022 at 10:12, Michal Hocko [off-list ref] wrote:
On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko [off-list ref] wrote:
[...]
Yes, tracking back the call trace would be really needed. The question
is whether this is really prohibitively expensive. How much overhead are
we talking about? There is no free lunch here, really.  You either have
the overhead during runtime when the feature is used or on the source
code level for all the future development (with a maze of macros and
wrappers).
As promised, I profiled simple code that repeatedly makes 10
allocations/frees in a loop and measured overheads of code tagging,
call stack capturing and tracing+BPF for page and slab allocations.
Summary:

Page allocations (overheads are compared to get_free_pages() duration):
6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
8.8% lookup_page_ext
1237% call stack capture
139% tracepoint with attached empty BPF program
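
To put the percentages above in perspective, here is a small sketch that converts them into slowdown factors relative to a bare get_free_pages() call. The numbers are copied from the summary; the helper names (`slowdown`, `ratio`) are made up for illustration, and the sketch assumes each overhead was measured in isolation against the same baseline.

```python
# Overheads from the summary above, as a percentage of get_free_pages() duration.
overhead_pct = {
    "codetag counters": 6.8,
    "lookup_page_ext": 8.8,
    "call stack capture": 1237.0,
    "tracepoint + empty BPF": 139.0,
}


def slowdown(pct):
    """Total cost relative to the bare allocation (1.0 means no overhead)."""
    return 1.0 + pct / 100.0


def ratio(a, b):
    """How many times larger overhead `a` is than overhead `b`."""
    return overhead_pct[a] / overhead_pct[b]


if __name__ == "__main__":
    for name, pct in overhead_pct.items():
        print(f"{name:24s} {slowdown(pct):6.2f}x baseline")
    print("stack capture vs. codetag counters:",
          round(ratio("call stack capture", "codetag counters")), "times")
```

Read this way, capturing a call stack makes the allocation more than 13 times as expensive as the bare call, while the codetag counters add under 7%.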
Yes, I am not surprised that the call stack capturing is really
expensive compared to the allocator fast path (which is really highly
optimized and I suspect that with 10 allocation/free loop you mostly get
your memory from the pcp lists). Is this overhead still _that_ visible
for somewhat less microoptimized workloads which have to take slow paths
as well?

Also what kind of stack unwinder is configured (I guess ORC)? This is
not my area but from what I remember the unwinder overhead varies
between ORC and FP.

And just to make it clear. I do realize that an overhead from the stack
unwinding is unavoidable. And code tagging would logically have lower
overhead as it performs much less work. But the main point is whether
our existing stack unwinding approach is really prohibitively expensive
to be used for debugging purposes on production systems. I might
misremember but I recall people having bigger concerns with page_owner
memory footprint than the actual stack unwinder overhead.
This is just to point out that we've also been looking at cheaper
collection of the stack trace (for KASAN and other sanitizers). The
cheapest way to unwind the stack would be a system with "shadow call
stack" enabled. With compiler support it's available on arm64, see
CONFIG_SHADOW_CALL_STACK. For x86 the hope is that at some point the
kernel will support CET, which newer Intel and AMD CPUs support.
Collecting the call stack would then be a simple memcpy.
Thanks for the note, Marco! I'll check out CONFIG_SHADOW_CALL_STACK
on Android.
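
For readers unfamiliar with the idea Marco describes, the following is a rough user-space simulation of why a shadow call stack makes capture cheap. It is not kernel code: the real mechanism stores return addresses in a separate memory region maintained by compiler-inserted instructions, whereas this sketch stores function names in a Python list. All names here (`_shadow`, `tracked`, `capture`, and the three example functions) are hypothetical.

```python
import functools

# Stand-in for the shadow stack: one entry per active call.
_shadow = []


def tracked(fn):
    """Push an entry on call and pop it on return, like the compiler
    instrumentation would do for return addresses."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        _shadow.append(fn.__name__)
        try:
            return fn(*args, **kwargs)
        finally:
            _shadow.pop()
    return wrapper


def capture():
    """Capturing the call stack is a plain copy of the shadow stack;
    no frame walking or unwind tables are involved."""
    return tuple(_shadow)


@tracked
def allocate():
    # An allocation site that records who called it.
    return capture()


@tracked
def driver_init():
    return allocate()


@tracked
def module_load():
    return driver_init()


if __name__ == "__main__":
    print(module_load())
```

The point of the sketch is that the cost of `capture()` depends only on the copy, which is the analogue of the "simple memcpy" mentioned above, rather than on walking frames with an ORC or frame-pointer unwinder.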