Thread (138 messages) 138 messages, 17 authors, 2022-09-08

Re: [RFC PATCH 00/30] Code tagging framework and applications

From: Suren Baghdasaryan <surenb@google.com>
Date: 2022-09-06 16:11:46
Also in: io-uring, linux-arch, linux-bcache, linux-iommu, linux-mm, lkml, xen-devel

On Tue, Sep 6, 2022 at 1:01 AM Michal Hocko [off-list ref] wrote:
On Mon 05-09-22 11:03:35, Suren Baghdasaryan wrote:
quoted
On Mon, Sep 5, 2022 at 1:12 AM Michal Hocko [off-list ref] wrote:
quoted
On Sun 04-09-22 18:32:58, Suren Baghdasaryan wrote:
quoted
On Thu, Sep 1, 2022 at 12:15 PM Michal Hocko [off-list ref] wrote:
[...]
quoted
quoted
Yes, tracking back the call trace would be really needed. The question
is whether this is really prohibitively expensive. How much overhead are
we talking about? There is no free lunch here, really.  You either have
the overhead during runtime when the feature is used or on the source
code level for all the future development (with a maze of macros and
wrappers).
As promised, I profiled a simple code that repeatedly makes 10
allocations/frees in a loop and measured overheads of code tagging,
call stack capturing and tracing+BPF for page and slab allocations.
Summary:

Page allocations (overheads are compared to get_free_pages() duration):
6.8% Codetag counter manipulations (__lazy_percpu_counter_add + __alloc_tag_add)
8.8% lookup_page_ext
1237% call stack capture
139% tracepoint with attached empty BPF program
Yes, I am not surprised that the call stack capturing is really
expensive comparing to the allocator fast path (which is really highly
optimized and I suspect that with 10 allocation/free loop you mostly get
your memory from the pcp lists). Is this overhead still _that_ visible
for somehow less microoptimized workloads which have to take slow paths
as well?
Correct, it's a comparison with the allocation fast path, so in a
sense represents the worst case scenario. However at the same time the
measurements are fair because they measure the overheads against the
same meaningful baseline, therefore can be used for comparison.
Yes, I am not saying it is an unfair comparision. It is just not a
particularly practical one for real life situations. So I am not sure
you can draw many conclusions from that. Or let me put it differently.
There is not real point comparing the code tagging and stack unwiding
approaches because the later is simply more complex because it collects
more state. The main question is whether that additional state
collection is too expensive to be practically used.
You asked me to provide the numbers in one of your replies, that's what I did.
quoted
quoted
Also what kind of stack unwinder is configured (I guess ORC)? This is
not my area but from what I remember the unwinder overhead varies
between ORC and FP.
I used whatever is default and didn't try other mechanisms. Don't
think the difference would be orders of magnitude better though.
quoted
And just to make it clear. I do realize that an overhead from the stack
unwinding is unavoidable. And code tagging would logically have lower
overhead as it performs much less work. But the main point is whether
our existing stack unwiding approach is really prohibitively expensive
to be used for debugging purposes on production systems. I might
misremember but I recall people having bigger concerns with page_owner
memory footprint than the actual stack unwinder overhead.
That's one of those questions which are very difficult to answer (if
even possible) because that would depend on the use scenario. If the
workload allocates frequently then adding the overhead will likely
affect it, otherwise might not be even noticeable. In general, in
pre-production testing we try to minimize the difference in
performance and memory profiles between the software we are testing
and the production one. From that point of view, the smaller the
overhead, the better. I know it's kinda obvious but unfortunately I
have no better answer to that question.
This is clear but it doesn't really tell whether the existing tooling is
unusable for _your_ or any specific scenarios. Because when we are
talking about adding quite a lot of code and make our allocators APIs
more complicated to track the state then we should carefully weigh the
benefit and the cost. As replied to other email I am really skeptical
this patchset is at the final stage and the more allocators get covered
the more code we have to maintain. So there must be a very strong reason
to add it.
The patchset is quite complete at this point. Instrumenting new
allocators takes 3 lines of code, see how kmalloc_hooks macro is used
in https://lore.kernel.org/all/20220830214919.53220-17-surenb@google.com/ (local)
quoted
For the memory overhead, in my early internal proposal with assumption
of 10000 instrumented allocation call sites, I've made some
calculations for an 8GB 8-core system (quite typical for Android) and
ended up with the following:

                                    per-cpu counters      atomic counters
page_ext references     16MB                      16MB
slab object references   10.5MB                   10.5MB
alloc_tags                      900KB                    312KB
Total memory overhead 27.4MB                  26.8MB
I do not really think this is all that interesting because the major
memory overhead contributors (page_ext and objcg are going to be there
with other approaches that want to match alloc and free as that clearly
requires to store the allocator objects somewhere).
You mentioned that memory consumption in the page_owner approach was
more important overhead, so I provided the numbers for that part of
the discussion.
quoted
so, about 0.34% of the total memory. Our implementation has changed
since then and the number might not be completely correct but it
should be in the ballpark.
I just checked the number of instrumented calls that we currently have
in the 6.0-rc3 built with defconfig and it's 165 page allocation and
2684 slab allocation sites. I readily accept that we are probably
missing some allocations and additional modules can also contribute to
these numbers but my guess it's still less than 10000 that I used in
my calculations.
yes, in the current implementation you are missing most indirect users
of the page allocator as stated elsewhere so the usefulness can be
really limited. A better coverege will not increase the memory
consumption much but it will add an additional maintenance burden that
will scale with different usecases.
Your comments in the last two letters about needing the stack tracing
and covering indirect users of the allocators makes me think that you
missed my reply here:
https://lore.kernel.org/all/CAJuCfpGZ==v0HGWBzZzHTgbo4B_ZBe6V6U4T_788LVWj8HhCRQ@mail.gmail.com/ (local).
I messed up with formatting but hopefully it's still readable. The
idea of having two stage tracking - first one very cheap and the
second one more in-depth I think should address your concerns about
indirect users.
Thanks,
Suren.
--
Michal Hocko
SUSE Labs
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help