Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

From: Liao, Chang <hidden>
Date: 2024-08-12 12:05:43
Also in: bpf, linux-perf-users, lkml


On 2024/8/10 2:40, Andrii Nakryiko wrote:

On Fri, Aug 9, 2024 at 11:34 AM Andrii Nakryiko
[off-list ref] wrote:

On Fri, Aug 9, 2024 at 12:16 AM Liao, Chang [off-list ref] wrote:

On 2024/8/9 2:26, Andrii Nakryiko wrote:

On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang [off-list ref] wrote:

Hi Andrii and Oleg.

This patch, which I sent two weeks ago, also aims to optimize the performance of
uprobes on arm64. I noticed the recent discussions on the mailing list about the
performance and scalability of uprobes. Given this interest, I've added you and
other relevant maintainers to the CC list for broader visibility and potential
collaboration.

Hi Liao,

As you can see, there is active work underway to improve uprobes. It
changes the lifetime management of uprobes, removes a bunch of locks taken
in the uprobe/uretprobe hot path, etc. It would be nice if you could
hold off a bit with your changes until all of that lands, and then
re-benchmark, as costs might shift.

Andrii, I'm trying to integrate your lockless changes into the upstream
next-20240806 kernel tree, and I ran into some conflicts. Please let me
know which kernel tree you're currently working on.

My patches are based on tip/perf/core. But I also just pushed all the
changes I have accumulated (including patches I haven't sent for
review just yet), plus your sighand lock removal patches applied
on top, into [0]. So you can take a look and use that as a base for
now. Keep in mind, a bunch of those patches might still change, but
this should give you the best currently achievable performance with
uprobes/uretprobes. E.g., I'm getting something like below on x86-64
(note somewhat linear scalability with number of CPU cores, with
per-CPU performance *slowly* declining):

uprobe-nop            ( 1 cpus):    3.565 ± 0.004M/s  (  3.565M/s/cpu)
uprobe-nop            ( 2 cpus):    6.742 ± 0.009M/s  (  3.371M/s/cpu)
uprobe-nop            ( 3 cpus):   10.029 ± 0.056M/s  (  3.343M/s/cpu)
uprobe-nop            ( 4 cpus):   13.118 ± 0.014M/s  (  3.279M/s/cpu)
uprobe-nop            ( 5 cpus):   16.360 ± 0.011M/s  (  3.272M/s/cpu)
uprobe-nop            ( 6 cpus):   19.650 ± 0.045M/s  (  3.275M/s/cpu)
uprobe-nop            ( 7 cpus):   22.926 ± 0.010M/s  (  3.275M/s/cpu)
uprobe-nop            ( 8 cpus):   24.707 ± 0.025M/s  (  3.088M/s/cpu)
uprobe-nop            (10 cpus):   30.842 ± 0.018M/s  (  3.084M/s/cpu)
uprobe-nop            (12 cpus):   33.623 ± 0.037M/s  (  2.802M/s/cpu)
uprobe-nop            (14 cpus):   39.199 ± 0.009M/s  (  2.800M/s/cpu)
uprobe-nop            (16 cpus):   41.698 ± 0.018M/s  (  2.606M/s/cpu)
uprobe-nop            (24 cpus):   65.078 ± 0.018M/s  (  2.712M/s/cpu)
uprobe-nop            (32 cpus):   84.580 ± 0.017M/s  (  2.643M/s/cpu)
uprobe-nop            (40 cpus):  101.992 ± 0.268M/s  (  2.550M/s/cpu)
uprobe-nop            (48 cpus):  101.032 ± 1.428M/s  (  2.105M/s/cpu)
uprobe-nop            (56 cpus):  110.986 ± 0.736M/s  (  1.982M/s/cpu)
uprobe-nop            (64 cpus):  124.145 ± 0.110M/s  (  1.940M/s/cpu)
uprobe-nop            (72 cpus):  134.940 ± 0.200M/s  (  1.874M/s/cpu)
uprobe-nop            (80 cpus):  143.918 ± 0.235M/s  (  1.799M/s/cpu)

uretprobe-nop         ( 1 cpus):    1.987 ± 0.001M/s  (  1.987M/s/cpu)
uretprobe-nop         ( 2 cpus):    3.766 ± 0.003M/s  (  1.883M/s/cpu)
uretprobe-nop         ( 3 cpus):    5.638 ± 0.002M/s  (  1.879M/s/cpu)
uretprobe-nop         ( 4 cpus):    7.275 ± 0.003M/s  (  1.819M/s/cpu)
uretprobe-nop         ( 5 cpus):    9.124 ± 0.004M/s  (  1.825M/s/cpu)
uretprobe-nop         ( 6 cpus):   10.818 ± 0.007M/s  (  1.803M/s/cpu)
uretprobe-nop         ( 7 cpus):   12.721 ± 0.014M/s  (  1.817M/s/cpu)
uretprobe-nop         ( 8 cpus):   13.639 ± 0.007M/s  (  1.705M/s/cpu)
uretprobe-nop         (10 cpus):   17.023 ± 0.009M/s  (  1.702M/s/cpu)
uretprobe-nop         (12 cpus):   18.576 ± 0.014M/s  (  1.548M/s/cpu)
uretprobe-nop         (14 cpus):   21.660 ± 0.004M/s  (  1.547M/s/cpu)
uretprobe-nop         (16 cpus):   22.922 ± 0.013M/s  (  1.433M/s/cpu)
uretprobe-nop         (24 cpus):   34.756 ± 0.069M/s  (  1.448M/s/cpu)
uretprobe-nop         (32 cpus):   44.869 ± 0.153M/s  (  1.402M/s/cpu)
uretprobe-nop         (40 cpus):   53.397 ± 0.220M/s  (  1.335M/s/cpu)
uretprobe-nop         (48 cpus):   48.903 ± 2.277M/s  (  1.019M/s/cpu)
uretprobe-nop         (56 cpus):   42.144 ± 1.206M/s  (  0.753M/s/cpu)
uretprobe-nop         (64 cpus):   42.656 ± 1.104M/s  (  0.666M/s/cpu)
uretprobe-nop         (72 cpus):   46.299 ± 1.443M/s  (  0.643M/s/cpu)
uretprobe-nop         (80 cpus):   46.469 ± 0.808M/s  (  0.581M/s/cpu)

uprobe-ret            ( 1 cpus):    1.219 ± 0.008M/s  (  1.219M/s/cpu)
uprobe-ret            ( 2 cpus):    1.862 ± 0.008M/s  (  0.931M/s/cpu)
uprobe-ret            ( 3 cpus):    2.874 ± 0.005M/s  (  0.958M/s/cpu)
uprobe-ret            ( 4 cpus):    3.512 ± 0.002M/s  (  0.878M/s/cpu)
uprobe-ret            ( 5 cpus):    3.549 ± 0.001M/s  (  0.710M/s/cpu)
uprobe-ret            ( 6 cpus):    3.425 ± 0.003M/s  (  0.571M/s/cpu)
uprobe-ret            ( 7 cpus):    3.551 ± 0.009M/s  (  0.507M/s/cpu)
uprobe-ret            ( 8 cpus):    3.050 ± 0.002M/s  (  0.381M/s/cpu)
uprobe-ret            (10 cpus):    2.706 ± 0.002M/s  (  0.271M/s/cpu)
uprobe-ret            (12 cpus):    2.588 ± 0.003M/s  (  0.216M/s/cpu)
uprobe-ret            (14 cpus):    2.589 ± 0.003M/s  (  0.185M/s/cpu)
uprobe-ret            (16 cpus):    2.575 ± 0.001M/s  (  0.161M/s/cpu)
uprobe-ret            (24 cpus):    1.808 ± 0.011M/s  (  0.075M/s/cpu)
uprobe-ret            (32 cpus):    1.853 ± 0.001M/s  (  0.058M/s/cpu)
uprobe-ret            (40 cpus):    1.952 ± 0.002M/s  (  0.049M/s/cpu)
uprobe-ret            (48 cpus):    2.075 ± 0.007M/s  (  0.043M/s/cpu)
uprobe-ret            (56 cpus):    2.441 ± 0.004M/s  (  0.044M/s/cpu)
uprobe-ret            (64 cpus):    1.880 ± 0.012M/s  (  0.029M/s/cpu)
uprobe-ret            (72 cpus):    0.962 ± 0.002M/s  (  0.013M/s/cpu)
uprobe-ret            (80 cpus):    1.040 ± 0.011M/s  (  0.013M/s/cpu)

uretprobe-ret         ( 1 cpus):    0.981 ± 0.000M/s  (  0.981M/s/cpu)
uretprobe-ret         ( 2 cpus):    1.421 ± 0.001M/s  (  0.711M/s/cpu)
uretprobe-ret         ( 3 cpus):    2.050 ± 0.003M/s  (  0.683M/s/cpu)
uretprobe-ret         ( 4 cpus):    2.596 ± 0.002M/s  (  0.649M/s/cpu)
uretprobe-ret         ( 5 cpus):    3.105 ± 0.003M/s  (  0.621M/s/cpu)
uretprobe-ret         ( 6 cpus):    3.886 ± 0.002M/s  (  0.648M/s/cpu)
uretprobe-ret         ( 7 cpus):    3.016 ± 0.001M/s  (  0.431M/s/cpu)
uretprobe-ret         ( 8 cpus):    2.903 ± 0.000M/s  (  0.363M/s/cpu)
uretprobe-ret         (10 cpus):    2.755 ± 0.001M/s  (  0.276M/s/cpu)
uretprobe-ret         (12 cpus):    2.400 ± 0.001M/s  (  0.200M/s/cpu)
uretprobe-ret         (14 cpus):    3.972 ± 0.001M/s  (  0.284M/s/cpu)
uretprobe-ret         (16 cpus):    3.940 ± 0.003M/s  (  0.246M/s/cpu)
uretprobe-ret         (24 cpus):    3.002 ± 0.003M/s  (  0.125M/s/cpu)
uretprobe-ret         (32 cpus):    3.018 ± 0.003M/s  (  0.094M/s/cpu)
uretprobe-ret         (40 cpus):    1.846 ± 0.000M/s  (  0.046M/s/cpu)
uretprobe-ret         (48 cpus):    2.487 ± 0.004M/s  (  0.052M/s/cpu)
uretprobe-ret         (56 cpus):    2.470 ± 0.006M/s  (  0.044M/s/cpu)
uretprobe-ret         (64 cpus):    2.027 ± 0.014M/s  (  0.032M/s/cpu)
uretprobe-ret         (72 cpus):    1.108 ± 0.011M/s  (  0.015M/s/cpu)
uretprobe-ret         (80 cpus):    0.982 ± 0.005M/s  (  0.012M/s/cpu)
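
The per-CPU column in the tables above is just the total throughput divided by
the CPU count, and comparing it against the 1-CPU figure gives a rough scaling
efficiency. A minimal C sketch of that arithmetic follows. The helper names are
made up for illustration and are not part of the benchmark tool.

```c
/* Per-CPU throughput in M/s: total throughput divided by the CPU count. */
static double per_cpu_mps(double total_mps, int cpus)
{
	return total_mps / cpus;
}

/*
 * Scaling efficiency relative to the single-CPU run.
 * 1.0 means perfectly linear scaling; lower means per-CPU throughput dropped.
 */
static double scaling_efficiency(double total_mps, int cpus, double one_cpu_mps)
{
	return per_cpu_mps(total_mps, cpus) / one_cpu_mps;
}
```

With the numbers quoted above, scaling_efficiency(143.918, 80, 3.565) comes out
at roughly 0.50 for uprobe-nop on 80 CPUs, while
scaling_efficiency(1.040, 80, 1.219) is roughly 0.01 for uprobe-ret.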


-ret variants (single-stepping case for x86-64) still suck, but they
suck 2x less now with your patches :) Clearly more work ahead for
those, though.

Quick profiling shows that it's mostly xol_take_insn_slot() and
xol_free_insn_slot(), now. So it seems like your planned work might
help here.

Andrii, I'm glad we've reached a similar result. The profiling result on
my machine shows that about 80% of cycles are spent on the atomic operations
on area->bitmap and area->slot_count. I suspect the atomic accesses cause
intensive cacheline bouncing between CPUs.
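
To make the contention point concrete, here is a simplified userspace sketch of
a bitmap-plus-counter slot allocator written with C11 atomics. It is only
loosely modeled on the idea of area->bitmap and area->slot_count. The struct,
the toy_* names and the 64-slot size are assumptions for illustration, not the
kernel code. The thing to notice is that every take and every free performs
read-modify-write operations on the same two shared words, so with many CPUs
those cachelines keep moving between cores.

```c
#include <stdatomic.h>
#include <stdint.h>

#define TOY_NR_SLOTS 64	/* assumption: one 64-bit word of slots */

/* Hypothetical stand-in for the shared XOL area; not the kernel struct. */
struct toy_xol_area {
	_Atomic uint64_t bitmap;	/* bit n set => slot n is in use */
	atomic_int slot_count;		/* number of slots in use */
};

/* Claim the first free slot; returns its index, or -1 if all are busy. */
static int toy_take_slot(struct toy_xol_area *area)
{
	for (;;) {
		uint64_t cur = atomic_load(&area->bitmap);
		uint64_t bit;
		int nr;

		/* Find the first zero bit in our snapshot of the bitmap. */
		for (nr = 0; nr < TOY_NR_SLOTS; nr++)
			if (!(cur & (UINT64_C(1) << nr)))
				break;
		if (nr == TOY_NR_SLOTS)
			return -1;	/* a real implementation would wait here */

		/* Atomic test-and-set: first RMW on shared state. */
		bit = UINT64_C(1) << nr;
		if (!(atomic_fetch_or(&area->bitmap, bit) & bit)) {
			/* Second RMW on shared state. */
			atomic_fetch_add(&area->slot_count, 1);
			return nr;
		}
		/* Another thread won the race for this bit; retry. */
	}
}

/* Release a slot: two more RMWs on the same shared words. */
static void toy_free_slot(struct toy_xol_area *area, int nr)
{
	atomic_fetch_and(&area->bitmap, ~(UINT64_C(1) << nr));
	atomic_fetch_sub(&area->slot_count, 1);
}
```

In this model a single probe hit that takes and then frees a slot costs four
atomic read-modify-write operations on shared memory, which is the kind of
traffic the profile above points at.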

Over the past weekend, I have been working on another patch that optimizes
xol_take_insn_slot() and xol_free_insn_slot() for better scalability.
It delays the freeing of xol insn slots to reduce the number of atomic
operations and the cacheline bouncing. It also adds per-task refcounts
and RCU-style management of the linked list of allocated insn slots. In
short, the patch tries to replace coarse-grained atomic variables with
finer-grained ones, aiming to eliminate the expensive atomic instructions
in the hot path. If you or others have the bandwidth and interest, I'd welcome
a brainstorming session on this topic.
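
One possible reading of the "delayed free" idea, sketched in userspace C: each
task keeps the slot it took and hands it back only once, so repeated hits do
not touch the shared bitmap at all. This is not the patch described above. The
structs, the toy_* names and the shared_ops counter are hypothetical, and the
refcounting, RCU handling and slot reclaim under pressure are left out
entirely.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared area; shared_ops is instrumentation for this demo only. */
struct toy_area {
	_Atomic uint64_t bitmap;	/* bit n set => slot n is in use */
	unsigned long shared_ops;	/* RMWs issued on the shared bitmap
					 * (plain counter, single-threaded demo) */
};

/* Hypothetical per-task state: the slot this task currently holds, or -1. */
struct toy_task {
	int cached_slot;
};

/* Slow path: claim a free slot from the shared bitmap. */
static int toy_area_take(struct toy_area *area)
{
	for (int nr = 0; nr < 64; nr++) {
		uint64_t bit = UINT64_C(1) << nr;

		if (atomic_load(&area->bitmap) & bit)
			continue;
		area->shared_ops++;
		if (!(atomic_fetch_or(&area->bitmap, bit) & bit))
			return nr;
	}
	return -1;
}

/* Hot path: reuse the task's slot; touch shared state only on first use. */
static int toy_task_get_slot(struct toy_task *task, struct toy_area *area)
{
	if (task->cached_slot < 0)
		task->cached_slot = toy_area_take(area);
	return task->cached_slot;
}

/* Deferred free: give the slot back once, when the task is done with it. */
static void toy_task_release(struct toy_task *task, struct toy_area *area)
{
	if (task->cached_slot < 0)
		return;
	area->shared_ops++;
	atomic_fetch_and(&area->bitmap, ~(UINT64_C(1) << task->cached_slot));
	task->cached_slot = -1;
}
```

In this toy model a task that hits a probe 1000 times performs one shared
operation on the first hit and one on release, instead of paying for a take and
a free on every hit. The trade-off is that slots stay occupied longer, so a real
design would need a way to reclaim them.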

Thanks.

  [0] https://github.com/anakryiko/linux/commits/uprobes-lockless-cumulative/

Thanks.

But also see some remarks below.

Thanks.

On 2024/7/27 17:44, Liao Chang wrote:

Profiling the single-threaded selftests bench reveals performance
bottlenecks in find_uprobe() and caches_clean_inval_pou() on ARM64. On my
local test machine, find_uprobe() consumes 5% of CPU time for
trig-uprobe-ret, while caches_clean_inval_pou() takes about 34% of CPU
time for trig-uprobe-nop and trig-uprobe-push.

This patch introduces struct uprobe_breakpoint to track the previously
allocated insn_slot for frequently hit uprobes. It reduces the need for
redundant insn_slot writes and the subsequent expensive cache flushes,
especially on architectures like ARM64. This patch has been tested on
Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores @ 2.4GHz. The selftest bench
and Redis GET/SET benchmark results below show an obvious performance
gain.

before-opt
----------
trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)

I'm surprised that the nop and push variants are much slower than the ret
variant. This is the exact opposite of x86-64. Do you have an
explanation why this might be happening? I see you are trying to
optimize xol_get_insn_slot(), but that is (at least for x86) a slow
variant of uprobe that normally shouldn't be used. Typically uprobe is
installed on nop (for USDT) and on function entry (which would be push
variant, `push %rbp` instruction).

ret variant, for x86-64, causes one extra step to go back to user
space to execute original instruction out-of-line, and then trapping
back to kernel for running uprobe. Which is what you normally want to
avoid.

What I'm getting at here is that the arm64 arch seems to be missing fast
emulated implementations for nop/push, or whatever the ARM64 equivalents
are. Please take a look at that, see why those are slow, and check
whether you can turn them into fast uprobe cases.

I will spend the weekend figuring out the questions you raised. Thanks for
pointing them out.

trig-uretprobe-nop:  0.331 ± 0.004M/s (0.331M/prod)
trig-uretprobe-push: 0.333 ± 0.000M/s (0.333M/prod)
trig-uretprobe-ret:  0.854 ± 0.002M/s (0.854M/prod)
Redis SET (RPS) uprobe: 42728.52
Redis GET (RPS) uprobe: 43640.18
Redis SET (RPS) uretprobe: 40624.54
Redis GET (RPS) uretprobe: 41180.56

after-opt
---------
trig-uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
trig-uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
trig-uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
trig-uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
trig-uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
trig-uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)
Redis SET (RPS) uprobe: 43939.69
Redis GET (RPS) uprobe: 45200.80
Redis SET (RPS) uretprobe: 41658.58
Redis GET (RPS) uretprobe: 42805.80
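
For readers who want the before/after numbers as ratios, the arithmetic is
simple enough to sketch in C. The helper names are made up for illustration.

```c
/* Speedup factor: throughput after the change divided by throughput before. */
static double speedup(double before, double after)
{
	return after / before;
}

/* Relative gain in percent, e.g. 2.5 means "2.5% faster". */
static double gain_percent(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Using the figures above, speedup(0.371, 0.916) is about 2.47 for
trig-uprobe-nop, and gain_percent(42728.52, 43939.69) is about 2.8 for Redis
SET with a uprobe attached.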

While some uprobes might still need to share the same insn_slot, this
patch first compares the instructions in the reused insn_slot with the
instructions to be executed out-of-line, and then decides whether to
allocate a new slot.
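
The compare-before-reuse step can be illustrated with a small userspace model.
The diff itself is elided in this message, so everything below is an
assumption: struct toy_slot, toy_slot_prepare(), the 16-byte slot size and the
flush counter are hypothetical and only show the idea of skipping the write and
the cache flush when the slot already holds the same instruction bytes.

```c
#include <stddef.h>
#include <string.h>

#define TOY_SLOT_BYTES 16	/* assumption: slot size for this sketch */

/* Hypothetical model of one XOL slot plus a counter of simulated flushes. */
struct toy_slot {
	unsigned char insn[TOY_SLOT_BYTES];
	size_t len;		/* 0 means the slot was never written */
	unsigned long flushes;	/* stands in for the expensive cache flush */
};

/*
 * Make sure the slot holds the given instruction bytes.
 * Returns 0 if the slot already matched (no write, no flush),
 * 1 if it was rewritten, or -1 if the instruction does not fit.
 */
static int toy_slot_prepare(struct toy_slot *slot, const void *insn, size_t len)
{
	if (len == 0 || len > TOY_SLOT_BYTES)
		return -1;

	/* Same bytes as last time: reuse the slot as-is. */
	if (slot->len == len && memcmp(slot->insn, insn, len) == 0)
		return 0;

	/* Different instruction: rewrite the slot and pay for one flush. */
	memcpy(slot->insn, insn, len);
	slot->len = len;
	slot->flushes++;
	return 1;
}
```

A uprobe that is hit repeatedly with the same instruction pays for the write
and flush once in this model. Only a different instruction landing in the same
slot triggers another rewrite.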

Additionally, this patch uses an rbtree associated with each thread that
hits uprobes to manage the allocated uprobe_breakpoint data. Because the
rbtree of uprobe_breakpoints has smaller nodes, better locality and less
contention, lookups are faster than with find_uprobe().

The rest of this patch is the necessary memory management for
uprobe_breakpoint data. A uprobe_breakpoint is allocated for each newly
hit uprobe that doesn't already have a corresponding node in the rbtree.
All uprobe_breakpoints are freed when the thread exits.

Signed-off-by: Liao Chang <redacted>
---
 include/linux/uprobes.h |   3 +
 kernel/events/uprobes.c | 246 +++++++++++++++++++++++++++++++++-------
 2 files changed, 211 insertions(+), 38 deletions(-)

[...]

-- 
BR
Liao, Chang
