Re: [PATCH] uprobes: Optimize the allocation of insn_slot for performance

From: Liao, Chang <hidden>
Date: 2024-08-21 08:46:52
Also in: bpf, linux-perf-users, lkml


在 2024/8/16 0:53, Andrii Nakryiko 写道:

On Wed, Aug 14, 2024 at 7:58 PM Liao, Chang [off-list ref] wrote:

quoted



在 2024/8/15 2:42, Andrii Nakryiko 写道:

quoted

On Tue, Aug 13, 2024 at 9:17 PM Liao, Chang [off-list ref] wrote:

quoted



在 2024/8/13 1:49, Andrii Nakryiko 写道:

quoted

On Mon, Aug 12, 2024 at 4:11 AM Liao, Chang [off-list ref] wrote:

quoted



在 2024/8/9 2:26, Andrii Nakryiko 写道:

quoted

On Thu, Aug 8, 2024 at 1:45 AM Liao, Chang [off-list ref] wrote:

quoted

Hi Andrii and Oleg.

This patch sent by me two weeks ago also aim to optimize the performance of uprobe
on arm64. I notice recent discussions on the performance and scalability of uprobes
within the mailing list. Considering this interest, I've added you and other relevant
maintainers to the CC list for broader visibility and potential collaboration.

Hi Liao,

As you can see there is an active work to improve uprobes, that
changes lifetime management of uprobes, removes a bunch of locks taken
in the uprobe/uretprobe hot path, etc. It would be nice if you can
hold off a bit with your changes until all that lands. And then
re-benchmark, as costs might shift.

But also see some remarks below.

quoted

Thanks.

在 2024/7/27 17:44, Liao Chang 写道:

quoted

The profiling result of single-thread model of selftests bench reveals
performance bottlenecks in find_uprobe() and caches_clean_inval_pou() on
ARM64. On my local testing machine, 5% of CPU time is consumed by
find_uprobe() for trig-uprobe-ret, while caches_clean_inval_pou() take
about 34% of CPU time for trig-uprobe-nop and trig-uprobe-push.

This patch introduce struct uprobe_breakpoint to track previously
allocated insn_slot for frequently hit uprobe. it effectively reduce the
need for redundant insn_slot writes and subsequent expensive cache
flush, especially on architecture like ARM64. This patch has been tested
on Kunpeng916 (Hi1616), 4 NUMA nodes, 64 cores@ 2.4GHz. The selftest
bench and Redis GET/SET benchmark result below reveal obivious
performance gain.

before-opt
----------
trig-uprobe-nop:  0.371 ± 0.001M/s (0.371M/prod)
trig-uprobe-push: 0.370 ± 0.001M/s (0.370M/prod)
trig-uprobe-ret:  1.637 ± 0.001M/s (1.647M/prod)

I'm surprised that nop and push variants are much slower than ret
variant. This is exactly opposite on x86-64. Do you have an
explanation why this might be happening? I see you are trying to
optimize xol_get_insn_slot(), but that is (at least for x86) a slow
variant of uprobe that normally shouldn't be used. Typically uprobe is
installed on nop (for USDT) and on function entry (which would be push
variant, `push %rbp` instruction).

ret variant, for x86-64, causes one extra step to go back to user
space to execute original instruction out-of-line, and then trapping
back to kernel for running uprobe. Which is what you normally want to
avoid.

What I'm getting at here. It seems like maybe arm arch is missing fast
emulated implementations for nops/push or whatever equivalents for
ARM64 that is. Please take a look at that and see why those are slow
and whether you can make those into fast uprobe cases?

Hi Andrii,

As you correctly pointed out, the benchmark result on Arm64 is counterintuitive
compared to X86 behavior. My investigation revealed that the root cause lies in
the arch_uprobe_analyse_insn(), which excludes the Arm64 equvialents instructions
of 'nop' and 'push' from the emulatable instruction list. This forces the kernel
to handle these instructions out-of-line in userspace upon breakpoint exception
is handled, leading to a significant performance overhead compared to 'ret' variant,
which is already emulated.

To address this issue, I've developed a patch supports  the emulation of 'nop' and
'push' variants. The benchmark results below indicates the performance gain of
emulation is obivious.

xol (1 cpus)
------------
uprobe-nop:  0.916 ± 0.001M/s (0.916M/prod)
uprobe-push: 0.908 ± 0.001M/s (0.908M/prod)
uprobe-ret:  1.855 ± 0.000M/s (1.855M/prod)
uretprobe-nop:  0.640 ± 0.000M/s (0.640M/prod)
uretprobe-push: 0.633 ± 0.001M/s (0.633M/prod)
uretprobe-ret:  0.978 ± 0.003M/s (0.978M/prod)

emulation (1 cpus)
-------------------
uprobe-nop:  1.862 ± 0.002M/s  (1.862M/s/cpu)
uprobe-push: 1.743 ± 0.006M/s  (1.743M/s/cpu)
uprobe-ret:  1.840 ± 0.001M/s  (1.840M/s/cpu)
uretprobe-nop:  0.964 ± 0.004M/s  (0.964M/s/cpu)
uretprobe-push: 0.936 ± 0.004M/s  (0.936M/s/cpu)
uretprobe-ret:  0.940 ± 0.001M/s  (0.940M/s/cpu)

As you can see, the performance gap between nop/push and ret variants has been significantly
reduced. Due to the emulation of 'push' instruction need to access userspace memory, it spent
more cycles than the other.

Great, it's an obvious improvement. Are you going to send patches
upstream? Please cc bpf@vger.kernel.org as well.

I'll need more time to thoroughly test this patch. The emulation o push/nop
instructions also impacts the kprobe/kretprobe paths on Arm64, As as result,
I'm working on enhancements to trig-kprobe/kretprobe to prevent performance
regression.

Why would the *benchmarks* have to be modified? The typical
kprobe/kretprobe attachment should be fast, and those benchmarks
simulate typical fast path kprobe/kretprobe. Is there some simulation
logic that is shared between uprobes and kprobes or something?

Yes, kprobe and uprobe share many things for Arm64, but there are curical
difference. Let me explain further. Simulating a 'push' instruction on
arm64 will modify the stack pointer at *probe breakpoint. However, kprobe
and uprobe use different way to restore the stack pointer upon returning
from the breakpoint exception. Consequently.sharing the same simulation
logic for both would result in kernel panic for kprobe.

To avoid complicating the exception return logic, I've opted to simuate
'push' only for uprobe and maintain the single-stepping for kprobe [0].
This trade-off avoid the impacts to kprobe/kretprobe, and no need to
change the kprobe/kretprobe related benchmark.

I see, thanks for explaining. I noticed the "bool kernel" flag you
added, it makes sense.

I still don't understand why you'd need to modify kprobe/kretprobe
benchmarks, as they are testing attaching kprobe at the kernel
function entry, which for kernels should be an optimized case not
requiring any emulation.

Sorry about the confusion. I've revised the implementation of nop/push
emulation to avoid the impacts the kprobe/kretprobe on Arm64, see the
change to arm_probe_decode_insn() in the patch [0]. As a result, no
changes to the kprobe/kretprobe benchmarks.

[0] https://lore.kernel.org/all/20240814080356.2639544-1-liaochang1@huawei.com/ (local)

Thanks.

quoted

[0] https://lore.kernel.org/all/20240814080356.2639544-1-liaochang1@huawei.com/ (local)

quoted


I'm also thinking we should update uprobe/uretprobe benchmarks to be
less x86-specific. Right now "-nop" is the happy fastest case, "-push"
is still happy, slightly slower case (due to the need to emulate stack
operation) and "-ret" is meant to be the slow single-step case. We
should adjust the naming and make sure that on ARM64 we hit similar
code paths. Given you seem to know arm64 pretty well, can you please
take a look at updating bench tool for ARM64 (we can also rename
benchmarks to something a bit more generic, rather than using
instruction names)?

[...]

--
BR
Liao, Chang

-- 
BR
Liao, Chang

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help