Re: [LSF/MM/BPF TOPIC] faster uprobes

From: Andrii Nakryiko <hidden>
Date: 2024-03-11 23:06:02

On Mon, Mar 11, 2024 at 2:26 PM Jiri Olsa [off-list ref] wrote:

On Mon, Mar 11, 2024 at 10:32:23AM -0700, Andrii Nakryiko wrote:

On Mon, Mar 11, 2024 at 3:59 AM Jiri Olsa [off-list ref] wrote:

On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote:

On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote:

On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa [off-list ref] wrote:

On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote:

On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov
[off-list ref] wrote:

On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa [off-list ref] wrote:

On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:

On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa [off-list ref] wrote:

One of the uprobe pain points is slow execution, which involves
two traps in the worst case, or a single trap if the original
instruction can be emulated. For return uprobes there's one extra
trap on top of that.

My current idea on how to make this faster is to follow the optimized
kprobes and replace the normal uprobe trap instruction with a jump to
a user space trampoline that:

  - executes syscall to call uprobe consumers callbacks

Did you get a chance to measure relative performance of syscall vs
int3 interrupt handling? If not, do you think you'll be able to get
some numbers by the time the conference starts? This should inform the
decision whether it even makes sense to go through all the trouble.

right, will do that

I believe Yusheng measured syscall vs uprobe performance
difference during LPC. iirc it was something like 3x.

Do you have a link to slides? Was it actual uprobe vs just some fast
syscall (not doing BPF program execution) comparison? Or comparing the
performance of int3 handling vs equivalent syscall handling.

I suspect it's the former, and so probably not that representative.
I'm curious about the performance of going
userspace->kernel->userspace through int3 vs syscall (all other things
being equal).
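
For the syscall half of that comparison, a rough userspace sketch could
look like the one below. It only times a cheap syscall round trip; it says
nothing about int3 handling, which needs a real uprobe attached to be
measured fairly. The function name syscall_roundtrip_ns and the choice of
getppid as the "do nothing" syscall are assumptions for illustration, not
something taken from the benchmarks in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Average cost in nanoseconds of one userspace->kernel->userspace round
 * trip, measured with a syscall that does almost no work in the kernel.
 * Hypothetical helper, Linux only.
 */
double syscall_roundtrip_ns(unsigned long iterations)
{
	struct timespec start, end;
	unsigned long i;

	assert(iterations > 0);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++)
		syscall(SYS_getppid);	/* raw syscall entry and exit */
	clock_gettime(CLOCK_MONOTONIC, &end);

	/* total elapsed nanoseconds divided by the number of round trips */
	return ((end.tv_sec - start.tv_sec) * 1e9 +
		(end.tv_nsec - start.tv_nsec)) / (double)iterations;
}
```

Calling syscall_roundtrip_ns(1000000) gives a per-call figure that can be
put next to the per-hit cost of a trap-based uprobe on the same machine.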

I have a simple test [1] comparing:
  - uprobe with 2 traps
  - uprobe with 1 trap
  - syscall executing uprobe

the syscall takes the uprobe address as an argument, finds the uprobe and
executes its consumers, which should be comparable to what the trampoline
will do

the test runs the same number of loops triggering each uprobe type and
measures the time it took

  # ./test_progs -t uprobe_syscall_bench -v
  bpf_testmod.ko is already unloaded.
  Loading bpf_testmod.ko...
  Successfully loaded bpf_testmod.ko.
  test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec
  test_bench_1:PASS:uprobe_bench__attach 0 nsec
  test_bench_1:PASS:uprobe1_cnt 0 nsec
  test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec
  test_bench_1:PASS:uprobe2_cnt 0 nsec
  test_bench_1: uprobes (1 trap) in  36.439s
  test_bench_1: uprobes (2 trap) in  91.960s
  test_bench_1: syscalls         in  17.872s
  #395/1   uprobe_syscall_bench/bench_1:OK
  #395     uprobe_syscall_bench:OK
  Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

syscall uprobe execution seems to be ~2x faster than 1 trap uprobe
and ~5x faster than 2 traps uprobe
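
The ratios follow directly from the three timings in the output above.
A minimal sketch of the arithmetic, with speedup being a made-up helper
name:

```c
#include <assert.h>

/*
 * How many times faster the "fast" run is, given wall-clock times for
 * the same number of loop iterations. Hypothetical helper.
 */
double speedup(double slow_seconds, double fast_seconds)
{
	assert(slow_seconds > 0.0 && fast_seconds > 0.0);
	return slow_seconds / fast_seconds;
}
```

speedup(36.439, 17.872) comes out slightly above 2 and
speedup(91.960, 17.872) slightly above 5, matching the ~2x and ~5x above.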

Thanks for running benchmarks! I quickly looked at the selftest and
noticed this:

+/*
+ * Assuming following prolog:
+ *
+ * 6984ac:       55                      push   %rbp
+ * 6984ad:       48 89 e5                mov    %rsp,%rbp
+ */
+noinline void uprobe2_bench_trigger(void)
+{
+        asm volatile ("");
+}

This actually will be optimized out to just ret in -O2 mode (make
RELEASE=1 for selftests):

00000000005a0ce0 <uprobe2_bench_trigger>:
  5a0ce0: c3                            retq
  5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00      nopw    %cs:(%rax,%rax)
  5a0cec: 0f 1f 40 00                   nopl    (%rax)

So be careful with that.
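
One way to keep the probed instructions stable across optimization levels
is to spell them out in assembly instead of relying on the compiler's
prologue. A sketch, assuming x86-64 and a GCC or clang that supports the
naked attribute; uprobe_push_trigger and run_trigger are illustrative
names, not the ones used in the selftests:

```c
#include <assert.h>

#if defined(__x86_64__)
/*
 * Naked function: the compiler emits no prologue/epilogue of its own, so
 * the body is exactly the instructions below at any -O level.
 */
__attribute__((naked, noinline)) void uprobe_push_trigger(void)
{
	__asm__ volatile(
		"push %rbp\n\t"		/* 55 */
		"mov %rsp, %rbp\n\t"	/* 48 89 e5 */
		"pop %rbp\n\t"
		"ret\n\t");
}
#else
/* Fallback for other architectures: no guarantee about the prologue. */
__attribute__((noinline)) void uprobe_push_trigger(void)
{
	__asm__ volatile("");
}
#endif

/* Call the trigger in a loop and return how many calls were made. */
unsigned long run_trigger(unsigned long iterations)
{
	unsigned long i;

	assert(iterations > 0);
	for (i = 0; i < iterations; i++)
		uprobe_push_trigger();
	return i;
}
```

It is still worth checking the result with objdump, since build flags
(for example control-flow protection) can add instructions at the entry.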

right, I did not mean for this to be checked in, just wanted to get the
numbers quickly

Also, I just updated our existing set of uprobe benchmarks (see [0]),
do you mind adding your syscall-based one as another one there and
running all of them and sharing the numbers with us? Very curious to
see both absolute and relative numbers from that benchmark. (and
please do build with RELEASE=1)

You should be able to just run benchs/run_bench_uprobes.sh (also don't
forget to add your syscall-based benchmark to the list of benchmarks
in that shell script).

yes, saw it and was going to run/compare it.. it's a good idea to add
the syscall one and get all numbers together, will do that

Thank you!


BTW, while I think patching multiple instructions for syscall-based
uprobe is going to be extremely tricky, I think at least u*ret*probe's
int3 can be pretty easily optimized away with syscall, given that the
kernel controls code generation there. If anything, it will get the
uretprobe case a bit closer to the performance of uprobe. Give it some
thought.

hm, right.. the trampoline is there already, but at the moment is global
and used by all uretprobes.. and int3 code moves userspace (changes rip)
to the original return address.. maybe we can do that through syscall
as well

it seems like a good idea, I tried the change below (use syscall on the
return trampoline) and got some speedup:

current:
  base           :   15.817 ± 0.009M/s
  uprobe-nop     :    2.901 ± 0.000M/s
  uprobe-push    :    2.743 ± 0.002M/s
  uprobe-ret     :    1.089 ± 0.001M/s
  uretprobe-nop  :    1.448 ± 0.001M/s
  uretprobe-push :    1.407 ± 0.001M/s
  uretprobe-ret  :    0.792 ± 0.001M/s

with syscall:
  base           :   15.831 ± 0.026M/s
  uprobe-nop     :    2.904 ± 0.001M/s
  uprobe-push    :    2.764 ± 0.002M/s
  uprobe-ret     :    1.082 ± 0.001M/s
  uretprobe-nop  :    1.785 ± 0.000M/s
  uretprobe-push :    1.733 ± 0.001M/s
  uretprobe-ret  :    0.885 ± 0.004M/s

~23% for nop/push (emulated) cases, ~11% for ret (sstep) case
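
The throughput figures can also be read as per-hit latency, which makes
the saving easier to picture. A small sketch of both conversions;
ns_per_hit and gain_percent are made-up helper names:

```c
#include <assert.h>

/* Convert throughput in millions of hits per second to ns per hit. */
double ns_per_hit(double mhits_per_sec)
{
	assert(mhits_per_sec > 0.0);
	return 1000.0 / mhits_per_sec;	/* 1 M/s equals 1000 ns per hit */
}

/* Relative throughput gain in percent going from "before" to "after". */
double gain_percent(double before_mps, double after_mps)
{
	assert(before_mps > 0.0);
	return (after_mps - before_mps) / before_mps * 100.0;
}
```

For uretprobe-nop, gain_percent(1.448, 1.785) is a bit over 23, and
ns_per_hit(1.448) - ns_per_hit(1.785) is roughly 130 ns saved per hit on
this machine. For uretprobe-ret, gain_percent(0.792, 0.885) lands between
11 and 12.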

jirka

heh, I tried this as well over the weekend, though I cut a few more corners
(see diff below, I didn't add saving/restoring rax, though that would
be required, of course). My test machine is (way) slower, though, so I
got slightly different numbers (up to 15%):

nice :-) btw I just checked on another slower amd server and it's ~10% in
all 3 cases, my previous results are from intel machine.. I guess the hw
trap behaviour/speed makes this not proportional across archs

### baseline
uprobe-base    :   79.462 ± 0.058M/s
base           :    2.920 ± 0.004M/s
uprobe-nop     :    1.093 ± 0.001M/s
uprobe-push    :    1.066 ± 0.001M/s
uprobe-ret     :    0.480 ± 0.001M/s
uretprobe-nop  :    0.555 ± 0.000M/s
uretprobe-push :    0.549 ± 0.000M/s
uretprobe-ret  :    0.338 ± 0.000M/s


### uretprobe syscall (vs baseline)
uprobe-base    :   79.488 ± 0.033M/s
base           :    2.917 ± 0.003M/s
uprobe-nop     :    1.095 ± 0.001M/s
uprobe-push    :    1.058 ± 0.000M/s
uprobe-ret     :    0.483 ± 0.000M/s
uretprobe-nop  :    0.638 ± 0.000M/s (+15%)
uretprobe-push :    0.627 ± 0.000M/s (+14.2%)
uretprobe-ret  :    0.366 ± 0.000M/s (+8.3%)

Either way, yes, we should implement this. Are you planning to send an
official patch some time soon? I'm working on other small improvements
in uprobe/uretprobe, I'll probably send the first patches
today/tomorrow, but they shouldn't interfere with this uretprobe code
path.

yes, wanted to finish/post it this week

great, looking forward to it, we can use these speedups for uretprobes
in production

SNIP

---

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7e8d46f4147f..fa5f8a058bc2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl

@@ -383,6 +383,7 @@
 459    common  lsm_get_self_attr       sys_lsm_get_self_attr
 460    common  lsm_set_self_attr       sys_lsm_set_self_attr
 461    common  lsm_list_modules        sys_lsm_list_modules
+462    64      uprobe                  sys_uprobe

we should call it "uretprobe", "uprobe" will be a separate thing with
different logic.

I went with generic "trace", but realized that it would be better to
have separate, more targeted "special/internal" syscalls (where, if
necessary, extra arguments would be passed through the stack to avoid
storing/restoring user-space registers). We have the rt_sigreturn
precedent, which explicitly states that userspace shouldn't use it and
shouldn't rely on any specific argument conventions.

somehow I thought of syscalls as a scarce resource and wanted to add
arguments/commands to the uprobe syscalls.. but having a dedicated
uretprobe syscall makes things easier

given we are at 462 already, not sure I believe it's scarce :) but
there's also a performance aspect: not using any arguments means we can
avoid saving/restoring user-space registers, so it makes sense to have
2-3 dedicated uprobe/uretprobe syscalls vs 1 more generic one (IMO), at
least from a performance POV.

[...]

 /*
  * Deprecated system calls which are still defined in

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..9ef244c8ff19 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h

@@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c
 extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs);
 extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
                                         void *src, unsigned long len);
+extern void uprobe_handle_trampoline(struct pt_regs *regs);
+uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize);

just `void *` here? it can be a sequence of instructions now

hm, it's pointer to u8, which should be fine no? is there benefit to
have void* in here instead?

Quick grepping initially brought up `typedef u32 uprobe_opcode_t;`,
but that's for non-x86 architectures. I don't think it matters all
that much (in my mind it's just a generated code blob, so just `void
*` memory, we don't have to look at its contents).

thanks,
jirka
