Re: [LSF/MM/BPF TOPIC] faster uprobes
From: Andrii Nakryiko <hidden>
Date: 2024-03-11 23:06:02
On Mon, Mar 11, 2024 at 2:26 PM Jiri Olsa [off-list ref] wrote:
On Mon, Mar 11, 2024 at 10:32:23AM -0700, Andrii Nakryiko wrote:quoted
On Mon, Mar 11, 2024 at 3:59 AM Jiri Olsa [off-list ref] wrote:quoted
On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote:quoted
On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote:quoted
On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa [off-list ref] wrote:quoted
On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote:quoted
On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov [off-list ref] wrote:quoted
On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa [off-list ref] wrote:quoted
On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:quoted
On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa [off-list ref] wrote:quoted
One of uprobe pain points is having slow execution that involves two traps in worst case scenario or single trap if the original instruction can be emulated. For return uprobes there's one extra trap on top of that. My current idea on how to make this faster is to follow the optimized kprobes and replace the normal uprobe trap instruction with jump to user space trampoline that: - executes syscall to call uprobe consumers callbacksDid you get a chance to measure relative performance of syscall vs int3 interrupt handling? If not, do you think you'll be able to get some numbers by the time the conference starts? This should inform the decision whether it even makes sense to go through all the trouble.right, will do thatI believe Yusheng measured syscall vs uprobe performance difference during LPC. iirc it was something like 3x.Do you have a link to slides? Was it actual uprobe vs just some fast syscall (not doing BPF program execution) comparison? Or comparing the performance of int3 handling vs equivalent syscall handling. I suspect it's the former, and so probably not that representative. I'm curious about the performance of going userspace->kernel->userspace through int3 vs syscall (all other things being equal).I have a simple test [1] comparing: - uprobe with 2 traps - uprobe with 1 trap - syscall executing uprobe the syscall takes uprobe address as argument, finds the uprobe and executes its consumers, which should be comparable to what the trampoline will do test does same amount of loops triggering each uprobe type and measures the time it took # ./test_progs -t uprobe_syscall_bench -v bpf_testmod.ko is already unloaded. Loading bpf_testmod.ko... Successfully loaded bpf_testmod.ko. test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec test_bench_1:PASS:uprobe_bench__attach 0 nsec test_bench_1:PASS:uprobe1_cnt 0 nsec test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec test_bench_1:PASS:uprobe2_cnt 0 nsec test_bench_1: uprobes (1 trap) in 36.439s test_bench_1: uprobes (2 trap) in 91.960s test_bench_1: syscalls in 17.872s #395/1 uprobe_syscall_bench/bench_1:OK #395 uprobe_syscall_bench:OK Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED syscall uprobe execution seems to be ~2x faster than 1 trap uprobe and ~5x faster than 2 traps uprobeThanks for running benchmarks! I quickly looked at the selftest and noticed this: +/* + * Assuming following prolog: + * + * 6984ac: 55 push %rbp + * 6984ad: 48 89 e5 mov %rsp,%rbp + */ +noinline void uprobe2_bench_trigger(void) +{ + asm volatile (""); +} This actually will be optimized out to just ret in -O2 mode (make RELEASE=1 for selftests): 00000000005a0ce0 <uprobe2_bench_trigger>: 5a0ce0: c3 retq 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax) 5a0cec: 0f 1f 40 00 nopl (%rax) So be careful with that.right, I did not mean for this to be checked in, just wanted to get the numbers quicklyquoted
Also, I just updated our existing set of uprobe benchmarks (see [0]), do you mind adding your syscall-based one as another one there and running all of them and sharing the numbers with us? Very curious to see both absolute and relative numbers from that benchmark. (and please do build with RELEASE=1) You should be able to just run benchs/run_bench_uprobes.sh (also don't forget to add your syscall-based benchmark to the list of benchmarks in that shell script).yes, saw it and was going to run/compare it.. it's good idea to add the syscall one and get all numbers together, will do thatquoted
Thank you! BTW, while I think patching multiple instructions for syscall-based uprobe is going to be extremely tricky, I think at least u*ret*probe's int3 can be pretty easily optimized away with syscall, given that the kernel controls code generation there. If anything, it will get the uretprobe case a bit closer to the performance of uprobe. Give it some thought.hm, right.. the trampoline is there already, but at the moment is global and used by all uretprobes.. and int3 code moves userspace (changes rip) to the original return address.. maybe we can do that through syscall as wellit seems like good idea, I tried change below (use syscall on return trampoline) and got some speedup: current: base : 15.817 ± 0.009M/s uprobe-nop : 2.901 ± 0.000M/s uprobe-push : 2.743 ± 0.002M/s uprobe-ret : 1.089 ± 0.001M/s uretprobe-nop : 1.448 ± 0.001M/s uretprobe-push : 1.407 ± 0.001M/s uretprobe-ret : 0.792 ± 0.001M/s with syscall: base : 15.831 ± 0.026M/s uprobe-nop : 2.904 ± 0.001M/s uprobe-push : 2.764 ± 0.002M/s uprobe-ret : 1.082 ± 0.001M/s uretprobe-nop : 1.785 ± 0.000M/s uretprobe-push : 1.733 ± 0.001M/s uretprobe-ret : 0.885 ± 0.004M/s ~23% for nop/push (emulated) cases, ~11% for ret (sstep) case jirkaheh, I tried this as well over weekend, though I cut few more corners (see diff below, I didn't add saving/restoring rax, though that would be required, of course). My test machine is (way) slower, though, so I got a slightly different numbers (up to 15%):nice :-) btw I just checked on another slower amd server and it's ~10% in all 3 cases, my previous results are from intel machine.. I guess the hw trap behaviour/speed makes this not proportional across archsquoted
### baseline uprobe-base : 79.462 ± 0.058M/s base : 2.920 ± 0.004M/s uprobe-nop : 1.093 ± 0.001M/s uprobe-push : 1.066 ± 0.001M/s uprobe-ret : 0.480 ± 0.001M/s uretprobe-nop : 0.555 ± 0.000M/s uretprobe-push : 0.549 ± 0.000M/s uretprobe-ret : 0.338 ± 0.000M/s ### uretprobe syscall (vs baseline) uprobe-base : 79.488 ± 0.033M/s base : 2.917 ± 0.003M/s uprobe-nop : 1.095 ± 0.001M/s uprobe-push : 1.058 ± 0.000M/s uprobe-ret : 0.483 ± 0.000M/s uretprobe-nop : 0.638 ± 0.000M/s (+15%) uretprobe-push : 0.627 ± 0.000M/s (+14.2%) uretprobe-ret : 0.366 ± 0.000M/s (+8.3%) Either way, yes, we should implement this. Are you planning to send an official patch some time soon? I'm working on other small improvements in uprobe/uretprobe, I'll probably send the first patches today/tomorrow, but they shouldn't interfere with this uretprobe code path.yes, wanted to finish/post it this week
great, looking forward, we can use these speeds up for uretprobe in our production
SNIPquoted
quoted
---diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 7e8d46f4147f..fa5f8a058bc2 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl@@ -383,6 +383,7 @@ 459 common lsm_get_self_attr sys_lsm_get_self_attr 460 common lsm_set_self_attr sys_lsm_set_self_attr 461 common lsm_list_modules sys_lsm_list_modules +462 64 uprobe sys_uprobewe should call it "uretprobe", "uprobe" will be a separate thing with different logic. I went with generic "trace", but realized that it would be better to have separate more targeted "special/internal" syscalls (where, if necessary, extra arguments would be passed through stack to avoid storing/restoring user-space registers). We have rg_sigreturn precedent which explicitly states that userspace shouldn't use it and shouldn't rely on any specific arguments conventions.somehow I thought of syscalls as of scare resource and wanted to add arguments/commands to the uprobe syscalls.. but having uretprobe dedicated syscall makes things easier
given we are at 462 already, not sure I believe it's scarce :) but it's also a performance aspect, not using any arguments means we can avoid saving/restoring user-space registers, so it makes sense to have 2-3 dedicated uprobe/uretprobe syscalls vs 1 more generic on (IMO), at least from performance POV.
quoted
[...]quoted
/* * Deprecated system calls which are still defined indiff --git a/include/linux/uprobes.h b/include/linux/uprobes.h index f46e0ca0169c..9ef244c8ff19 100644 --- a/include/linux/uprobes.h +++ b/include/linux/uprobes.h@@ -138,6 +138,8 @@ extern bool arch_uretprobe_is_alive(struct return_instance *ret, enum rp_check c extern bool arch_uprobe_ignore(struct arch_uprobe *aup, struct pt_regs *regs); extern void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, void *src, unsigned long len); +extern void uprobe_handle_trampoline(struct pt_regs *regs); +uprobe_opcode_t* arch_uprobe_trampoline(unsigned long *psize);just `void *` here? it can be a sequence of instructions nowhm, it's pointer to u8, which should be fine no? is there benefit to have void* in here instead?
Quick grepping initially brought up `typedef u32 uprobe_opcode_t;`, but that's for non-x86 architectures. I don't think it matters all that much (in my mind it's just a generated code blob, so just `void *` memory, we don't have to look at its contents).
thanks, jirka