Re: [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes
From: Jiri Olsa <hidden>
Date: 2024-11-16 21:44:56
Also in:
bpf, lkml
On Thu, Nov 14, 2024 at 03:44:20PM -0800, Andrii Nakryiko wrote:
On Tue, Nov 5, 2024 at 5:35 AM Jiri Olsa [off-list ref] wrote:quoted
Putting together all the previously added pieces to support optimized uprobes on top of 5-byte nop instruction. The current uprobe execution goes through following: - installs breakpoint instruction over original instruction - exception handler hit and calls related uprobe consumers - and either simulates original instruction or does out of line single step execution of it - returns to user space The optimized uprobe path - checks the original instruction is 5-byte nop (plus other checks) - adds (or uses existing) user space trampoline and overwrites original instruction (5-byte nop) with call to user space trampoline - the user space trampoline executes uprobe syscall that calls related uprobe consumers - trampoline returns back to next instruction This approach won't speed up all uprobes as it's limited to using nop5 as original instruction, but we could use nop5 as USDT probe instruction (which uses single byte nop ATM) and speed up the USDT probes.As discussed offline, it's not as simple as just replacing nop1 with nop5 in USDT definition due to performance considerations on old kernels (nop5 isn't fast as far as uprobe is concerned), but I think we'll be able to accommodate transparent "nop1 or nop5" behavior in user space, we'll just need a careful backwards compatible extension to USDT definition. BTW, do you plan to send an optimization for nop5 to avoid single-stepping them? Ideally we'd just handle any-sized nop, so we don't have to do this "nop1 or nop5" acrobatics in the future.
I'll prepare that, but I'd like to check on the alternative calls you suggested first
quoted
This patch overloads related arch functions in uprobe_write_opcode and set_orig_insn so they can install call instruction if needed. The arch_uprobe_optimize triggers the uprobe optimization and is called after first uprobe hit. I originally had it called on uprobe installation but then it clashed with elf loader, because the user space trampoline was added in a place where loader might need to put elf segments, so I decided to do it after first uprobe hit when loading is done.fun... ideally we wouldn't do this lazily, I just came up with another possible idea, but let's keep all this discussion to another thread with Peterquoted
TODO release uprobe trampoline when it's no longer needed.. we might need to stop all cpus to make sure no user space thread is in the trampoline.. or we might just keep it, because there's just one 4GB memory region?4KB not 4GB, right? Yeah, probably leaving it until process exists is totally fine.
yep, ok SNIP
quoted
+#include <asm/nops.h> /* Post-execution fixups. */@@ -877,6 +878,33 @@ static const struct uprobe_xol_ops push_xol_ops = { .emulate = push_emulate_op, }; +static int is_nop5_insns(uprobe_opcode_t *insn)insns -> insn?quoted
+{ + return !memcmp(insn, x86_nops[5], 5); +} + +static int is_call_insns(uprobe_opcode_t *insn)ditto, insn, singular?
ok SNIP
quoted
+int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr, + uprobe_opcode_t *new_opcode, void *opt) +{ + if (opt) { + uprobe_opcode_t old_opcode[5]; + bool is_call; + + uprobe_copy_from_page(page, vaddr, (uprobe_opcode_t *) &old_opcode, 5); + is_call = is_call_insns((uprobe_opcode_t *) &old_opcode); + + if (is_call_insns(new_opcode)) { + if (is_call) /* register: already installed? */probably should check that the destination of the call instruction is what we expect?
ok SNIP
quoted
+bool arch_uprobe_is_callable(unsigned long vtramp, unsigned long vaddr) +{ + unsigned long delta; + + /* call instructions size */ + vaddr += 5; + delta = vaddr < vtramp ? vtramp - vaddr : vaddr - vtramp; + return delta < 0xffffffff;isn't immediate a sign extended 32-bit value (that is, int)? wouldn't this work and be correct: long delta = (long)(vaddr + 5 - vtramp); return delta >= INT_MIN && delta <= INT_MAX; ?
ah, right.. should be sign value :-\ thanks jirka