Re: [RFC perf/core 07/11] uprobes/x86: Add support to optimize uprobes
From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Date: 2024-11-18 08:18:13
Also in:
bpf, lkml
On Tue, 5 Nov 2024 14:34:01 +0100 Jiri Olsa [off-list ref] wrote:
quoted hunk ↗ jump to hunk
Putting together all the previously added pieces to support optimized uprobes on top of 5-byte nop instruction. The current uprobe execution goes through following: - installs breakpoint instruction over original instruction - exception handler hit and calls related uprobe consumers - and either simulates original instruction or does out of line single step execution of it - returns to user space The optimized uprobe path - checks the original instruction is 5-byte nop (plus other checks) - adds (or uses existing) user space trampoline and overwrites original instruction (5-byte nop) with call to user space trampoline - the user space trampoline executes uprobe syscall that calls related uprobe consumers - trampoline returns back to next instruction This approach won't speed up all uprobes as it's limited to using nop5 as original instruction, but we could use nop5 as USDT probe instruction (which uses single byte nop ATM) and speed up the USDT probes. This patch overloads related arch functions in uprobe_write_opcode and set_orig_insn so they can install call instruction if needed. The arch_uprobe_optimize triggers the uprobe optimization and is called after first uprobe hit. I originally had it called on uprobe installation but then it clashed with elf loader, because the user space trampoline was added in a place where loader might need to put elf segments, so I decided to do it after first uprobe hit when loading is done. TODO release uprobe trampoline when it's no longer needed.. we might need to stop all cpus to make sure no user space thread is in the trampoline.. or we might just keep it, because there's just one 4GB memory region? Signed-off-by: Jiri Olsa <jolsa@kernel.org> --- arch/x86/include/asm/uprobes.h | 7 ++ arch/x86/kernel/uprobes.c | 130 +++++++++++++++++++++++++++++++++ include/linux/uprobes.h | 1 + kernel/events/uprobes.c | 3 + 4 files changed, 141 insertions(+)diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h index 678fb546f0a7..84a75ed748f0 100644 --- a/arch/x86/include/asm/uprobes.h +++ b/arch/x86/include/asm/uprobes.h@@ -20,6 +20,11 @@ typedef u8 uprobe_opcode_t; #define UPROBE_SWBP_INSN 0xcc #define UPROBE_SWBP_INSN_SIZE 1 +enum { + ARCH_UPROBE_FLAG_CAN_OPTIMIZE = 0, + ARCH_UPROBE_FLAG_OPTIMIZED = 1, +}; + struct uprobe_xol_ops; struct arch_uprobe {@@ -45,6 +50,8 @@ struct arch_uprobe { u8 ilen; } push; }; + + unsigned long flags; }; struct arch_uprobe_task {diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c index 02aa4519b677..50ccf24ff42c 100644 --- a/arch/x86/kernel/uprobes.c +++ b/arch/x86/kernel/uprobes.c@@ -18,6 +18,7 @@ #include <asm/processor.h> #include <asm/insn.h> #include <asm/mmu_context.h> +#include <asm/nops.h> /* Post-execution fixups. */@@ -877,6 +878,33 @@ static const struct uprobe_xol_ops push_xol_ops = { .emulate = push_emulate_op, }; +static int is_nop5_insns(uprobe_opcode_t *insn) +{ + return !memcmp(insn, x86_nops[5], 5);
Maybe better to use BYTES_NOP5 directly?
+}
+
+static int is_call_insns(uprobe_opcode_t *insn)
+{
+ return *insn == 0xe8;0xe8 -> CALL_INSN_OPCODE Thank you, -- Masami Hiramatsu (Google) [off-list ref]