Re: [RFC PATCH] riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
From: Björn Töpel <bjorn@kernel.org>
Date: 2024-03-14 15:07:37
Also in:
linux-riscv, lkml
Puranjay Mohan writes:

> Björn Töpel writes:
>
>> Hmm, depending on RISC-V's CMODX path, the pros/cons of CALL_OPS vs
>> dynamic trampolines change quite a bit.
>>
>> The more I look at the pains of patching two instructions ("split
>> immediates"), the better "patch data" + one insn patching looks.
>
> I was looking at how dynamic trampolines would be implemented for RISC-V.
>
> With CALL_OPS we need to patch the auipc+jalr at function entry only; the
> ops pointer above the function can be patched atomically.
>
> With a dynamic trampoline we need an auipc+jalr pair at function entry to
> jump to the trampoline, and then another auipc+jalr pair to jump from the
> trampoline to ops->func. When ops->func is modified, we would need to
> update the auipc+jalr in the trampoline.
>
> So, I am not sure how to move forward here: CALL_OPS or dynamic
> trampolines?
Yeah. Honestly, we need to figure out the patching story prior to
choosing the path, so let's start there.

After reading Mark's reply, and discussing with the OpenJDK folks (who do
the most crazy text patching on all platforms), having to patch multiple
instructions (where the address materialization is split over multiple
instructions) is a no-go. It's just too big a can of worms. So, if we can
only patch one insn, it's CALL_OPS.

A couple of options (in addition to Andy's), all of which require a
per-function landing address à la CALL_OPS, tweaking what Mark is doing on
Arm (given the poor branch range). ...and maybe we'll get RISC-V
rainbows/unicorns in the future getting better reach (full 64b! ;-)).

A) Use auipc/jalr, and only patch the jalr to take us to a common
   dispatcher/trampoline:

  | <func_trace_target_data_8B> # probably on a data cache-line != func .text to avoid ping-pong
  | ...
  | func:
  |   ...make sure ra isn't messed up...
  |   auipc
  |   nop <=> jalr # Text patch point -> common_dispatch
  |   ACTUAL_FUNC
  |
  | common_dispatch:
  |   load <func_trace_target_data_8B> based on ra
  |   jalr
  |   ...

   The auipc is never touched, and will be overhead. Also, we need a mv to
   store ra in a scratch register as well, like Arm. We'll have two insns
   of per-caller overhead for a disabled caller.

B) Use jal, which can only take us +/-1M, and requires multiple dispatchers
   (and tracking which one to use, and properly distributing them. Ick.):

  | <func_trace_target_data_8B> # probably on a data cache-line != func .text to avoid ping-pong
  | ...
  | func:
  |   ...make sure ra isn't messed up...
  |   nop <=> jal # Text patch point -> within_1M_to_func_dispatch
  |   ACTUAL_FUNC
  |
  | within_1M_to_func_dispatch:
  |   load <func_trace_target_data_8B> based on ra
  |   jalr

C) Use jal, which can only take us +/-1M, and use a per-function
   trampoline, i.e., one dispatcher per function. Blows up text size A LOT:

  | <func_trace_target_data_8B> # somewhere, but probably on a different cacheline than the .text to avoid ping-pongs
  | ...
  | per_func_dispatch:
  |   load <func_trace_target_data_8B> based on ra
  |   jalr
  | func:
  |   ...make sure ra isn't messed up...
  |   nop <=> jal # Text patch point -> per_func_dispatch
  |   ACTUAL_FUNC

It's a bit sad that we'll always have to have a dispatcher/trampoline, but
it's still better than stop_machine(). (And we'll need a fence.i IPI as
well, but only one. ;-))

Today, I'm leaning towards A (which is what Mark suggested, and also
Robbin). Any other options?

[Now how do we implement OPTPROBES? I'm kidding. ;-)]

Björn