Re: [RFC PATCH 6/9] powerpc/ftrace: Update and move function profile instructions out-of-line
From: Christophe Leroy <hidden>
Date: 2023-12-21 10:46:12
Also in:
lkml
Le 08/12/2023 à 17:30, Naveen N Rao a écrit :
Function profile sequence on powerpc includes two instructions at the
beginning of each function:
mflr r0
bl ftrace_caller
The call to ftrace_caller() gets nop'ed out during kernel boot and is
patched in when ftrace is enabled.
There are two issues with this:
1. The 'mflr r0' instruction at the beginning of each function remains
even though ftrace is not being used.
2. When ftrace is activated, we return from ftrace_caller() with a bctr
instruction to preserve r0 and LR, resulting in the link stack
becoming unbalanced.
To address (1), we have tried to nop'out the 'mflr r0' instruction when
nop'ing out the call to ftrace_caller() and restoring it when enabling
ftrace. But, that required additional synchronization slowing down
ftrace activation. It also left an additional nop instruction at the
beginning of each function and that wasn't desirable on 32-bit powerpc.
Instead of that, move the function profile sequence out-of-line leaving
a single nop at function entry. On ftrace activation, the nop is changed
to an unconditional branch to the out-of-line sequence that in turn
calls ftrace_caller(). This removes the need for complex synchronization
during ftrace activation and simplifies the code. More importantly, this
improves performance of the kernel when ftrace is not in use.
To address (2), change the ftrace trampoline to return with a 'blr'
instruction with the original return address in r0 intact. Then, an
additional 'mtlr r0' instruction in the function profile sequence can
move the correct return address back to LR.
With the above two changes, the function profile sequence now looks like
the following:
[func: # GEP -- 64-bit powerpc, optional
addis r2,r12,imm1
addi r2,r2,imm2]
tramp:
mflr r0
bl ftrace_caller
mtlr r0
b func
nop
[nop] # 64-bit powerpc only
func: # LEP
nop
On 32-bit powerpc, the ftrace mcount trampoline is now completely
outside the function. This is also the case on 64-bit powerpc for
functions that do not need a GEP. However, for functions that need a
GEP, the additional instructions are inserted between the GEP and the
LEP. Since we can only have a fixed number of instructions between GEP
and LEP, we choose to emit 6 instructions. Four of those instructions
are used for the function profile sequence and two instruction slots are
reserved for implementing support for DYNAMIC_FTRACE_WITH_CALL_OPS. On
32-bit powerpc, we emit one additional nop for this purpose resulting in
a total of 5 nops before function entry.
To enable ftrace, the nop at function entry is changed to an
unconditional branch to 'tramp'. The call to ftrace_caller() may be
updated to ftrace_regs_caller() depending on the registered ftrace ops.
On 64-bit powerpc, we additionally change the instruction at 'tramp' to
'mflr r0' from an unconditional branch back to func+4. This is so that
functions entered through the GEP can skip the function profile sequence
unless ftrace is enabled.
With the context_switch microbenchmark on a P9 machine, there is a
performance improvement of ~6% with this patch applied, going from 650k
context switches to 690k context switches without ftrace enabled. With
ftrace enabled, the performance was similar at 86k context switches.Wondering how significant that context_switch micorbenchmark is. I ran it on both mpc885 and mpc8321 and I'm a bit puzzled by some of the results: # ./context_switch --no-fp Using threads with yield on cpus 0/0 touching FP:no altivec:no vector:no vdso:no On 885, I get the following results before and after your patch. CONFIG_FTRACE not selected : 44,9k CONFIG_FTRACE selected, before : 32,8k CONFIG_FTRACE selected, after : 33,6k All this is with CONFIG_INIT_STACK_ALL_ZERO which is the default. But when I select CONFIG_INIT_STACK_NONE, the CONFIG_FTRACE not selected result is only 34,4. On 8321: CONFIG_FTRACE not selected : 100,3k CONFIG_FTRACE selected, before : 72,5k CONFIG_FTRACE selected, after : 116k So the results look odd to me.
The downside of this approach is the increase in vmlinux size, especially on 32-bit powerpc. We now emit 3 additional instructions for each function (excluding the one or two instructions for supporting DYNAMIC_FTRACE_WITH_CALL_OPS). On 64-bit powerpc with the current implementation of -fpatchable-function-entry though, this is not avoidable since we are forced to emit 6 instructions between the GEP and the LEP even if we are to only support DYNAMIC_FTRACE_WITH_CALL_OPS.
The increase is almost 5% on the few 32 bits defconfig I have tested. That's a lot. On 32 bits powerpc, only the e500 has a link stack that could end up being unbalanced. Could we keep the bctr and avoid the mtlr and the jump in the trampoline ? On 8xx a NOP is one cycle, a taken branch is 2 cycles, but the second cycle is a bubble that most of the time gets filled by following operations. On the 83xx, branches and NOPs are supposed to be seamless. So is that out-of-line trampoline really worth it ? Maybe keeping the ftrace instructions at the begining and just replacing the mflr by an jump when ftrace is off would help reduce the ftrace insns by one more instruction.
Signed-off-by: Naveen N Rao <naveen@kernel.org> ---
quoted hunk ↗ jump to hunk
diff --git a/arch/powerpc/include/asm/code-patching.h b/arch/powerpc/include/asm/code-patching.h index 84f6ccd7de7a..9a54bb9e0dde 100644 --- a/arch/powerpc/include/asm/code-patching.h +++ b/arch/powerpc/include/asm/code-patching.h@@ -185,10 +185,21 @@ static inline unsigned long ppc_function_entry(void *func) */ if ((((*insn & OP_RT_RA_MASK) == ADDIS_R2_R12) || ((*insn & OP_RT_RA_MASK) == LIS_R2)) && - ((*(insn+1) & OP_RT_RA_MASK) == ADDI_R2_R2)) + ((*(insn+1) & OP_RT_RA_MASK) == ADDI_R2_R2)) { +#ifdef CONFIG_ARCH_USING_PATCHABLE_FUNCTION_ENTRY
Can you replace by IS_ENABLED() ?
+ /*
+ * Heuristic: look for the 'mtlr r0' instruction assuming ftrace init is done.
+ * If it is not found, look for two consecutive nops after the GEP.
+ * Longer term, we really should be parsing the symbol table to determine LEP.
+ */
+ if ((*(insn+4) == PPC_RAW_MTLR(_R0)) ||
+ ((*(insn+2) == PPC_RAW_NOP() && *(insn+3) == PPC_RAW_NOP())))
+ return (unsigned long)(insn + 8);
+#endif
return (unsigned long)(insn + 2);
- else
+ } else {
return (unsigned long)func;
+ }
#elif defined(CONFIG_PPC64_ELF_ABI_V1)
/*
* On PPC64 ABIv1 the function pointer actually points to thequoted hunk ↗ jump to hunk
diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c index 2956196c98ff..d3b4949142a8 100644 --- a/arch/powerpc/kernel/trace/ftrace.c +++ b/arch/powerpc/kernel/trace/ftrace.c
quoted hunk ↗ jump to hunk
@@ -217,15 +274,62 @@ int ftrace_init_nop(struct module *mod, struct dyn_ftrace *rec) { unsigned long addr, ip = rec->ip; ppc_inst_t old, new; - int ret = 0; + int i, ret = 0; + u32 ftrace_mcount_tramp_insns[] = { +#ifdef CONFIG_PPC64 + PPC_RAW_BRANCH(FTRACE_MCOUNT_TRAMP_OFFSET + MCOUNT_INSN_SIZE), +#else + PPC_RAW_MFLR(_R0), +#endif
Can it be based on IS_ENABLED(CONFIG_PPC64) instead ?
+ PPC_RAW_BL(0), /* bl ftrace_caller */ + PPC_RAW_MTLR(_R0), /* also see update ppc_function_entry() */ + PPC_RAW_BRANCH(FTRACE_MCOUNT_TRAMP_OFFSET - MCOUNT_INSN_SIZE * 2) + }; +