Re: [PATCH v3 2/2] tracing/preemptirq: Optimize preempt_disable/enable() tracepoint overhead
From: Wander Lairson Costa <hidden>
Date: 2025-08-01 13:30:23
On Tue, Jul 8, 2025 at 3:54 PM Peter Zijlstra [off-list ref] wrote:
>
> On Tue, Jul 08, 2025 at 09:54:06AM -0300, Wander Lairson Costa wrote:
> > On Mon, Jul 07, 2025 at 01:20:03PM +0200, Peter Zijlstra wrote:
> > > On Fri, Jul 04, 2025 at 02:07:43PM -0300, Wander Lairson Costa wrote:
> > > > Similar to the IRQ tracepoint, the preempt tracepoints are typically
> > > > disabled in production systems due to the significant overhead they
> > > > introduce even when not in use.
> > > >
> > > > The overhead primarily comes from two sources: First, when tracepoints
> > > > are compiled into the kernel, preempt_count_add() and
> > > > preempt_count_sub() become external function calls rather than inlined
> > > > operations. Second, these functions perform unnecessary
> > > > preempt_count() checks even when the tracepoint itself is disabled.
> > > >
> > > > This optimization introduces an early check of the tracepoint static
> > > > key, which allows us to skip both the function call overhead and the
> > > > redundant preempt_count() checks when tracing is disabled. The change
> > > > maintains all existing functionality when tracing is active while
> > > > significantly reducing overhead for the common case where tracing is
> > > > inactive.
> > >
> > > This one in particular I worry about the code gen impact. There are a
> > > *LOT* of preempt_{dis,en}able() sites in the kernel and now they all
> > > get this static branch and call crud on.
> > >
> > > We spend significant effort to make preempt_{dis,en}able() as small as
> > > possible.
> >
> > Thank you for the feedback, it's much appreciated. I just want to make
> > sure I'm on the right track. If I understand your concern correctly, it
> > revolves around the overhead this patch might introduce, specifically to
> > the binary size and its effect on the iCache, when the kernel is built
> > with preempt tracepoints enabled. Is that an accurate summary?
>
> Yes, specifically:
>
> preempt_disable()
>         incl    %gs:__preempt_count
>
> preempt_enable()
>         decl    %gs:__preempt_count
>         jz      do_schedule
> 1:      ...
>
> do_schedule:
>         call    __SCT__preemptible_schedule
>         jmp     1
>
> your proposal adds significantly to this.

Here is a breakdown of the patch's behavior under the different kernel
configurations:

* When DEBUG_PREEMPT is defined, the behavior is identical to the current
  implementation, with calls to preempt_count_add/sub().
* When both DEBUG_PREEMPT and TRACE_PREEMPT_TOGGLE are disabled, the
  generated code is also unchanged.
* The primary change occurs when only TRACE_PREEMPT_TOGGLE is defined. In
  this case, the code uses a static key test instead of a function call.
  As the benchmarks show, this approach is faster when the tracepoints are
  disabled. The main trade-off is that enabling or disabling these
  tracepoints will require the kernel to patch more code locations due to
  the use of static keys.
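
To make the TRACE_PREEMPT_TOGGLE-only case concrete, here is a rough
userspace model of the idea. It is not the patch itself: every name below
(trace_preempt_key_enabled, model_preempt_count, the model_* helpers) is
hypothetical, and a plain bool stands in for the static key, which in the
kernel is a patched jump/nop rather than a memory load. The reschedule
check in preempt_enable() is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the tracepoint static key. */
static bool trace_preempt_key_enabled;

/* Hypothetical stand-in for the per-CPU preempt count. */
static int model_preempt_count;

/* Counts trips through the out-of-line path, for illustration only. */
static int slow_path_calls;

/* Models the out-of-line call, where the tracepoint would be handled. */
static void model_preempt_count_add_slow(int val)
{
	slow_path_calls++;
	model_preempt_count += val;
}

static void model_preempt_count_sub_slow(int val)
{
	slow_path_calls++;
	model_preempt_count -= val;
}

/* Early key test: take the inline path unless tracing is enabled. */
static inline void model_preempt_disable(void)
{
	if (trace_preempt_key_enabled)
		model_preempt_count_add_slow(1);
	else
		model_preempt_count++;
}

static inline void model_preempt_enable(void)
{
	if (trace_preempt_key_enabled)
		model_preempt_count_sub_slow(1);
	else
		model_preempt_count--;
}
```

With the key off, a disable/enable pair only touches the counter inline;
with the key on, both operations go through the out-of-line helpers. The
model also shows the cost under discussion: each call site carries the
extra branch and call in addition to the bare increment or decrement.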