Thread (5 messages) 5 messages, 3 authors, 2025-07-02

Re: [PATCH v2] tracing: Fix inconsistency in irq tracking on NMIs

From: Peter Zijlstra <peterz@infradead.org>
Date: 2025-07-01 12:55:06
Also in: linux-arm-kernel, lkml

On Wed, Jun 25, 2025 at 02:08:22PM +0200, Gabriele Monaco wrote:
The irq_enable/irq_disable tracepoints fire only when there's an actual
transition (enabled->disabled and vice versa), this needs special care
in NMIs, as they can potentially start with interrupts already disabled.
The current implementation takes care of this by tracking the lockdep
state on nmi_entry as well as using the variable tracing_irq_cpu to
synchronise with other calls (e.g. local_irq_disable/enable).

This can be racy in case of NMIs when lockdep is enabled, and can lead
to missing events when lockdep is disabled.

Remove dependency on the lockdep status in the NMI common entry/exit
code and adapt the tracing code to make sure that:

- The first call disabling interrupts fires the tracepoint
- The first non-NMI call enabling interrupts fires the tracepoint
- The last NMI call enabling interrupts fires the tracepoint unless
  interrupts were disabled before the NMI
- All other calls don't fire
I'm not at all convinced this is correct. Nor can I understand anything
much about what you wrote above.

quoted hunk ↗ jump to hunk
 arch/arm64/kernel/entry-common.c |  5 ++---
 kernel/entry/common.c            |  5 ++---
 kernel/trace/trace_preemptirq.c  | 12 +++++++-----
 3 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/arch/arm64/kernel/entry-common.c b/arch/arm64/kernel/entry-common.c
index 7c1970b341b8c..7f1844123642e 100644
--- a/arch/arm64/kernel/entry-common.c
+++ b/arch/arm64/kernel/entry-common.c
@@ -213,10 +213,9 @@ static void noinstr arm64_exit_nmi(struct pt_regs *regs)
 	bool restore = regs->lockdep_hardirqs;
 
 	ftrace_nmi_exit();
-	if (restore) {
-		trace_hardirqs_on_prepare();
+	trace_hardirqs_on_prepare();
+	if (restore)
 		lockdep_hardirqs_on_prepare();
-	}
 
 	ct_nmi_exit();
 	lockdep_hardirq_exit();
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index a8dd1f27417cf..e234f264fb495 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -343,10 +343,9 @@ void noinstr irqentry_nmi_exit(struct pt_regs *regs, irqentry_state_t irq_state)
 {
 	instrumentation_begin();
 	ftrace_nmi_exit();
-	if (irq_state.lockdep) {
-		trace_hardirqs_on_prepare();
+	trace_hardirqs_on_prepare();
+	if (irq_state.lockdep)
 		lockdep_hardirqs_on_prepare();
-	}
 	instrumentation_end();
 
 	ct_nmi_exit();
diff --git a/kernel/trace/trace_preemptirq.c b/kernel/trace/trace_preemptirq.c
index 0c42b15c38004..fa45474fc54f1 100644
--- a/kernel/trace/trace_preemptirq.c
+++ b/kernel/trace/trace_preemptirq.c
@@ -58,7 +58,11 @@ static DEFINE_PER_CPU(int, tracing_irq_cpu);
  */
 void trace_hardirqs_on_prepare(void)
 {
-	if (this_cpu_read(tracing_irq_cpu)) {
+	int tracing_count = this_cpu_read(tracing_irq_cpu);
+
+	if (in_nmi() && tracing_count > 1)
+		this_cpu_dec(tracing_irq_cpu);
+	else if (tracing_count) {
 		trace(irq_enable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
 		tracer_hardirqs_on(CALLER_ADDR0, CALLER_ADDR1);
 		this_cpu_write(tracing_irq_cpu, 0);
@@ -89,8 +93,7 @@ NOKPROBE_SYMBOL(trace_hardirqs_on);
  */
 void trace_hardirqs_off_finish(void)
 {
-	if (!this_cpu_read(tracing_irq_cpu)) {
-		this_cpu_write(tracing_irq_cpu, 1);
+	if (this_cpu_inc_return(tracing_irq_cpu) == 1) {
 		tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
 		trace(irq_disable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
 	}
@@ -103,8 +106,7 @@ void trace_hardirqs_off(void)
 {
 	lockdep_hardirqs_off(CALLER_ADDR0);
 
-	if (!this_cpu_read(tracing_irq_cpu)) {
-		this_cpu_write(tracing_irq_cpu, 1);
+	if (this_cpu_inc_return(tracing_irq_cpu) == 1) {
 		tracer_hardirqs_off(CALLER_ADDR0, CALLER_ADDR1);
 		trace(irq_disable, TP_ARGS(CALLER_ADDR0, CALLER_ADDR1));
 	}
So what about lovely things like:

trace_hardirqs_on_prepare()
  if (tracing_irq_cpu) {
    tracer_hardirqs_on();
    <NMI>
      trace_hardirqs_off_finish();
        if (this_cpu_inc_return() == 1) // will be > 1


So now we've traced IRQs are on, start an NMI, and loose the IRQs off
event. Well done.


This was all safe in that it would occasionally emit a duplicate state,
but no state was wrong. Both your attempts have broken things.

How about you fix you tool to accept duplicate state in the face of NMI
instead?
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help