Thread (5 messages) 5 messages, 3 authors, 2025-07-02

Re: [PATCH v2] tracing: Fix inconsistency in irq tracking on NMIs

From: Thomas Gleixner <hidden>
Date: 2025-07-02 09:39:39
Also in: linux-arm-kernel, lkml

On Wed, Jul 02 2025 at 09:18, Gabriele Monaco wrote:
On Tue, 2025-07-01 at 14:54 +0200, Peter Zijlstra wrote:
I could probably only fix this without even considering NMIs when
interrupts are disabled, but I believe if that happens, the tracepoints
would report something wrong, since using tracing_irq_cpu alone does:

local_irq_disable -> trace_irq_off
  nmi_enter -> no trace
  nmi_exit -> trace_irq_on
  // here interrupts are still off, aren't they?
local_irq_enable -> no trace

The idea that I described poorly, was to use tracing_irq_cpu in a way
that the first context disabling interrupts fires the tracepoint
(current behaviour), but when it's time to enable interrupts, an NMI
which didn't disable interrupts shouldn't trace but let the next
context trace.
...
[1] - https://lore.kernel.org/lkml/87sejup1fe.ffs@tglx (local)
As I told you before:

 "As you correctly described, the two states are asynchronous in the
  context of NMIs, so the obvious thing is to treat them as seperate
  entities so that the trace really contains the off/on events
  independent of lockdep enablement and lockdep's internal state. No?"

So I would have expected that you look at the lockdep implementation and
figure out why that one works correctly. Let's have a look at it:

nmi_entry()
   irq_state.lockdep = lockdep_hardirqs_enabled();

   lockdep_hardirqs_off(CALLER_ADDR0);
   ...
   return irq_state;

nmi_exit(irq_state)

   if (irq_state.lockdep)
	lockdep_hardirqs_on(CALLER_ADDR0);

On NMI entry the internal state of lockdep tracking is recorded and
lockdep_hardirqs_off() handles a redundant off gracefully. As I said it
might be arguable to invoke it only when irq_state.lockdep == true, but
that's a cosmetic or performance issue not a correctness problem.

On NMI exit lockdep_hardirqs_on() is only invoked when the lockdep internal
state at entry was 'enabled' so it can restore the state correctly.

If you model the tracer exactly the same way:

   irq_state.lockdep = lockdep_hardirqs_enabled();
   irq_state.tracer = tracer_hardirqs_enabled();
   ....

   You get the idea...

That will be "correct" in terms of sequencing depending on your trace side
implementation:

  trace_hardirqs_off()
        if (trace_hardirqs_enabled()) {
        	trace_irqs_enabled = false;     // A
                emit_trace(OFF);	        // B

If the NMI hits before A, everything is fine because it will record a
ON->OFF transition on NMI entry and a OFF->ON transition on NMI
exit. Then the interrupted context records a ON->OFF transition again.

If it hits between A and B, then the NMI tracing wont do anything
because enabled state is already false, but the resulting trace
sequencing is messed up:

     irqs on
     ...
     nmi_entry
     nmi_exit
     irqs off

IOW, it misses the ON->OFF OFF->ON transitions in the NMI.

So you want to do it the other way round:

  trace_hardirqs_off()
        if (trace_hardirqs_enabled()) {
                emit_trace(OFF);	        // A
        	trace_irqs_enabled = false;     // B

If the NMI hits before A, everything is fine because it will record a
ON->OFF transition on NMI entry and a OFF->ON transition on NMI
exit. Then the interrupted context records a ON->OFF transition again.

If it hits between A and B then you get a duplicate

     irqs on
     ...
     irqs off
     nmi_entry
     irqs off
     irqs on
     nmi_exit

There is nothing you can do about it, but that's easy enough to filter
out, no?

I'm pretty sure you can figure out how that should be modeled on the
on() side of affairs.

Thanks,

        tglx
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help