Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()

From: "Paul E. McKenney" <paulmck@kernel.org>
Date: 2025-07-08 20:49:06
Also in: linux-rt-devel, rcu

On Tue, Jul 08, 2025 at 03:40:05PM -0400, Mathieu Desnoyers wrote:

On 2025-07-07 17:56, Paul E. McKenney wrote:

quoted

On Mon, Jun 23, 2025 at 11:13:03AM -0700, Paul E. McKenney wrote:

quoted

On Mon, Jun 23, 2025 at 12:49:41PM +0200, Sebastian Andrzej Siewior wrote:

quoted

On 2025-06-20 04:23:49 [-0700], Paul E. McKenney wrote:

quoted

I hope not because it is not any different from

  	CPU 2			CPU 3
  	=====                   =====
	NMI
  	rcu_read_lock();
  				synchronize_rcu();
  				// need all CPUs report a QS.
  	rcu_read_unlock();
  	// no rcu_read_unlock_special() due to in_nmi().

If the NMI happens while the CPU is in userland (say a perf event) then
the NMI returns directly to userland.
After the tracing event completes (in this case) the CPU should run into
another RCU section on its way out via context switch or the tick
interrupt.
I assume the tick interrupt is what makes the NMI case work.

Are you promising that interrupts will be always be disabled across
the whole rcu_read_lock_notrace() read-side critical section?  If so,
could we please have a lockdep_assert_irqs_disabled() call to check that?

No, that should stay preemptible because bpf can attach itself to
tracepoints and this is the root cause of the exercise. Now if you say
it has to be run with disabled interrupts to match the NMI case then it
makes sense (since NMIs have interrupts off) but I do not understand why
it matters here (since the CPU returns to userland without passing the
kernel).

Given your patch, if you don't disable interrupts in a preemptible kernel
across your rcu_read_lock_notrace()/rcu_read_unlock_notrace() pair, then a
concurrent expedited grace period might send its IPI in the middle of that
critical section.  That IPI handler would set up state so that the next
rcu_preempt_deferred_qs_irqrestore() would report the quiescent state.
Except that without the call to rcu_read_unlock_special(), there might
not be any subsequent call to rcu_preempt_deferred_qs_irqrestore().

This is even more painful if this is a CONFIG_PREEMPT_RT kernel.
Then if that critical section was preempted and then priority-boosted,
the unboosting also won't happen until the next call to that same
rcu_preempt_deferred_qs_irqrestore() function, which again might not
happen.  Or might be significantly delayed.

Or am I missing some trick that fixes all of this?

quoted

I'm not sure how much can be done here due to the notrace part. Assuming
rcu_read_unlock_special() is not doable, would forcing a context switch
(via setting need-resched and irq_work, as the IRQ-off case) do the
trick?
Looking through rcu_preempt_deferred_qs_irqrestore() it does not look to
be "usable from the scheduler (with rq lock held)" due to RCU-boosting
or the wake of expedited_wq (which is one of the requirement).

But if rq_lock is held, then interrupts are disabled, which will
cause the unboosting to be deferred.

Or are the various deferral mechanisms also unusable in this context?

OK, looking back through this thread, it appears that you need both
an rcu_read_lock_notrace() and an rcu_read_unlock_notrace() that are
covered by Mathieu's list of requirements [1]:

I'm just jumping into this email thread, so I'll try to clarify
what I may.

Thank you for jumping in!

quoted

| - NMI-safe
	This is covered by the existing rcu_read_lock() and
	rcu_read_unlock().

OK

quoted

| - notrace
	I am guessing that by "notrace", you mean the "notrace" CPP
	macro attribute defined in include/linux/compiler_types.h.
	This has no fewer than four different definitions, so I will
	need some help understanding what the restrictions are.

When I listed notrace in my desiderata, I had in mind the
preempt_{disable,enable}_notrace macros, which ensure that
those macros don't call instrumentation, and therefore can be
used from the implementation of tracepoint instrumentation or
from a tracer callback.

OK, that makes sense.  I have some questions:

Are interrupts disabled at the calls to rcu_read_unlock_notrace()?

If interrupts are instead enabled, is it OK for an IRQ-work handler that
did an immediate self-interrupt and a bunch of non-notrace work?

If interrupts are to be enabled, but an immediate IRQ-work handler is
ruled out, what mechanism should I be using to defer the work, given
that it cannot be deferred for very long without messing up real-time
latencies?  As in a delay of nanoseconds or maybe a microsecond, but
not multiple microseconds let alone milliseconds?

(I could imagine short-timeout hrtimer, but I am not seeing notrace
variants of these guys, either.)

quoted

| - usable from the scheduler (with rq lock held)
	This is covered by the existing rcu_read_lock() and
	rcu_read_unlock().

OK

quoted

| - usable to trace the RCU implementation
	This one I don't understand.  Can I put tracepoints on
	rcu_read_lock_notrace() and rcu_read_unlock_notrace() or can't I?
	I was assuming that tracepoints would be forbidden.  Until I
	reached this requirement, that is.

At a high level, a rcu_read_{lock,unlock}_notrace should be something
that is safe to call from the tracepoint implementation or from a
tracer callback.

My expectation is that the "normal" (not notrace) RCU APIs would
be instrumented with tracepoints, and tracepoints and tracer callbacks
are allowed to use the _notrace RCU APIs.

This provides instrumentation coverage of RCU with the exception of the
_notrace users, which is pretty much the best we can hope for without
having a circular dependency.

OK, got it, and that does make sense.  I am not sure how I would have
intuited that from the description, but what is life without a challenge?

quoted

One possible path forward is to ensure that rcu_read_unlock_special()
calls only functions that are compatible with the notrace/trace
requirements.  The ones that look like they might need some help are
raise_softirq_irqoff() and irq_work_queue_on().  Note that although
rcu_preempt_deferred_qs_irqrestore() would also need help, it is easy to
avoid its being invoked, for example, by disabing interrupts across the
call to rcu_read_unlock_notrace().  Or by making rcu_read_unlock_notrace()
do the disabling.

However, I could easily be missing something, especially given my being
confused by the juxtaposition of "notrace" and "usable to trace the
RCU implementation".  These appear to me to be contradicting each other.

Help?

You indeed need to ensure that everything that is called from
rcu_{lock,unlock]_notrace don't end up executing instrumentation
to prevent a circular dependency. You hinted at a few ways to achieve
this. Other possible approaches:

- Add a "trace" bool parameter to rcu_read_unlock_special(),
- Duplicate rcu_read_unlock_special() and introduce a notrace symbol.

OK, both of these are reasonable alternatives for the API, but it will
still be necessary to figure out how to make the notrace-incompatible
work happen.

- Keep some nesting count in the task struct to prevent calling the
  instrumentation when nested in notrace,

OK, for this one, is the idea to invoke some TBD RCU API the tracing
exits the notrace region?  I could see that working.  But there would
need to be a guarantee that if the notrace API was invoked, a call to
this TBD RCU API would follow in short order.  And I suspect that
preemption (and probably also interrupts) would need to be disabled
across this region.

There are probably other possible approaches I am missing, each with
their respective trade offs.

I am pretty sure that we also have some ways to go before we have the
requirements fully laid out, for that matter.  ;-)

Could you please tell me where in the current tracing code these
rcu_read_lock_notrace()/rcu_read_unlock_notrace() calls would be placed?

							Thanx, Paul

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help