Thread (53 messages) 53 messages, 7 authors, 2026-01-07

Re: [RFC PATCH 1/2] rcu: Add rcu_read_lock_notrace()

From: "Paul E. McKenney" <paulmck@kernel.org>
Date: 2025-07-17 15:30:18
Also in: bpf, linux-rt-devel, rcu

On Thu, Jul 17, 2025 at 07:57:27AM -0700, Alexei Starovoitov wrote:
On Thu, Jul 17, 2025 at 6:14 AM Mathieu Desnoyers
[off-list ref] wrote:
quoted
On 2025-07-16 18:54, Paul E. McKenney wrote:
quoted
On Wed, Jul 16, 2025 at 01:35:48PM -0700, Paul E. McKenney wrote:
quoted
On Wed, Jul 16, 2025 at 11:09:22AM -0400, Steven Rostedt wrote:
quoted
On Fri, 11 Jul 2025 10:05:26 -0700
"Paul E. McKenney" [off-list ref] wrote:
quoted
This trace point will invoke rcu_read_unlock{,_notrace}(), which will
note that preemption is disabled.  If rcutree.use_softirq is set and
this task is blocking an expedited RCU grace period, it will directly
invoke the non-notrace function raise_softirq_irqoff().  Otherwise,
it will directly invoke the non-notrace function irq_work_queue_on().
Just to clarify some things; A function annotated by "notrace" simply
will not have the ftrace hook to that function, but that function may
very well have tracing triggered inside of it.

Functions with "_notrace" in its name (like preempt_disable_notrace())
should not have any tracing instrumentation (as Mathieu stated)
inside of it, so that it can be used in the tracing infrastructure.

raise_softirq_irqoff() has a tracepoint inside of it. If we have the
tracing infrastructure call that, and we happen to enable that
tracepoint, we will have:

   raise_softirq_irqoff()
      trace_softirq_raise()
        [..]
          raise_softirq_irqoff()
             trace_softirq_raise()
                [..]
                  Ad infinitum!

I'm not sure if that's what is being proposed or not, but I just wanted
to make sure everyone is aware of the above.
OK, I *think* I might actually understand the problem.  Maybe.

I am sure that the usual suspects will not be shy about correcting any
misapprehensions in the following.  ;-)

My guess is that some users of real-time Linux would like to use BPF
programs while still getting decent latencies out of their systems.
(Not something I would have predicted, but then again, I was surprised
some years back to see people with a 4096-CPU system complaining about
200-microsecond latency blows from RCU.)  And the BPF guys (now CCed)
made some changes some years back to support this, perhaps most notably
replacing some uses of preempt_disable() with migrate_disable().

Except that the current __DECLARE_TRACE() macro defeats this work
for tracepoints by disabling preemption across the tracepoint call,
which might well be a BPF program.  So we need to do something to
__DECLARE_TRACE() to get the right sort of protection while still leaving
preemption enabled.

One way of attacking this problem is to use preemptible RCU.  The problem
with this is that although one could construct a trace-safe version
of rcu_read_unlock(), these would negate some optimizations that Lai
Jiangshan worked so hard to put in place.  Plus those optimizations
also simplified the code quite a bit.  Which is why I was pushing back
so hard, especially given that I did not realize that real-time systems
would be running BPF programs concurrently with real-time applications.
This meant that I was looking for a functional problem with the current
disabling of preemption, and not finding it.

So another way of dealing with this is to use SRCU-fast, which is
like SRCU, but dispenses with the smp_mb() calls and the redundant
read-side array indexing.  Plus it is easy to make _notrace variants
srcu_read_lock_fast_notrace() and srcu_read_unlock_fast_notrace(),
along with the requisite guards.

Re-introducing SRCU requires reverting most of e53244e2c893 ("tracepoint:
Remove SRCU protection"), and I have hacked together this and the
prerequisites mentioned in the previous paragraph.

These are passing ridiculously light testing, but probably have at
least their share of bugs.

But first, do I actually finally understand the problem?
OK, they pass somewhat less ridiculously moderate testing, though I have
not yet hit them over the head with the ftrace selftests.

So might as well post them.

Thoughts?
Your explanation of the problem context fits my understanding.

Note that I've mostly been pulled into this by Sebastian who wanted
to understand better the how we could make the tracepoint
instrumentation work with bpf probes that need to sleep due to
locking. Hence my original somewhat high-level desiderata.
I still don't understand what problem is being solved.
As current tracepoint code stands there is no issue with it at all
on PREEMPT_RT from bpf pov.
bpf progs that attach to tracepoints are not sleepable.
They don't call rt_spinlock either.
Recognizing tracepoints that can sleep/fault and allow
sleepable bpf progs there is on our to do list,
but afaik it doesn't need any changes to tracepoint infra.
There is no need to replace existing preempt_disable wrappers
with sleepable srcu_fast or anything else.
As I understand it, functionally there is no problem.

And if the BPF program has been attached to preemption-disabled code,
there is also no unnecessary damage to real-time latency.  Preemption is
presumably disabled for a reason, and that reason must be respected.
Presumably, people using BPF in real-time systems are careful to either
avoid attaching BPF programs to non-preemptible code on the one hand or
carefully limit the runtime of those BPF programs.

But if the BPF program has been attached to code having preemption
enabled, then the current call to guard(preempt_notrace)() within
__DECLARE_TRACE() means that the entire BPF program is running with
preemption disabled, and for no good reason.

Although this is not a functional issue, it is a first-class bug in
terms of needlessly degrading real-time latency.

And even though it isn't what Sebastian originally asked Mathieu for
help with, it still needs to be fixed.

My current offer is replacing that guard(preempt_notrace)() with its
SRCU-fast counterpart of guard(srcu_fast_notrace)(&tracepoint_srcu).

Light testing is going well thus far, though I clearly need the real-time
guys to both review and test.

I am sure that Mathieu, Sebastian, and the rest won't be shy about
correcting any confusion on my part.  ;-)

							Thanx, Paul
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help