Re: [PATCH v7 00/31] context_tracking,x86: Defer some IPIs until a user->kernel transition

List: linux-trace-kernel
From: "Paul E. McKenney" <paulmck@kernel.org>
Date: 2025-11-14 18:14:04
Also in: linux-arch, linux-arm-kernel, linux-mm, linux-riscv, lkml, loongarch, rcu

On Fri, Nov 14, 2025 at 09:22:35AM -0800, Andy Lutomirski wrote:


On Fri, Nov 14, 2025, at 8:20 AM, Andy Lutomirski wrote:

On Fri, Nov 14, 2025, at 7:01 AM, Valentin Schneider wrote:

Context
=======

We've observed within Red Hat that isolated, NOHZ_FULL CPUs running a
pure-userspace application get regularly interrupted by IPIs sent from
housekeeping CPUs. Those IPIs are caused by activity on the housekeeping CPUs
leading to various on_each_cpu() calls, e.g.:

The heart of this series is the thought that while we cannot remove NOHZ_FULL
CPUs from the list of CPUs targeted by these IPIs, they may not have to execute
the callbacks immediately. Anything that only affects kernelspace can wait
until the next user->kernel transition, providing it can be executed "early
enough" in the entry code.

I want to point out that there's another option here, although anyone 
trying to implement it would be fighting against quite a lot of history.

Logically, each CPU is in one of a handful of states: user mode, idle, 
normal kernel mode (possibly subdivided into IRQ, etc), and a handful 
of very narrow windows, hopefully uninstrumented and not accessing any 
PTEs that might be invalid, in the entry and exit paths where any state 
in memory could be out of sync with actual CPU state.  (The latter 
includes right after the CPU switches to kernel mode, for example.)  
And NMI and MCE and whatever weird "security" entry types that Intel 
and AMD love to add.

The way the kernel *currently* deals with this has two big historical oddities:

1. The entry and exit code cares about ti_flags, which is per-*task*, 
which means that atomically poking it from other CPUs involves the 
runqueue lock or other shenanigans (see the idle nr_polling code for 
example), and also that it's not accessible from the user page tables 
if PTI is on.

2. The actual heavyweight atomic part (context tracking) was built for 
RCU, and it's sort of bolted on, and, as you've observed in this 
series, it's really quite awkward to do things that aren't RCU using 
context tracking.

If this were a greenfield project, I think there's a straightforward 
approach that's much nicer: stick everything into a single percpu flags 
structure.  Imagine we have cpu_flags, which tracks both the current 
state of the CPU and what work needs to be done on state changes.  On 
exit to user mode, we would atomically set the mode to USER and make 
sure we don't touch anything like vmalloc space after that.  On entry 
back to kernel mode, we would avoid vmalloc space, etc, then atomically 
switch to kernel mode and read out whatever deferred work is needed.  
As an optimization, if nothing in the current configuration needs 
atomic state tracking, the state could be left at USER_OR_KERNEL and 
the overhead of an extra atomic op at entry and exit could be avoided.

And RCU would hook into *that* instead of having its own separate set of hooks.

Please note that RCU needs to sample a given CPU's idle state from other
CPUs, and to have pretty heavy-duty ordering guarantees.  This is needed
to avoid RCU needing to wake up idle CPUs on the one hand or relying on
scheduling-clock interrupts waking up idle CPUs on the other.

Or am I missing the point of your suggestion?

I think that actually doing this would be a big improvement and would 
also be a serious project.  There's a lot of code that would get 
touched, and the existing context tracking code is subtle and 
confusing.  And, as mentioned, ti_flags has the wrong scope.

Serious care would certainly be needed!  ;-)

It's *possible* that one could avoid making ti_flags percpu either by 
extensive use of the runqueue locks or by borrowing a kludge from the 
idle code.  For the latter, right now, the reason that the 
wake-from-idle code works is that the optimized path only happens if 
the idle thread/cpu is "polling", and it's impossible for the idle 
ti_flags to be polling while the CPU isn't actually idle.  We could 
similarly observe that, if a ti_flags says it's in USER mode *and* is 
on, say, cpu 3, then cpu 3 is most definitely in USER mode.  So someone 
could try shoving the CPU number into ti_flags :-p   (USER means 
actually user or in the late exit / early entry path.)

Anyway, benefits of this whole approach would include considerably 
(IMO) increased comprehensibility compared to the current tangled ct 
code and much more straightforward addition of new things that happen 
to a target CPU conditionally depending on its mode.  And, if the flags 
word was actually per cpu, it could be mapped such that 
SWITCH_TO_KERNEL_CR3 would use it -- there could be a single CR3 write 
(and maybe CR4/invpcid depending on whether a zapped mapping is global) 
and the flush bit could depend on whether a flush is needed.  And there 
would be basically no chance that a bug that accessed 
invalidated-but-not-flushed kernel data could be undetected -- in PTI 
mode, any such access would page fault!  Similarly, if kernel text 
pokes deferred the flush and serialization, the only code that could 
execute before noticing the deferred flush would be the user-CR3 code.

Oh, and another primitive would be possible: one CPU could plausibly 
execute another CPU's interrupts or soft-irqs or whatever by taking a 
special lock that would effectively pin the remote CPU in user mode -- 
you'd set a flag in the target cpu_flags saying "pin in USER mode" and 
the transition on that CPU to kernel mode would then spin on entry to 
kernel mode and wait for the lock to be released.  This could plausibly 
get a lot of the on_each_cpu callers to switch over in one fell swoop: 
anything that needs to synchronize to the remote CPU but does not need 
to poke its actual architectural state could be executed locally while 
the remote CPU is pinned.

It would be necessary to arrange for the remote CPU to remain pinned
while the local CPU executed on its behalf.  Does the above approach
make that happen without re-introducing our current context-tracking
overhead and complexity?
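
To make the "pin in USER mode" idea concrete, here is a minimal userspace C11 sketch. Everything in it is hypothetical: the cpu_flags_t type, the bit layout, and the function names are invented for illustration and are not taken from the series or from any existing kernel API. It shows only the handshake (the remote side can pin only a CPU it observes in USER, and the target cannot leave USER while pinned). It says nothing about whether the cost stays below that of the current context-tracking code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding: low bits hold the mode, one bit is the pin flag. */
#define MODE_KERNEL  UINT32_C(0)
#define MODE_USER    UINT32_C(1)
#define PIN_IN_USER  UINT32_C(0x4)

typedef _Atomic uint32_t cpu_flags_t;

/*
 * Remote CPU: pin the target in USER mode.
 * Fails if the target is not in USER mode or is already pinned.
 */
bool pin_remote_in_user(cpu_flags_t *flags)
{
	uint32_t expected = MODE_USER;	/* USER and not pinned */

	return atomic_compare_exchange_strong(flags, &expected,
					      MODE_USER | PIN_IN_USER);
}

/* Remote CPU: drop the pin so the target may enter the kernel again. */
void unpin_remote(cpu_flags_t *flags)
{
	atomic_fetch_and(flags, ~PIN_IN_USER);
}

/*
 * Target CPU: a single USER -> KERNEL attempt.
 * Fails while the pin bit is set. Only meant to be called from USER mode.
 */
bool try_enter_kernel(cpu_flags_t *flags)
{
	uint32_t expected = MODE_USER;

	return atomic_compare_exchange_strong(flags, &expected, MODE_KERNEL);
}

/* Target CPU: the entry path spins until any pin has been released. */
void enter_kernel(cpu_flags_t *flags)
{
	while (!try_enter_kernel(flags))
		;	/* a real entry path would pause here, cpu_relax()-style */
}
```

Because both sides use a fully ordered compare-and-exchange on the same word, either the pin lands first and the target spins, or the target reaches KERNEL first and the remote side has to fall back to an IPI.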

Following up, I think that x86 can do this all with a single atomic (in the common case) per usermode round trip.  Imagine:

struct fancy_cpu_state {
  u32 work; // <-- writable by any CPU
  u32 status; // <-- readable anywhere; writable locally
};

status includes KERNEL, USER, and maybe INDETERMINATE.  (INDETERMINATE means USER but we're not committing to doing work.)

Exit to user mode:

atomic_set(&my_state->status, USER);

We need ordering in the RCU nohz_full case.  If the grace-period kthread
sees the status as USER, all the preceding KERNEL code's effects must
be visible to the grace-period kthread.

(or, in the lazy case, set to INDETERMINATE instead.)

Entry from user mode, with IRQs off, before switching to kernel CR3:

if (my_state->status == INDETERMINATE) {
  // we were lazy and we never promised to do work atomically.
  atomic_set(&my_state->status, KERNEL);
  this_entry_work = 0;
} else {
  // we were not lazy and we promised we would do work atomically
  atomic exchange the entire state to { .work = 0, .status = KERNEL }
  this_entry_work = (whatever we just read);
}

If this atomic exchange is fully ordered (as opposed to, say, _relaxed),
then this works in that if the grace-period kthread sees USER, its prior
references are guaranteed not to see later kernel-mode references from
that CPU.

if (PTI) {
  switch to kernel CR3 *and flush if this_entry_work says to flush*
} else {
  flush if this_entry_work says to flush;
}

do the rest of the work;
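
As a rough illustration of the exit/entry scheme above, here is a userspace C11 sketch. It is not kernel code, and several things in it are assumptions made for the example: packing work and status into one 64-bit atomic word, the fancy_* names, the single work bit, and the compare-and-exchange loop on the remote side. The remote side is written that way because a plain OR into the work field could land just after the target's exchange on entry, leaving the work unnoticed until the next entry.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Status names follow the mail; the numeric encoding is assumed. */
enum fancy_status { FANCY_KERNEL = 0, FANCY_USER = 1, FANCY_INDETERMINATE = 2 };

/* Hypothetical work bit, for illustration only. */
#define FANCY_WORK_TLB_FLUSH (UINT32_C(1) << 0)

/* Assumed packing: status in the high 32 bits, work in the low 32 bits. */
typedef _Atomic uint64_t fancy_cpu_state_t;

static uint64_t fancy_pack(uint32_t status, uint32_t work)
{
	return ((uint64_t)status << 32) | work;
}

/* Exit to user mode: commit to USER, or stay lazy with INDETERMINATE. */
void fancy_exit_to_user(fancy_cpu_state_t *st, bool lazy)
{
	/* Sequentially consistent store: earlier kernel accesses come first. */
	atomic_store(st, fancy_pack(lazy ? FANCY_INDETERMINATE : FANCY_USER, 0));
}

/*
 * Remote CPU: try to defer work instead of sending an IPI.
 * Returns true if the work was queued, false if the caller must send an IPI.
 */
bool fancy_defer_work(fancy_cpu_state_t *st, uint32_t work)
{
	uint64_t old = atomic_load(st);

	while ((uint32_t)(old >> 32) == FANCY_USER) {
		/* On failure, old is refreshed and the status is rechecked. */
		if (atomic_compare_exchange_weak(st, &old, old | work))
			return true;
	}
	return false;
}

/* Entry from user mode: returns the deferred work bits to act on. */
uint32_t fancy_enter_from_user(fancy_cpu_state_t *st)
{
	uint64_t old = atomic_load_explicit(st, memory_order_relaxed);

	if ((uint32_t)(old >> 32) == FANCY_INDETERMINATE) {
		/* Lazy path: nothing was promised, so no work can be pending. */
		atomic_store_explicit(st, fancy_pack(FANCY_KERNEL, 0),
				      memory_order_relaxed);
		return 0;
	}
	/* Fully ordered exchange: one atomic op claims all pending work. */
	old = atomic_exchange(st, fancy_pack(FANCY_KERNEL, 0));
	return (uint32_t)old;
}
```

In this model a remote CPU calls fancy_defer_work() first and sends an IPI only when it returns false, which happens for both KERNEL and INDETERMINATE. The lazy path uses relaxed accesses only to mirror the "no heavyweight atomic" intent. Whether any of this ordering is strong enough for RCU is exactly the open question in this thread.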



I suppose that a lot of the stuff in ti_flags could merge into here, but it could be done one bit at a time when people feel like doing so.  And I imagine, but I'm very far from confident, that RCU could use this instead of the current context tracking code.

RCU currently needs pretty heavy-duty ordering to reliably detect the
other CPUs' quiescent states without needing to wake them from idle, or,
in the nohz_full case, interrupt their userspace execution.  Not saying
it is impossible, but it will need extreme care.

The idea behind INDETERMINATE is that there are plenty of workloads that frequently switch between user and kernel mode and that would rather accept a few IPIs to avoid the heavyweight atomic operation on user -> kernel transitions.  So the default behavior could be to do KERNEL -> INDETERMINATE instead of KERNEL -> USER, but code that wants to be in user mode for a long time could go all the way to USER.  We could make it sort of automatic by noticing that we're returning from an IRQ without a context switch and going to USER (so we would get at most one unneeded IPI per normal user entry), and we could have some nice API for a program that intends to hang out in user mode for a very long time (CPU isolation users, for example) to tell the kernel to go immediately into USER mode.  (Don't we already have something that could be used for this purpose?)

RCU *could* do an smp_call_function_single() when the CPU failed
to respond, perhaps in a manner similar to how it already forces a
given CPU out of nohz_full state if that CPU has been executing in the
kernel for too long.  The real-time guys might not be amused, though.
Especially those real-time guys hitting sub-microsecond latencies.

Hmm, now I wonder if it would make sense for the default behavior of Linux to be like that.  We could call it ONEHZ.  It's like NOHZ_FULL except that user threads that don't do syscalls get one single timer tick instead of many or none.


Anyway, I think my proposal is pretty good *if* RCU could be made to use it -- the existing context tracking code is fairly expensive, and I don't think we want to invent a new context-tracking-like mechanism if we still need to do the existing thing.

If you build with CONFIG_NO_HZ_FULL=n, do you still get the heavyweight
operations when transitioning between kernel and user execution?

							Thanx, Paul
