Thread (50 messages), 6 authors, 2025-12-19

Re: [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y

From: Valentin Schneider <vschneid@redhat.com>
Date: 2025-11-25 14:13:54
Also in: linux-arch, linux-arm-kernel, linux-mm, linux-riscv, lkml, loongarch, rcu

On 21/11/25 09:50, Dave Hansen wrote:
On 11/21/25 09:37, Valentin Schneider wrote:
On 19/11/25 10:31, Dave Hansen wrote:
On 11/14/25 07:14, Valentin Schneider wrote:
+static bool flush_tlb_kernel_cond(int cpu, void *info)
+{
+	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
+	       per_cpu(kernel_cr3_loaded, cpu);
+}
Is it OK that 'kernel_cr3_loaded' can be stale? Since it's not part
of the instruction that actually sets CR3, there's a window between when
'kernel_cr3_loaded' is set (or cleared) and CR3 is actually written.

Is that OK?

It seems like it could lead to both unnecessary IPIs being sent and for
IPIs to be missed.
So the pattern is

  SWITCH_TO_KERNEL_CR3
  FLUSH
  KERNEL_CR3_LOADED := 1

  KERNEL_CR3_LOADED := 0
  SWITCH_TO_USER_CR3


The 0 -> 1 transition has a window between the unconditional flush and the
write to 1 where a remote flush IPI may be omitted. Given that the write is
immediately following the unconditional flush, that would really be just
two flushes racing with each other,
Let me fix that for you. When you wrote "a remote flush IPI may be
omitted" you meant to write: "there's a bug." ;)
Something like that :-)
In the end, KERNEL_CR3_LOADED==0 means, "you don't need to send this CPU
flushing IPIs because it will flush the TLB itself before touching
memory that needs a flush".

   SWITCH_TO_KERNEL_CR3
   FLUSH
   // On kernel CR3, *AND* not getting IPIs
   KERNEL_CR3_LOADED := 1
but I could punt the kernel_cr3_loaded
write above the unconditional flush.
Yes, that would eliminate the window, as long as the memory ordering is
right. You not only need to have the KERNEL_CR3_LOADED:=1 CPU set that
variable, you need to ensure that it has seen the page table update.
I assumed the page table update would be a self-synchronizing operation,
but that betrays how little I know about x86; /me goes back to reading
The 1 -> 0 transition is less problematic, worst case a remote flush races
with the CPU returning to userspace and it'll get interrupted back to
kernelspace.
It's also not just "returning to userspace". It could well be *in*
userspace by the point the IPI shows up. It's not the end of the world,
and the window isn't infinitely long. But there certainly is still a
possibility of getting spurious interrupts for the precious NOHZ_FULL
task while it's in userspace.
IME it's okay if the application is just starting as it needs to do some
initialization anyway (mlockall & friends), i.e. it's not executing actual
useful payload from the get go.

If it's resuming from an interference, well we'd be making things worse.

I'm thinking the worst case is if this becomes a repeating pattern, but
then that means even without those deferral hacks the isolated CPUs would
be bombarded by IPIs in the first place.
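
For illustration only, the reordering discussed above (writing
kernel_cr3_loaded before the unconditional flush instead of after it) can
be modelled in userspace C11. This is a rough sketch under stated
assumptions, not the patch: MODEL_NR_CPUS, housekeeping[], local_flushes[],
flush_cond(), model_kernel_entry() and model_kernel_exit() are hypothetical
names, and a seq_cst atomic store merely stands in for whatever ordering
the real entry code would need. It does not model the CR3 write itself, nor
the requirement that the CPU has seen the page table update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 4 /* hypothetical, for the model only */

/* Hypothetical stand-ins for the per-CPU state discussed in the thread. */
static atomic_bool kernel_cr3_loaded[MODEL_NR_CPUS];
static bool housekeeping[MODEL_NR_CPUS];
static int local_flushes[MODEL_NR_CPUS];

/*
 * Model of the IPI predicate: a CPU is targeted if it is a housekeeping
 * CPU or if it has advertised that it is running on the kernel CR3.
 */
static bool flush_cond(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return housekeeping[cpu] || atomic_load(&kernel_cr3_loaded[cpu]);
}

/*
 * Model of kernel entry with the flag write moved above the flush:
 *
 *   SWITCH_TO_KERNEL_CR3
 *   KERNEL_CR3_LOADED := 1
 *   FLUSH
 *
 * A remote flusher that reads the flag as 0 did so before this store,
 * and the local flush below comes after the store.
 */
static void model_kernel_entry(int cpu)
{
	/* SWITCH_TO_KERNEL_CR3 would happen here (not modelled). */
	atomic_store(&kernel_cr3_loaded[cpu], true); /* seq_cst by default */
	local_flushes[cpu]++;                        /* unconditional FLUSH */
}

/* Model of kernel exit: stop advertising before leaving the kernel CR3. */
static void model_kernel_exit(int cpu)
{
	atomic_store(&kernel_cr3_loaded[cpu], false);
	/* SWITCH_TO_USER_CR3 would happen here (not modelled). */
}
```

In this model an isolated CPU is skipped by flush_cond() only while its
flag is clear, and every entry performs its own flush after setting the
flag. The 1 -> 0 window, where an IPI can still land on a CPU that has
already returned to userspace, is unchanged by this reordering.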