Re: [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y
From: Valentin Schneider <vschneid@redhat.com>
Date: 2025-11-25 14:13:54
Also in:
linux-arch, linux-arm-kernel, linux-mm, linux-riscv, lkml, loongarch, rcu
On 21/11/25 09:50, Dave Hansen wrote:
> On 11/21/25 09:37, Valentin Schneider wrote:
>> On 19/11/25 10:31, Dave Hansen wrote:
>>> On 11/14/25 07:14, Valentin Schneider wrote:
>>>> +static bool flush_tlb_kernel_cond(int cpu, void *info)
>>>> +{
>>>> +	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
>>>> +	       per_cpu(kernel_cr3_loaded, cpu);
>>>> +}
>>>
>>> Is it OK that 'kernel_cr3_loaded' can be stale? Since it's not part
>>> of the instruction that actually sets CR3, there's a window between
>>> when 'kernel_cr3_loaded' is set (or cleared) and CR3 is actually
>>> written.
>>>
>>> Is that OK?
>>>
>>> It seems like it could lead to both unnecessary IPIs being sent and
>>> for IPIs to be missed.
>>
>> So the pattern is
>>
>>	SWITCH_TO_KERNEL_CR3
>>	FLUSH
>>	KERNEL_CR3_LOADED := 1
>>
>>	KERNEL_CR3_LOADED := 0
>>	SWITCH_TO_USER_CR3
>>
>> The 0 -> 1 transition has a window between the unconditional flush
>> and the write to 1 where a remote flush IPI may be omitted. Given that
>> the write is immediately following the unconditional flush, that would
>> really be just two flushes racing with each other,
>
> Let me fix that for you. When you wrote "a remote flush IPI may be
> omitted" you meant to write: "there's a bug." ;)

Something like that :-)

> In the end, KERNEL_CR3_LOADED==0 means, "you don't need to send this
> CPU flushing IPIs because it will flush the TLB itself before touching
> memory that needs a flush".
>
>	SWITCH_TO_KERNEL_CR3
>	FLUSH
>	// On kernel CR3, *AND* not getting IPIs
>	KERNEL_CR3_LOADED := 1
>
>> but I could punt the kernel_cr3_loaded write above the unconditional
>> flush.
>
> Yes, that would eliminate the window, as long as the memory ordering
> is right. You not only need to have the KERNEL_CR3_LOADED:=1 CPU set
> that variable, you need to ensure that it has seen the page table
> update.

I assumed the page table update would be a self-synchronizing operation, but that betrays how little I know about x86; /me goes back to reading
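
To make the ordering requirement concrete, here is a minimal userspace sketch in C11 atomics, not kernel code. It assumes the flag write is moved above the flush, as suggested above, and it models the page tables as a plain generation counter. All names except kernel_cr3_loaded (pte_generation, flushed_generation, local_flush(), kernel_entry(), kernel_exit(), remote_update_needs_ipi()) are hypothetical, and the fences only stand in for whatever ordering the real CR3 write and TLB flush provide, which this sketch does not try to capture.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical userspace stand-ins for per-CPU kernel state. */
static atomic_bool kernel_cr3_loaded[NR_CPUS];
static atomic_int  pte_generation;              /* bumped per "page table update" */
static atomic_int  flushed_generation[NR_CPUS]; /* generation seen by last flush  */

/* Stand-in for a full local TLB flush: observe the current page tables. */
static void local_flush(int cpu)
{
	int gen = atomic_load_explicit(&pte_generation, memory_order_relaxed);

	atomic_store_explicit(&flushed_generation[cpu], gen, memory_order_relaxed);
}

/* Kernel entry with the flag write moved above the unconditional flush. */
static void kernel_entry(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	/* SWITCH_TO_KERNEL_CR3 would happen here. */
	atomic_store_explicit(&kernel_cr3_loaded[cpu], true, memory_order_relaxed);
	/* Full barrier: publish the flag before the flush samples page tables. */
	atomic_thread_fence(memory_order_seq_cst);
	local_flush(cpu);
}

/* Kernel exit: stop asking for IPIs, then leave for userspace. */
static void kernel_exit(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	atomic_store_explicit(&kernel_cr3_loaded[cpu], false, memory_order_release);
	/* SWITCH_TO_USER_CR3 would happen here. */
}

/* Remote side: update page tables, then decide whether target needs an IPI. */
static bool remote_update_needs_ipi(int target)
{
	assert(target >= 0 && target < NR_CPUS);
	atomic_fetch_add_explicit(&pte_generation, 1, memory_order_relaxed);
	/* Full barrier: publish the update before sampling the target's flag. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&kernel_cr3_loaded[target], memory_order_relaxed);
}
```

With a full fence on both sides, the C11 model guarantees that at least one side observes the other: either the remote CPU sees the flag set and sends the IPI, or the entering CPU's flush samples the updated page tables. Whether the actual entry code gets an equivalent guarantee is exactly the open question above.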

>> The 1 -> 0 transition is less problematic, worst case a remote flush
>> races with the CPU returning to userspace and it'll get interrupted
>> back to kernelspace.
>
> It's also not just "returning to userspace". It could well be *in*
> userspace by the point the IPI shows up.
>
> It's not the end of the world, and the window isn't infinitely long.
> But there certainly is still a possibility of getting spurious
> interrupts for the precious NOHZ_FULL task while it's in userspace.

IME it's okay if the application is just starting as it needs to do some initialization anyway (mlockall & friends), i.e. it's not executing actual useful payload from the get go. If it's resuming from an interference, well we'd be making things worse. I'm thinking the worst case is if this becomes a repeating pattern, but then that means even without those deferral hacks the isolated CPUs would be bombarded by IPIs in the first place.
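
For reference, the spurious-IPI case on the 1 -> 0 transition can be illustrated with a small userspace model of the filter from the patch hunk quoted at the top. This is only a sketch: housekeeping_mask, flush_cond() and ipi_targets() are hypothetical stand-ins for housekeeping_cpu(), the per-CPU kernel_cr3_loaded variable and the IPI sender, and the choice of CPUs 0-1 as housekeeping is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8

/* Hypothetical stand-ins; bit N of a mask means CPU N. */
static uint32_t housekeeping_mask = 0x03; /* assumption: CPUs 0-1 are housekeeping */
static bool     kernel_cr3_loaded[NR_CPUS];

/* Same shape as the quoted predicate: housekeeping CPU, or flag currently set. */
static bool flush_cond(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return (housekeeping_mask & (1u << cpu)) || kernel_cr3_loaded[cpu];
}

/* Build the set of CPUs a kernel TLB flush would interrupt right now. */
static uint32_t ipi_targets(void)
{
	uint32_t targets = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (flush_cond(cpu))
			targets |= 1u << cpu;
	return targets;
}
```

If the sender computes ipi_targets() while an isolated CPU's flag is still 1, that CPU stays in the mask even if it has reached userspace by the time the IPI lands. The result is a spurious interrupt rather than a missed flush.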