Re: [RFC PATCH v7 30/31] x86/mm, mm/vmalloc: Defer kernel TLB flush IPIs under CONFIG_COALESCE_TLBI=y
From: Dave Hansen <hidden>
Date: 2025-11-21 17:50:16
Also in: linux-arch, linux-arm-kernel, linux-mm, linux-riscv, lkml, loongarch, rcu
On 11/21/25 09:37, Valentin Schneider wrote:
> On 19/11/25 10:31, Dave Hansen wrote:
>> On 11/14/25 07:14, Valentin Schneider wrote:
>>> +static bool flush_tlb_kernel_cond(int cpu, void *info)
>>> +{
>>> +	return housekeeping_cpu(cpu, HK_TYPE_KERNEL_NOISE) ||
>>> +	       per_cpu(kernel_cr3_loaded, cpu);
>>> +}
>>
>> Is it OK that 'kernel_cr3_loaded' can be stale? Since it's not part of the instruction that actually sets CR3, there's a window between when 'kernel_cr3_loaded' is set (or cleared) and CR3 is actually written. Is that OK? It seems like it could lead to both unnecessary IPIs being sent and IPIs being missed.
>
> So the pattern is
>
>   SWITCH_TO_KERNEL_CR3
>   FLUSH
>   KERNEL_CR3_LOADED := 1
>
>   KERNEL_CR3_LOADED := 0
>   SWITCH_TO_USER_CR3
>
> The 0 -> 1 transition has a window between the unconditional flush and the write to 1 where a remote flush IPI may be omitted. Given that the write is immediately following the unconditional flush, that would really be just two flushes racing with each other,

Let me fix that for you. When you wrote "a remote flush IPI may be omitted" you meant to write: "there's a bug." ;)

In the end, KERNEL_CR3_LOADED==0 means, "you don't need to send this CPU flushing IPIs because it will flush the TLB itself before touching memory that needs a flush".

  SWITCH_TO_KERNEL_CR3
  FLUSH
  // On kernel CR3, *AND* not getting IPIs
  KERNEL_CR3_LOADED := 1

> but I could punt the kernel_cr3_loaded write above the unconditional flush.
Yes, that would eliminate the window, as long as the memory ordering is right. It's not enough for the CPU doing KERNEL_CR3_LOADED:=1 to set that variable; you also need to ensure that it has seen the page table update.
> The 1 -> 0 transition is less problematic, worst case a remote flush races with the CPU returning to userspace and it'll get interrupted back to kernelspace.
It's also not just "returning to userspace". It could well be *in* userspace by the point the IPI shows up. It's not the end of the world, and the window isn't infinitely long. But there certainly is still a possibility of getting spurious interrupts for the precious NOHZ_FULL task while it's in userspace.