Re: [PATCH v1] arm64/mm: Ensure lazy_mmu_mode never nests
From: Ryan Roberts <ryan.roberts@arm.com>
Date: 2025-06-10 13:41:04
Also in:
lkml
On 10/06/2025 13:00, Catalin Marinas wrote:
On Fri, Jun 06, 2025 at 02:56:52PM +0100, Ryan Roberts wrote:quoted
Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested") provided a quick fix to ensure that lazy_mmu_mode continues to work when CONFIG_DEBUG_PAGEALLOC is enabled, which can cause lazy_mmu_mode to nest. The solution in that patch is the make the implementation tolerant tos/is the make/is to make/quoted
nesting; when the inner nest exits lazy_mmu_mode, we exit then the outer exit becomes a nop. But this sacrifices the optimization opportunity for the remainder of the outer user.[...]quoted
I wonder if you might be willing to take this for v6.16? I think its a neater solution then my first attempt - Commit 1ef3095b1405 ("arm64/mm: Permit lazy_mmu_mode to be nested") - which is already in Linus's master. To be clear, the current solution is safe, I just think this is much neater.Maybe better, though I wouldn't say much neater. One concern I have is about whether we'll get other such nesting in the future and we need to fix them in generic code. Here we control __kernel_map_pages() but we may not for other cases. Is it the fault of the arch code that uses apply_to_page_range() via __kernel_map_pages()? It feels like it shouldn't care about the lazy mode as that's some detail of the apply_to_page_range() implementation. Maybe this API should just allow nesting.
I don't think it is possible to properly support nesting:
enter_lazy_mmu
for_each_pte {
read/modify-write pte
alloc_page
enter_lazy_mmu
make page valid
exit_lazy_mmu
write_to_page
}
exit_lazy_mmu
This example only works because lazy_mmu doesn't support nesting. The "make page
valid" operation is completed by the time of the inner exit_lazy_mmu so that the
page can be accessed in write_to_page. If nesting was supported, the inner
exit_lazy_mmu would become a nop and write_to_page would explode.
So the conclusion I eventually came to (after being nudged by Mike Rapoport at
[1]) is that this _is_ arm64's fault for creating a loop via
apply_to_page_range(). So I'm trying to fix this by breaking the loop.
[1] https://lore.kernel.org/all/aDqz7H-oBo35FRXe@kernel.org/ (local)
quoted
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 88db8a0c0b37..9f387337ccc3 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h@@ -83,21 +83,11 @@ static inline void queue_pte_barriers(void) #define __HAVE_ARCH_ENTER_LAZY_MMU_MODE static inline void arch_enter_lazy_mmu_mode(void) { - /* - * lazy_mmu_mode is not supposed to permit nesting. But in practice this - * does happen with CONFIG_DEBUG_PAGEALLOC, where a page allocation - * inside a lazy_mmu_mode section (such as zap_pte_range()) will change - * permissions on the linear map with apply_to_page_range(), which - * re-enters lazy_mmu_mode. So we tolerate nesting in our - * implementation. The first call to arch_leave_lazy_mmu_mode() will - * flush and clear the flag such that the remainder of the work in the - * outer nest behaves as if outside of lazy mmu mode. This is safe and - * keeps tracking simple. - */ - if (in_interrupt()) return; + VM_WARN_ON(test_thread_flag(TIF_LAZY_MMU));This warning is good to have back.quoted
+ set_thread_flag(TIF_LAZY_MMU); }@@ -119,6 +109,14 @@ static inline void arch_leave_lazy_mmu_mode(void) clear_thread_flag(TIF_LAZY_MMU); } +static inline bool arch_in_lazy_mmu_mode(void) +{ + if (in_interrupt()) + return false; + + return test_thread_flag(TIF_LAZY_MMU); +} + #ifdef CONFIG_TRANSPARENT_HUGEPAGE #define __HAVE_ARCH_FLUSH_PMD_TLB_RANGEdiff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c index 04d4a8f676db..4da7a847d5f3 100644 --- a/arch/arm64/mm/pageattr.c +++ b/arch/arm64/mm/pageattr.c@@ -293,18 +293,29 @@ int set_direct_map_valid_noflush(struct page *page, unsigned nr, bool valid) } #ifdef CONFIG_DEBUG_PAGEALLOC -/* - * This is - apart from the return value - doing the same - * thing as the new set_direct_map_valid_noflush() function. - * - * Unify? Explain the conceptual differences? - */ void __kernel_map_pages(struct page *page, int numpages, int enable) { + bool lazy_mmu; + if (!can_set_direct_map()) return; + /* + * This is called during page alloc or free, and maybe called while in + * lazy mmu mode. Since set_memory_valid() may also enter lazy mmu mode, + * this would cause nesting which is not supported; the inner call to + * exit the mode would exit, meaning that the outer lazy mmu mode is no + * longer benefiting from the optimization. So temporarily leave lazy + * mmu mode for the duration of the call. + */ + lazy_mmu = arch_in_lazy_mmu_mode(); + if (lazy_mmu) + arch_leave_lazy_mmu_mode(); + set_memory_valid((unsigned long)page_address(page), numpages, enable); + + if (lazy_mmu) + arch_enter_lazy_mmu_mode(); } #endif /* CONFIG_DEBUG_PAGEALLOC */So basically you are flattening the enter/leave_lazy_mmu_mode() regions. Ideally this could have been done by the nesting arch_enter_lazy_mmu_mode() automatically but that means this function returning the current mode and arch_leave_lazy_mmu_mode() taking an argument - more like the irq saving/restoring (even better renaming it to arch_restore_lazy_mmu_mode()). I guess this won't go well with the mm folk who don't seem willing to changes in this area.
We could alternatively use some per-cpu storage for a nest count, but that gets ugly quite quickly I suspect. But regardless, I'm not convinced the semantics of a properly nested lazy_mmu are safe. Thanks, Ryan
FWIW, this patch is correct.