Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers
From: Alexander Gordeev <agordeev@linux.ibm.com>
Date: 2025-11-11 08:03:19
Also in:
linux-arm-kernel, linux-mm, lkml, sparclinux, xen-devel
On Mon, Nov 10, 2025 at 09:19:40AM +0000, Ryan Roberts wrote:
On 10/11/2025 08:11, Alexander Gordeev wrote:quoted
On Fri, Nov 07, 2025 at 03:22:54PM +0000, Ryan Roberts wrote: Hi Ryan,quoted
On 07/11/2025 14:34, David Hildenbrand (Red Hat) wrote:quoted
quoted
quoted
#ifndef pte_batch_hintdiff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c index 5d2a876035d6..c49b029d3593 100644 --- a/mm/kasan/shadow.c +++ b/mm/kasan/shadow.c@@ -305,7 +305,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep,unsigned long addr, pte_t pte; int index; - arch_leave_lazy_mmu_mode(); + lazy_mmu_mode_pause();I wonder if there really are use cases that *require* pause/resume? I think these kasan cases could be correctly implemented using a new nest level instead? Are there cases where the effects really need to be immediate or do the effects just need to be visible when you get to where the resume is? If the latter, that could just be turned into a nested disable (e.g. a flush). In this case, there is only 1 PTE write so no benefit, but I wonder if other cases may have more PTE writes that could then still be batched. It would be nice to simplify the API by removing pause/resume if we can?It has clear semantics, clearer than some nest-disable IMHO. Maybe you can elaborate how you would change ("simplify") the API in that regard? What would the API look like?By simplify, I just meant can we remove lazy_mmu_mode_pause() and lazy_mmu_mode_resume() ? We currently have: apply_to_page_range lazy_mmu_mode_enable() kasan_populate_vmalloc_pte() lazy_mmu_mode_pause() <code> lazy_mmu_mode_resume() lazy_mmu_mode_disable() Where <code> is setting ptes. But if <code> doesn't need the effects to be visible until lazy_mmu_mode_resume(), then you could replace the block with: apply_to_page_range lazy_mmu_mode_enable() kasan_populate_vmalloc_pte() lazy_mmu_mode_enable() <code> lazy_mmu_mode_disable() lazy_mmu_mode_disable() However, looking at this more closely, I'm not really clear on why we need *any* special attention to lazy mmu inside of kasan_populate_vmalloc_pte() and kasan_depopulate_vmalloc_pte(). I *think* that the original concern was that we were doing ptep_get(ptep) inside of a lazy_mmu block? So we need to flush so that the getter returns the most recent value? But given we have never written to that particular ptep while in the lazy mmu block, there is surely no hazard in the first place?There is, please see: https://lore.kernel.org/linux-mm/cover.1755528662.git.agordeev@linux.ibm.com/ (local)I've stared at this for a while, but I'm afraid I still don't see the problem. This all looks safe to me. Could you explain exactly what this issue is? If I've understood correctly, kasan_populate_vmalloc() is called during virtual range allocation by vmalloc. This is not in a nested lazy mmu block (but it wouldn't matter if it was once we have Kevin's nested changes to ensure flush when exiting the nested scope). kasan_populate_vmalloc() calls apply_to_page_range(), which will walk the set of ptes, calling kasan_populate_vmalloc_pte() for each one. kasan_populate_vmalloc_pte() does a ptep_get() then, if none, calls set_pte_at(). That's not a hazard since you're calling get before the set and you only visit each pte once for the apply_to_page_range() lazy mmu block.
I have to admit I do not remember every detail and would have to recreate
the issue - which is specific to s390 lazy_mmu implementation I think.
Both kasan_populate_vmalloc_pte() and kasan_depopulate_vmalloc_pte() do:
apply_to_page_range()
{
arch_enter_lazy_mmu_mode();
kasan_de|populate_vmalloc_pte()
{
arch_leave_lazy_mmu_mode(); <--- remove?
spin_lock(&init_mm.page_table_lock);
<PTE update>
spin_unlock(&init_mm.page_table_lock); <--- PTE store should be done
arch_enter_lazy_mmu_mode(); <--- remove?
}
arch_leave_lazy_mmu_mode();
}
Upon return from spin_unlock() both kasan callbacks expect the PTE contains
an updated value to be stored to pgtable. That is true unless we remove
arch_leave|enter_lazy_mmu_mode() brackets. If we do the value is continued
to be cached and only stored when the outer arch_leave_lazy_mmu_mode() is
called. That results in a race between concurrent PTE updaters.
quoted
quoted
apply_to_existing_page_range() will only call kasan_depopulate_vmalloc_pte() once per pte, right? So given we read the ptep before writing it, there should be no hazard? If so we can remove pause/resume.Unfortunately, we rather not, please see: https://lore.kernel.org/linux-mm/d407a381-099b-4ec6-a20e-aeff4f3d750f@arm.com/ (local)Sorry but I don't see anything relavent to my point in this mail. Perhaps there is some s390-specific detail that I'm failing to understand?
Sorry, with this message I meant the branch where it was discussed,
I will try to C&P some excerpts and summarize it here.
* lazy_mmu_mode_enable()
This helper is parameter-free, assuming the MMU unit does not need any
configuration other than turning it on/off. That is currently true, but
(as I noted in my other mail) I am going to introduce a friend enable
function that accepts parameters, creates an arch-specific state and
uses it while the lazy mmu mode is active:
static inline void arch_enter_lazy_mmu_mode_pte(struct mm_struct *mm,
unsigned long addr,
unsigned long end,
pte_t *ptep)
{
...
}
* lazy_mmu_mode_resume() -> arch_enter_lazy_mmu_mode()
Conversely, this needs to be -> arch_resume_lazy_mmu_mode(). And it can not
be arch_enter_lazy_mmu_mode(), since a lazy_mmu_mode_resume() caller does
not know the parameters passed to the original lazy_mmu_mode_enable(...)-
friend.
Thanks, Ryan
Thanks!
quoted
The problem is kasan code invokes apply_to_page_range(), which enters lazy_mmu mode unconditionally. I would claim that is rather an obstacle for the kasan code, not a benefit. But it needs to be tackled.quoted
Should apply_to_page_range() had an option not to enter the lazy_mmu mode(e.g. an extra "bool skip_lazy" parameter) - the pause/resume could have been avoided.quoted
Thanks, RyanThanks!