Thread (80 messages) 80 messages, 8 authors, 2025-11-24

Re: [PATCH v4 06/12] mm: introduce generic lazy_mmu helpers

From: Ryan Roberts <ryan.roberts@arm.com>
Date: 2025-11-11 12:16:10
Also in: linux-arm-kernel, linux-mm, lkml, sparclinux, xen-devel

On 11/11/2025 08:01, Alexander Gordeev wrote:
On Mon, Nov 10, 2025 at 09:19:40AM +0000, Ryan Roberts wrote:
quoted
On 10/11/2025 08:11, Alexander Gordeev wrote:
quoted
On Fri, Nov 07, 2025 at 03:22:54PM +0000, Ryan Roberts wrote:

Hi Ryan,
quoted
On 07/11/2025 14:34, David Hildenbrand (Red Hat) wrote:
quoted
quoted
quoted
  #ifndef pte_batch_hint
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 5d2a876035d6..c49b029d3593 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -305,7 +305,7 @@ static int kasan_populate_vmalloc_pte(pte_t *ptep,
unsigned long addr,
      pte_t pte;
      int index;
  -    arch_leave_lazy_mmu_mode();
+    lazy_mmu_mode_pause();
I wonder if there really are use cases that *require* pause/resume? I think
these kasan cases could be correctly implemented using a new nest level instead?
Are there cases where the effects really need to be immediate or do the effects
just need to be visible when you get to where the resume is?

If the latter, that could just be turned into a nested disable (e.g. a flush).
In this case, there is only 1 PTE write so no benefit, but I wonder if other
cases may have more PTE writes that could then still be batched. It would be
nice to simplify the API by removing pause/resume if we can?
It has clear semantics, clearer than some nest-disable IMHO.

Maybe you can elaborate how you would change ("simplify") the API in that
regard? What would the API look like?
By simplify, I just meant can we remove lazy_mmu_mode_pause() and
lazy_mmu_mode_resume() ?


We currently have:

apply_to_page_range
  lazy_mmu_mode_enable()
    kasan_populate_vmalloc_pte()
      lazy_mmu_mode_pause()
      <code>
      lazy_mmu_mode_resume()
  lazy_mmu_mode_disable()

Where <code> is setting ptes. But if <code> doesn't need the effects to be
visible until lazy_mmu_mode_resume(), then you could replace the block with:

apply_to_page_range
  lazy_mmu_mode_enable()
    kasan_populate_vmalloc_pte()
      lazy_mmu_mode_enable()
      <code>
      lazy_mmu_mode_disable()
  lazy_mmu_mode_disable()

However, looking at this more closely, I'm not really clear on why we need *any*
special attention to lazy mmu inside of kasan_populate_vmalloc_pte() and
kasan_depopulate_vmalloc_pte().

I *think* that the original concern was that we were doing ptep_get(ptep) inside
of a lazy_mmu block? So we need to flush so that the getter returns the most
recent value? But given we have never written to that particular ptep while in
the lazy mmu block, there is surely no hazard in the first place?
There is, please see:
https://lore.kernel.org/linux-mm/cover.1755528662.git.agordeev@linux.ibm.com/ (local)
I've stared at this for a while, but I'm afraid I still don't see the problem.
This all looks safe to me. Could you explain exactly what this issue is?

If I've understood correctly, kasan_populate_vmalloc() is called during virtual
range allocation by vmalloc. This is not in a nested lazy mmu block (but it
wouldn't matter if it was once we have Kevin's nested changes to ensure flush
when exiting the nested scope). kasan_populate_vmalloc() calls
apply_to_page_range(), which will walk the set of ptes, calling
kasan_populate_vmalloc_pte() for each one. kasan_populate_vmalloc_pte() does a
ptep_get() then, if none, calls set_pte_at().

That's not a hazard since you're calling get before the set and you only visit
each pte once for the apply_to_page_range() lazy mmu block.
I have to admit I do not remember every detail and would have to recreate
the issue - which is specific to s390 lazy_mmu implementation I think.
Both kasan_populate_vmalloc_pte() and kasan_depopulate_vmalloc_pte() do:

apply_to_page_range()
{
    arch_enter_lazy_mmu_mode();

    kasan_de|populate_vmalloc_pte()
    {
        arch_leave_lazy_mmu_mode();             <--- remove?

        spin_lock(&init_mm.page_table_lock);
        <PTE update>
        spin_unlock(&init_mm.page_table_lock);	<--- PTE store should be done

        arch_enter_lazy_mmu_mode();             <--- remove?
    }

    arch_leave_lazy_mmu_mode();
}

Upon return from spin_unlock() both kasan callbacks expect the PTE contains
an updated value to be stored to pgtable. That is true unless we remove
arch_leave|enter_lazy_mmu_mode() brackets. If we do the value is continued
to be cached and only stored when the outer arch_leave_lazy_mmu_mode() is
called. That results in a race between concurrent PTE updaters.
OK, I've been staring at the code and KASAN docs and believe I understand the
problem now. Thanks for your patience!

The core of the problem is that the shadow memory which is being allocated here
is 1/8th the size of the virtual range it covers, and so a single page of shadow
memory can be shared by multiple vmalloc areas. The implication here is that
multiple concurrent vmalloc() calls can allocate adjacent areas and both will
race to allocate the same shadow page. That's why we have the spin lock and the
check for pte_none(); the winner is the one that sees pte_none() == true and
will perform the mapping.

And so yes, this does indeed create a read hazzard; there are 2 racing threads,
both reading and writing the pte.

One alternative solution would be to grab the spin lock around the whole
apply_to_page_range(), but for the populate call, that would imply holding the
lock during __memset(). And for depopulate, it would imply holding it during
__free_page(). Neither of these are desirable.

So I agree pause/resume are required here. Sorry for the noise!

Thanks,
Ryan

quoted
quoted
quoted
apply_to_existing_page_range() will only call kasan_depopulate_vmalloc_pte()
once per pte, right? So given we read the ptep before writing it, there should
be no hazard? If so we can remove pause/resume.
Unfortunately, we rather not, please see:
https://lore.kernel.org/linux-mm/d407a381-099b-4ec6-a20e-aeff4f3d750f@arm.com/ (local)
Sorry but I don't see anything relavent to my point in this mail. Perhaps there
is some s390-specific detail that I'm failing to understand?
Sorry, with this message I meant the branch where it was discussed,
I will try to C&P some excerpts and summarize it here.

* lazy_mmu_mode_enable()

This helper is parameter-free, assuming the MMU unit does not need any
configuration other than turning it on/off. That is currently true, but
(as I noted in my other mail) I am going to introduce a friend enable
function that accepts parameters, creates an arch-specific state and
uses it while the lazy mmu mode is active:

static inline void arch_enter_lazy_mmu_mode_pte(struct mm_struct *mm,
						unsigned long addr,
						unsigned long end,
						pte_t *ptep)
{
	...
}

* lazy_mmu_mode_resume() -> arch_enter_lazy_mmu_mode()

Conversely, this needs to be -> arch_resume_lazy_mmu_mode(). And it can not
be arch_enter_lazy_mmu_mode(), since a lazy_mmu_mode_resume() caller does
not know the parameters passed to the original lazy_mmu_mode_enable(...)-
friend.
quoted
Thanks,
Ryan
Thanks!
quoted
quoted
The problem is kasan code invokes apply_to_page_range(), which enters lazy_mmu
mode unconditionally. I would claim that is rather an obstacle for the kasan
code, not a benefit. But it needs to be tackled.
quoted
Should apply_to_page_range() had an option not to enter the lazy_mmu mode
(e.g. an extra "bool skip_lazy" parameter) - the pause/resume could have
been avoided.
quoted
Thanks,
Ryan
Thanks!
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help