Thread (42 messages) 42 messages, 6 authors, 2023-06-08

Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

From: Hugh Dickins <hughd@google.com>
Date: 2023-06-06 06:19:29
Also in: linux-arm-kernel, linux-s390, lkml, sparclinux

On Wed, 31 May 2023, Jann Horn wrote:
On Mon, May 29, 2023 at 8:25 AM Hugh Dickins [off-list ref] wrote:
quoted
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
...
quoted
+                * Note that vma->anon_vma check is racy: it can be set after
+                * the check, but page locks (with XA_RETRY_ENTRYs in holes)
+                * prevented establishing new ptes of the page. So we are safe
+                * to remove page table below, without even checking it's empty.
This "we are safe to remove page table below, without even checking
it's empty" assumes that the only way to create new anonymous PTEs is
to use existing file PTEs, right? What about private shmem VMAs that
are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
looking at the mapping and its pages (except for checking that the
insertion point is before end-of-file), protected only by mmap_lock
(shared) and pte_offset_map_lock().
Right, from your comments and Peter's, thank you both, I can see that
userfaultfd breaks the usual assumptions here: so I'm putting an
		if (unlikely(vma->anon_vma || userfaultfd_wp(vma)))
check in once we've got the ptlock; with a comment above it to point
the blame at uffd, though I gave up on describing all the detail.
And deleted this earlier "we are safe" paragraph.

You did suggest, in another mail, that perhaps there should be a scan
checking all pte_none() when we get the ptlock.  I wasn't keen on yet
another debug scan for bugs and didn't add that, thinking I was going
to add a patch on the end to do so in page_table_check_pte_clear_range().

But when I came to write that patch, found that I'd been misled by its
name: it's about checking or adjusting some accounting, not really a
suitable place to check for pte_none() at all; so just scrapped it.

...
quoted
-                       collapse_and_free_pmd(mm, vma, addr, pmd);
The old code called collapse_and_free_pmd(), which involves MMU
notifier invocation...
...
quoted
+               pml = pmd_lock(mm, pmd);
+               ptl = pte_lockptr(mm, pmd);
+               if (ptl != pml)
+                       spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+               pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
... while the new code only does pmdp_collapse_flush(), which clears
the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
notifiers. My understanding is that that's problematic - maybe (?) it
is sort of okay with regards to classic MMU notifier users like KVM,
but it's probably wrong for IOMMUv2 users, where an IOMMU directly
consumes the normal page tables?
Right, I intentionally left out the MMU notifier invocation, knowing
that we have already done an MMU notifier invocation when unmapping
any PTEs which were mapped: it was necessary for collapse_and_free_pmd()
in the collapse_pte_mapped_thp() case, but there was no notifier in this
case for many years, and I was glad to be rid of it.

However, I now see that you were adding it intentionally even for this
case in your f268f6cf875f; and from later comments in this thread, it
looks like there is still uncertainty about whether it is needed here,
but safer to assume that it is needed: I'll add it back.
(FWIW, last I looked, there also seemed to be some other issues with
MMU notifier usage wrt IOMMUv2, see the thread
<https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/ (local)>.)
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help