Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
From: Hugh Dickins <hughd@google.com>
Date: 2023-06-06 06:19:29
Also in:
linux-arm-kernel, linux-s390, lkml, sparclinux
On Wed, 31 May 2023, Jann Horn wrote:
On Mon, May 29, 2023 at 8:25 AM Hugh Dickins [off-list ref] wrote:quoted
+static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
...
quoted
+ * Note that vma->anon_vma check is racy: it can be set after + * the check, but page locks (with XA_RETRY_ENTRYs in holes) + * prevented establishing new ptes of the page. So we are safe + * to remove page table below, without even checking it's empty.This "we are safe to remove page table below, without even checking it's empty" assumes that the only way to create new anonymous PTEs is to use existing file PTEs, right? What about private shmem VMAs that are registered with userfaultfd as VM_UFFD_MISSING? I think for those, the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without looking at the mapping and its pages (except for checking that the insertion point is before end-of-file), protected only by mmap_lock (shared) and pte_offset_map_lock().
Right, from your comments and Peter's, thank you both, I can see that userfaultfd breaks the usual assumptions here: so I'm putting an if (unlikely(vma->anon_vma || userfaultfd_wp(vma))) check in once we've got the ptlock; with a comment above it to point the blame at uffd, though I gave up on describing all the detail. And deleted this earlier "we are safe" paragraph. You did suggest, in another mail, that perhaps there should be a scan checking all pte_none() when we get the ptlock. I wasn't keen on yet another debug scan for bugs and didn't add that, thinking I was going to add a patch on the end to do so in page_table_check_pte_clear_range(). But when I came to write that patch, found that I'd been misled by its name: it's about checking or adjusting some accounting, not really a suitable place to check for pte_none() at all; so just scrapped it. ...
quoted
- collapse_and_free_pmd(mm, vma, addr, pmd);The old code called collapse_and_free_pmd(), which involves MMU notifier invocation...
...
quoted
+ pml = pmd_lock(mm, pmd); + ptl = pte_lockptr(mm, pmd); + if (ptl != pml) + spin_lock_nested(ptl, SINGLE_DEPTH_NESTING); + pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);... while the new code only does pmdp_collapse_flush(), which clears the pmd entry and does a TLB flush, but AFAICS doesn't use MMU notifiers. My understanding is that that's problematic - maybe (?) it is sort of okay with regards to classic MMU notifier users like KVM, but it's probably wrong for IOMMUv2 users, where an IOMMU directly consumes the normal page tables?
Right, I intentionally left out the MMU notifier invocation, knowing that we have already done an MMU notifier invocation when unmapping any PTEs which were mapped: it was necessary for collapse_and_free_pmd() in the collapse_pte_mapped_thp() case, but there was no notifier in this case for many years, and I was glad to be rid of it. However, I now see that you were adding it intentionally even for this case in your f268f6cf875f; and from later comments in this thread, it looks like there is still uncertainty about whether it is needed here, but safer to assume that it is needed: I'll add it back.
(FWIW, last I looked, there also seemed to be some other issues with MMU notifier usage wrt IOMMUv2, see the thread <https://lore.kernel.org/linux-mm/Yzbaf9HW1%2FreKqR8@nvidia.com/ (local)>.)