Re: [PATCH 10/12] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()


From: Hugh Dickins <hughd@google.com>
Date: 2023-06-02 05:11:39
Also in: linux-arm-kernel, linux-s390, lkml, sparclinux

On Wed, 31 May 2023, Jann Horn wrote:

On Mon, May 29, 2023 at 8:26 AM Hugh Dickins [off-list ref] wrote:


Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

I think there's a weirdness in the existing code, and this change
probably turns that into a UAF bug.

collapse_pte_mapped_thp() can be called on an address that might not
be associated with a VMA anymore, and after this change, the page
tables for that address might be in the middle of page table teardown
in munmap(), right? The existing mmap_write_lock() guards against
concurrent munmap() (so in the old code we are guaranteed to either
see a normal VMA or not see the page tables anymore), but
mmap_read_lock() only guards against the part of munmap() up to the
mmap_write_downgrade() in do_vmi_align_munmap(), and unmap_region()
(including free_pgtables()) happens after that.

Excellent point, thank you.  Don't let anyone overhear us, but I have
to confess to you that that mmap_write_downgrade() has never impinged
forcefully enough on my consciousness: it's still my habit to think of
mmap_lock as exclusive over free_pgtables(), and I've not encountered
this bug in my testing.

Right, I'll gladly incorporate your collapse_pte_mapped_thp()
rearrangement below.  And am reassured to realize that by removing
mmap_lock dependence elsewhere, I won't have got it wrong in other places.

Thanks,
Hugh


So we can now enter collapse_pte_mapped_thp() and race with concurrent
free_pgtables() such that a PUD disappears under us while we're
walking it or something like that:


int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
          bool install_pmd)
{
  struct mmu_notifier_range range;
  unsigned long haddr = addr & HPAGE_PMD_MASK;
  struct vm_area_struct *vma = vma_lookup(mm, haddr); // <<< returns NULL
  struct page *hpage;
  pte_t *start_pte, *pte;
  pmd_t *pmd, pgt_pmd;
  spinlock_t *pml, *ptl;
  int nr_ptes = 0, result = SCAN_FAIL;
  int i;

  mmap_assert_locked(mm);

  /* Fast check before locking page if already PMD-mapped */
  result = find_pmd_or_thp_or_none(mm, haddr, &pmd); // <<< PUD UAF in here
  if (result == SCAN_PMD_MAPPED)
    return result;

  if (!vma || !vma->vm_file || // <<< bailout happens too late
      !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
    return SCAN_VMA_CHECK;


I guess the right fix here is to make sure that at least the basic VMA
revalidation stuff (making sure there still is a VMA covering this
range) happens before find_pmd_or_thp_or_none()? Like:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 301c0e54a2ef..5db365587556 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c

@@ -1481,15 +1481,15 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
         mmap_assert_locked(mm);

+        if (!vma || !vma->vm_file ||
+            !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
+                return SCAN_VMA_CHECK;
+
         /* Fast check before locking page if already PMD-mapped */
         result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
         if (result == SCAN_PMD_MAPPED)
                 return result;

-        if (!vma || !vma->vm_file ||
-            !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
-                return SCAN_VMA_CHECK;
-
         /*
          * If we are here, we've succeeded in replacing all the native pages
          * in the page cache with a single hugepage. If a mm were to fault-in
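
To make the ordering argument above concrete, here is a minimal user-space sketch. It is not kernel code: every toy_* name is hypothetical, and it assumes the window described in the thread, where munmap() has already detached the VMA and passed mmap_write_downgrade(), so free_pgtables() can run while another task holds mmap_read_lock(). The "walk" only counts accesses to freed tables instead of dereferencing anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy result codes; loosely named after SCAN_VMA_CHECK, otherwise invented. */
enum toy_scan { TOY_SCAN_VMA_CHECK, TOY_SCAN_CONTINUE };

/* Toy model of the state a reader can observe under mmap_read_lock(). */
struct toy_mm {
    bool vma_present;    /* would a VMA lookup still find a covering VMA? */
    bool pgtables_freed; /* has page table teardown already happened? */
};

/* State assumed after the downgrade: VMA gone, page tables torn down. */
struct toy_mm toy_mm_after_downgrade(void)
{
    struct toy_mm mm = { .vma_present = false, .pgtables_freed = true };
    return mm;
}

/* Stand-in for the page table walk: counts use-after-free accesses. */
void toy_walk_page_tables(const struct toy_mm *mm, int *uaf_count)
{
    if (mm->pgtables_freed)
        (*uaf_count)++;
}

/* Ordering before the fix: walk first, validate the VMA afterwards. */
enum toy_scan toy_collapse_walk_first(const struct toy_mm *mm, int *uaf_count)
{
    toy_walk_page_tables(mm, uaf_count);
    if (!mm->vma_present)
        return TOY_SCAN_VMA_CHECK; /* bailout happens too late */
    return TOY_SCAN_CONTINUE;
}

/* Ordering after the fix: validate the VMA before touching page tables. */
enum toy_scan toy_collapse_check_first(const struct toy_mm *mm, int *uaf_count)
{
    if (!mm->vma_present)
        return TOY_SCAN_VMA_CHECK; /* bail out before any walk */
    toy_walk_page_tables(mm, uaf_count);
    return TOY_SCAN_CONTINUE;
}

/* Runs both orderings against the torn-down state and reports the counts. */
void toy_demo(void)
{
    struct toy_mm mm = toy_mm_after_downgrade();
    int uaf_old = 0, uaf_new = 0;

    toy_collapse_walk_first(&mm, &uaf_old);
    toy_collapse_check_first(&mm, &uaf_new);
    assert(uaf_old == 1 && uaf_new == 0);
    printf("walk-first UAF accesses: %d, check-first: %d\n", uaf_old, uaf_new);
}
```

Calling toy_demo() from a main() exercises both orderings. Both return the same result code for the torn-down state; the only difference in this model is whether freed page tables are touched before the bailout, which is the point of moving the VMA check up.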
