Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode

[PATCH v5 00/10] userfaultfd: add minor fault handling · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10
[PATCH v5 01/10] hugetlb: Pass vma into huge_pte_alloc() and huge_pmd_share() · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10
Re: [PATCH v5 01/10] hugetlb: Pass vma into huge_pte_alloc() and huge_pmd_share() · Mike Kravetz <hidden> · 2021-02-12
[PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10
Re: [PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled · Mike Kravetz <hidden> · 2021-02-12
Re: [PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled · Peter Xu <peterx@redhat.com> · 2021-02-12
Re: [PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-12
Re: [PATCH v5 02/10] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled · Mike Kravetz <hidden> · 2021-02-12
[PATCH v5 03/10] mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10
[PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10
Re: [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp · Mike Kravetz <hidden> · 2021-02-12
Re: [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp · Peter Xu <peterx@redhat.com> · 2021-02-12
Re: [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp · Mike Kravetz <hidden> · 2021-02-12
Re: [PATCH v5 04/10] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp · Peter Xu <peterx@redhat.com> · 2021-02-12
[PATCH v5 05/10] userfaultfd: add minor fault registration mode · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10
Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-11
Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-11
Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode · Matthew Wilcox <willy@infradead.org> · 2021-02-12
Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode · Peter Xu <peterx@redhat.com> · 2021-02-12
Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-12
Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode · Peter Xu <peterx@redhat.com> · 2021-02-12
Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode · Mike Kravetz <hidden> · 2021-02-12
Re: [PATCH v5 05/10] userfaultfd: add minor fault registration mode · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-12
[PATCH v5 06/10] userfaultfd: disable huge PMD sharing for MINOR registered VMAs · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10
[PATCH v5 10/10] userfaultfd/selftests: add test exercising minor fault handling · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10
[PATCH v5 09/10] userfaultfd: update documentation to describe minor fault handling · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10
[PATCH v5 07/10] userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10
[PATCH v5 08/10] userfaultfd: add UFFDIO_CONTINUE ioctl · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10
Re: [PATCH v5 08/10] userfaultfd: add UFFDIO_CONTINUE ioctl · Axel Rasmussen <axelrasmussen@google.com> · 2021-02-10

From: Axel Rasmussen <axelrasmussen@google.com>
Date: 2021-02-12 19:35:15
Also in: linux-mm, lkml

On Fri, Feb 12, 2021 at 11:18 AM Mike Kravetz [off-list ref] wrote:

On 2/10/21 1:21 PM, Axel Rasmussen wrote:

quoted

This feature allows userspace to intercept "minor" faults. By "minor"
faults, I mean the following situation:

Let there exist two mappings (i.e., VMAs) to the same page(s). One of
the mappings is registered with userfaultfd (in minor mode), and the
other is not. Via the non-UFFD mapping, the underlying pages have
already been allocated & filled with some contents. The UFFD mapping
has not yet been faulted in; when it is touched for the first time,
this results in what I'm calling a "minor" fault. As a concrete
example, when working with hugetlbfs, we have huge_pte_none(), but
find_lock_page() finds an existing page.

Do we want to intercept the fault if it is for a private mapping that
will COW the page in the page cache?  I think 'yes' but just want to
confirm.  The code added to hugetlb_no_page will intercept these COW
accesses.

I can at least say this is intentional, although I admit I don't have
a precise use case in mind for the UFFD mapping being private. I
suppose it's something like, the UFFD poll thread is supposed to
(maybe) update the page contents, *before* I CoW it, and then once
it's been CoW-ed I don't want that poll thread to be able to see
whatever changes I've made?

Unless there's some different use case for this, I believe this is the
behavior we want.

<snip>

quoted

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e41b77cf6cc2..f150b10981a8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c

@@ -4366,6 +4366,38 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
                              VM_FAULT_SET_HINDEX(hstate_index(h));
                      goto backout_unlocked;
              }
+
+             /* Check for page in userfault range. */
+             if (userfaultfd_minor(vma)) {
+                     u32 hash;
+                     struct vm_fault vmf = {
+                             .vma = vma,
+                             .address = haddr,
+                             .flags = flags,
+                             /*
+                              * Hard to debug if it ends up being used by a
+                              * callee that assumes something about the
+                              * other uninitialized fields... same as in
+                              * memory.c
+                              */
+                     };
+
+                     unlock_page(page);
+
+                     /*
+                      * hugetlb_fault_mutex and i_mmap_rwsem must be dropped
+                      * before handling userfault.  Reacquire after handling
+                      * fault to make calling code simpler.
+                      */
+
+                     hash = hugetlb_fault_mutex_hash(mapping, idx);
+                     mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+                     i_mmap_unlock_read(mapping);

After dropping all the locks, we only hold a reference to the page in the
page cache.  I 'think' someone else could hole punch the page and remove it
from the cache.  IIUC, state changing while processing uffd faults is something
that users need to deal with?  Just need to make sure there are no assumptions
in the kernel code.

Yeah, this seems possible. What I'd expect to happen in that case is
something like:

1. hugetlb_no_page() calls into handle_userfault().
2. Someone hole punches the page, removing it from the page cache.
3. The UFFD poll thread gets the fault event, and issues a
UFFDIO_CONTINUE. (Say we instead were going to write an update, and
*then* UFFDIO_CONTINUE: I think the hole punch by another thread could
also happen between those two events.)
4. This calls down into hugetlb_mcopy_atomic_pte, where we try to
find_lock_page(). This returns NULL, so we bail with -EFAULT.
5. Userspace detects and deals with this error - maybe by writing to
the non-UFFD mapping, thereby putting a page back in the page cache,
or by issuing a UFFDIO_COPY or such?

Which, as far as I can see is fine? But, I am by no means an expert
yet so please correct me if this seems problematic. :)

quoted

+                     ret = handle_userfault(&vmf, VM_UFFD_MINOR);
+                     i_mmap_lock_read(mapping);
+                     mutex_lock(&hugetlb_fault_mutex_table[hash]);
+                     goto out;
+             }
      }

      /*

--
Mike Kravetz

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help