Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping... | linux-mm

Re: [PATCH v3 11/27] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed

From: Alistair Popple <apopple@nvidia.com>
Date: 2021-07-06 05:40:53
Also in: lkml


 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 355ea1ee32bd..c29a6ef3a642 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h

@@ -4,6 +4,8 @@
 
 #include <linux/huge_mm.h>
 #include <linux/swap.h>
+#include <linux/userfaultfd_k.h>
+#include <linux/swapops.h>
 
 /**
  * page_is_file_lru - should the page be on a file LRU or anon LRU?

@@ -104,4 +106,45 @@ static __always_inline void del_page_from_lru_list(struct page *page,
 	update_lru_size(lruvec, page_lru(page), page_zonenum(page),
 			-thp_nr_pages(page));
 }
+
+/*
+ * If this pte is wr-protected by uffd-wp in any form, arm the special pte to
+ * replace a none pte.  NOTE!  This should only be called when *pte is already
+ * cleared so we will never accidentally replace something valuable.  Meanwhile
+ * none pte also means we are not demoting the pte so if tlb flushed then we
+ * don't need to do it again; otherwise if tlb flush is postponed then it's
+ * even better.
+ *
+ * Must be called with pgtable lock held.
+ */
+static inline void
+pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
+			      pte_t *pte, pte_t pteval)
+{
+#ifdef CONFIG_USERFAULTFD
+	bool arm_uffd_pte = false;
+
+	/* The current status of the pte should be "cleared" before calling */
+	WARN_ON_ONCE(!pte_none(*pte));
+
+	if (vma_is_anonymous(vma))
+		return;
+
+	/* A uffd-wp wr-protected normal pte */
+	if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
+		arm_uffd_pte = true;
+
+	/*
+	 * A uffd-wp wr-protected swap pte.  Note: this should even work for
+	 * pte_swp_uffd_wp_special() too.
+	 */

I'm probably missing something but when can we actually have this case and why
would we want to leave a special pte behind? From what I can tell this is
called from try_to_unmap_one() where this won't be true or from zap_pte_range()
when not skipping swap pages.

Yes this is a good question..

Initially I wrote this function to make sure I cover all forms of the uffd-wp
bit, which includes both swap and present ptes; imho that's pretty safe.
However, for !anonymous cases we don't keep a swap entry inside the pte even if
the page is swapped out, as it should reside in the shmem page cache instead.
The only missing piece seems to be the device private entries, as you also
spotted below.

Yes, I think it's *probably* safe although I don't yet have a strong opinion
here ...


+	if (unlikely(is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)))

... however if this can never happen would a WARN_ON() be better? It would also
mean you could remove arm_uffd_pte.

Hmm, on second thought I think we can't make it a WARN_ON_ONCE(): this can
still be useful for private mappings of shmem files. In that case the swap
entry is stored in the pte, not the page cache, so after page reclaim the pte
will contain a valid swap entry while the vma is still "!anonymous".

There's something (probably obvious) I must still be missing here. During
reclaim won't a private shmem mapping still have a present pteval here?
Therefore it won't trigger this case - the uffd wp bit is set when the swap
entry is established further down in try_to_unmap_one() right?

I agree if it's at the point when the page gets reclaimed; however, what if we
zap the pte of a page that has already been reclaimed?  It should have the swap
pte installed, imho, which will have
"is_swap_pte(pteval) && pte_swp_uffd_wp(pteval)"==true.

Apologies for the delay getting back to this, I hope to find some more time
to look at this again this week.

I guess what I am missing is why we care about a swap pte for a reclaimed page
getting zapped. I thought that would imply the mapping was getting torn down,
although I suppose in that case you still want the uffd-wp to apply in case a
new mapping appears there?
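
To make the cases discussed above concrete, here is a minimal userspace sketch
of the decision the two helpers make together. It is not kernel code: the enum
and the function name are hypothetical stand-ins, `vma_anonymous` models
vma_is_anonymous(vma), `drop_markers` models zap_drop_file_uffd_wp(details),
and `uffd_wp` models pte_uffd_wp() or pte_swp_uffd_wp() on the old pte value.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state of the pte value that was just zapped. */
enum pte_kind { PTE_NONE, PTE_PRESENT, PTE_SWAP };

/*
 * Returns true when the cleared pte should be replaced by a uffd-wp marker.
 * In the patch the drop_markers check lives in zap_install_uffd_wp_if_needed()
 * and the rest in pte_install_uffd_wp_if_needed(); they are merged here only
 * to keep the sketch short.
 */
bool should_arm_uffd_marker(bool vma_anonymous, bool drop_markers,
			    enum pte_kind kind, bool uffd_wp)
{
	/* Caller asked to drop markers, or anonymous memory: never arm. */
	if (drop_markers || vma_anonymous)
		return false;

	/* A uffd-wp wr-protected normal pte. */
	if (kind == PTE_PRESENT && uffd_wp)
		return true;

	/*
	 * A uffd-wp wr-protected swap pte, e.g. a private mapping of a shmem
	 * file whose page was already reclaimed before the zap.
	 */
	if (kind == PTE_SWAP && uffd_wp)
		return true;

	return false;
}
```

In this model the swap-pte branch is the only one that fires for the "zap after
reclaim" case, which is the scenario the WARN_ON question is about.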


+		arm_uffd_pte = true;
+
+	if (unlikely(arm_uffd_pte))
+		set_pte_at(vma->vm_mm, addr, pte,
+			   pte_swp_mkuffd_wp_special(vma));
+#endif
+}
+
 #endif

diff --git a/mm/memory.c b/mm/memory.c
index 319552efc782..3453b8ae5f4f 100644
--- a/mm/memory.c
+++ b/mm/memory.c

@@ -73,6 +73,7 @@
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
 #include <linux/vmalloc.h>
+#include <linux/mm_inline.h>
 
 #include <trace/events/kmem.h>

@@ -1298,6 +1299,21 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 	return ret;
 }
 
+/*
+ * This function makes sure that we'll replace the none pte with an uffd-wp
+ * swap special pte marker when necessary. Must be with the pgtable lock held.
+ */
+static inline void
+zap_install_uffd_wp_if_needed(struct vm_area_struct *vma,
+			      unsigned long addr, pte_t *pte,
+			      struct zap_details *details, pte_t pteval)
+{
+	if (zap_drop_file_uffd_wp(details))
+		return;
+
+	pte_install_uffd_wp_if_needed(vma, addr, pte, pteval);
+}
+
 static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,

@@ -1335,6 +1351,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
+			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+						      ptent);
 			if (unlikely(!page))
 				continue;

@@ -1359,6 +1377,22 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			continue;
 		}
 
+		/*
+		 * If this is a special uffd-wp marker pte... Drop it only if
+		 * enforced to do so.
+		 */
+		if (unlikely(is_swap_special_pte(ptent))) {
+			WARN_ON_ONCE(!pte_swp_uffd_wp_special(ptent));

Why the WARN_ON and not just test pte_swp_uffd_wp_special() directly?


+			/*
+			 * If this is a common unmap of ptes, keep this as is.
+			 * Drop it only if this is a whole-vma destruction.
+			 */
+			if (zap_drop_file_uffd_wp(details))
+				ptep_get_and_clear_full(mm, addr, pte,
+							tlb->fullmm);
+			continue;
+		}
+
 		entry = pte_to_swp_entry(ptent);
 		if (is_device_private_entry(entry) ||
 		    is_device_exclusive_entry(entry)) {

@@ -1373,6 +1407,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 				page_remove_rmap(page, false);
 
 			put_page(page);
+			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
+						      ptent);

Device entries only support anonymous vmas at present so should we drop this?
I guess I'm also a little confused by this because I'm not sure in what
scenarios you would want to zap swap entries but leave special swap ptes behind
(see also my earlier question above as well).

If that's the case, maybe this is indeed not needed, and I can use a
WARN_ON_ONCE here instead, just in case the facts change. E.g., would it be
possible one day to have !anonymous support for device private entries?
Frankly, I have no solid idea of how device private is used, so some more
context would be nice too. I think you know this much better than I do, so
maybe it's a good chance for me to learn more about it. :)

Yes, a WARN_ON_ONCE() would be good if you remove it. We are planning to add
support for !anonymous device private entries at some point.

There's nothing too special about device private entries. They exist to store
some state and look up a device driver callback that gets called when the CPU
tries to access the page. For example see how do_swap_page() handles them:

                } else if (is_device_private_entry(entry)) {
                        vmf->page = pfn_swap_entry_to_page(entry);
                        ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);

Normally a device driver provides the implementation of migrate_to_ram() which
will copy the page back to CPU addressable memory and restore the PTE to a
normal functioning PTE using the migrate_vma_*() interfaces. Typically this is
used to allow migration of a page to memory that is not directly CPU addressable
(eg. GPU memory). Hopefully that goes some way to explaining what they are, but
if you have more questions let me know!
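
As a rough illustration of the dispatch described above, here is a userspace
sketch with a driver-provided callback table. All struct and function names
are hypothetical; only the idea that the fault path calls a migrate_to_ram()
callback reached through the page comes from the do_swap_page() excerpt.

```c
#include <assert.h>
#include <stdbool.h>

struct model_page;

/* Models the driver ops table reached via page->pgmap->ops in the kernel. */
struct model_pgmap_ops {
	int (*migrate_to_ram)(struct model_page *page);
};

struct model_page {
	const struct model_pgmap_ops *ops;
	bool in_device_memory;	/* true while the page is not CPU addressable */
};

/* Hypothetical driver callback: pretend to copy the page back to CPU memory. */
int model_driver_migrate_to_ram(struct model_page *page)
{
	page->in_device_memory = false;
	return 0;
}

/* Models a CPU access that finds a device private entry for this page. */
int model_cpu_fault(struct model_page *page)
{
	if (!page->in_device_memory)
		return 0;	/* already CPU addressable, nothing to migrate */

	return page->ops->migrate_to_ram(page);
}
```

A real driver would do the copy and restore the PTE through the
migrate_vma_*() interfaces; the sketch only shows who calls whom.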

Thanks for offering these details!  One thing I'm still uncertain about is
exactly what type of memory is allowed to be mapped to device private.  E.g.,
would "anonymous shared" be allowed as "anonymous"?  I saw there seems to be
one ioctl defined that's used to bind these things:

	DRM_IOCTL_DEF_DRV(NOUVEAU_SVM_BIND, nouveau_svmm_bind, DRM_RENDER_ALLOW),

Then nouveau_dmem_migrate_chunk() will set up the device private entries, am
I right?  Then to ask my previous question in another form: if the vaddr range
is coming from a userspace extension driver, would it be allowed to pass in
some vaddr range mapped with MAP_ANONYMOUS|MAP_SHARED?

I should have been more specific - device private pages currently only support
non-file/shmem backed pages. In other words the migrate_vma_*() calls will fail
for MAP_ANONYMOUS | MAP_SHARED when the target page is a device private page.

For a present page this is enforced in migrate_vma_pages() when trying to
migrate to a device private page:

                mapping = page_mapping(page);

                if (is_zone_device_page(newpage)) {
                        if (is_device_private_page(newpage)) {
                                /*
                                 * For now only support private anonymous when
                                 * migrating to un-addressable device memory.
                                 */
                                if (mapping) {
                                        migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
                                        continue;
                                }
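
The effect of that check can be modelled in userspace as below. This is only a
sketch: MODEL_PFN_MIGRATE is a hypothetical flag bit rather than the kernel's
MIGRATE_PFN_MIGRATE value, and the two booleans stand in for
is_device_private_page(newpage) and a non-NULL page_mapping(page).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit marking a source entry as "still to be migrated". */
#define MODEL_PFN_MIGRATE (1UL << 1)

/*
 * Returns the source entry with the migrate bit cleared when the destination
 * is device private memory and the source page has a mapping (that is, it is
 * file or shmem backed). Otherwise the entry is returned unchanged.
 */
unsigned long model_filter_src(unsigned long src, bool dst_is_device_private,
			       bool src_has_mapping)
{
	if (dst_is_device_private && src_has_mapping)
		src &= ~MODEL_PFN_MIGRATE;

	return src;
}
```

With this model, a MAP_ANONYMOUS | MAP_SHARED page (which has a mapping) keeps
its other bits but loses the migrate bit, so it is skipped.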

Ah fair enough. :)

When I looked again, I also saw that there's a vma_is_anonymous() check right
at the entry of migrate_vma_insert_page() too.

I'll convert this device private call to a WARN_ON_ONCE() then, with proper
comments explaining why.

Thanks,
