[PATCH 00/12] fs/dax: Fix FS DAX page reference counts · Alistair Popple <apopple@nvidia.com> · 2024-09-10
[PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 01/12] mm/gup.c: Remove redundant check for PCI P2PDMA page · Dan Williams <hidden> · 2024-09-22
[PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one · Bjorn Helgaas <helgaas@kernel.org> · 2024-09-10
Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one · Alistair Popple <apopple@nvidia.com> · 2024-09-11
Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one · Bjorn Helgaas <helgaas@kernel.org> · 2024-09-11
Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one · Logan Gunthorpe <logang@deltatee.com> · 2024-09-11
Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one · Alistair Popple <apopple@nvidia.com> · 2024-10-11
Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one · Dan Williams <hidden> · 2024-09-22
Re: [PATCH 02/12] pci/p2pdma: Don't initialise page refcount to one · Alistair Popple <apopple@nvidia.com> · 2024-10-11
[PATCH 03/12] fs/dax: Refactor wait for dax idle page · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 03/12] fs/dax: Refactor wait for dax idle page · Dan Williams <hidden> · 2024-09-22
[PATCH 04/12] mm: Allow compound zone device pages · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 04/12] mm: Allow compound zone device pages · Matthew Wilcox <willy@infradead.org> · 2024-09-10
Re: [PATCH 04/12] mm: Allow compound zone device pages · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 04/12] mm: Allow compound zone device pages · Matthew Wilcox <willy@infradead.org> · 2024-09-10
Re: [PATCH 04/12] mm: Allow compound zone device pages · kernel test robot <hidden> · 2024-09-12
Re: [PATCH 04/12] mm: Allow compound zone device pages · kernel test robot <hidden> · 2024-09-12
Re: [PATCH 04/12] mm: Allow compound zone device pages · Dan Williams <hidden> · 2024-09-22
[PATCH 05/12] mm/memory: Add dax_insert_pfn · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 05/12] mm/memory: Add dax_insert_pfn · Dan Williams <hidden> · 2024-09-22
Re: [PATCH 05/12] mm/memory: Add dax_insert_pfn · Gerald Schaefer <gerald.schaefer@linux.ibm.com> · 2024-10-01
[PATCH 06/12] huge_memory: Allow mappings of PUD sized pages · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages · Dan Williams <hidden> · 2024-09-22
Re: [PATCH 06/12] huge_memory: Allow mappings of PUD sized pages · Alistair Popple <apopple@nvidia.com> · 2024-10-14
[PATCH 07/12] huge_memory: Allow mappings of PMD sized pages · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 07/12] huge_memory: Allow mappings of PMD sized pages · Dan Williams <hidden> · 2024-09-27
Re: [PATCH 07/12] huge_memory: Allow mappings of PMD sized pages · Alistair Popple <apopple@nvidia.com> · 2024-10-14
Re: [PATCH 07/12] huge_memory: Allow mappings of PMD sized pages · Alistair Popple <apopple@nvidia.com> · 2024-10-23
Re: [PATCH 07/12] huge_memory: Allow mappings of PMD sized pages · Dan Williams <hidden> · 2024-10-23
[PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages · Dan Williams <hidden> · 2024-09-25
Re: [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages · Dan Williams <hidden> · 2024-09-27
Re: [PATCH 08/12] gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages · Alistair Popple <apopple@nvidia.com> · 2024-10-14
[PATCH 09/12] mm: Update vm_normal_page() callers to accept FS DAX pages · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 09/12] mm: Update vm_normal_page() callers to accept FS DAX pages · Dan Williams <hidden> · 2024-09-27
Re: [PATCH 09/12] mm: Update vm_normal_page() callers to accept FS DAX pages · Alistair Popple <apopple@nvidia.com> · 2024-10-14
[PATCH 10/12] fs/dax: Properly refcount fs dax pages · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages · Dan Williams <hidden> · 2024-09-27
Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages · Alistair Popple <apopple@nvidia.com> · 2024-10-24
Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages · Dan Williams <hidden> · 2024-10-24
Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages · Alistair Popple <apopple@nvidia.com> · 2024-10-25
Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages · Dan Williams <hidden> · 2024-10-25
Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages · Alistair Popple <apopple@nvidia.com> · 2024-10-28
Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages · Dan Williams <hidden> · 2024-10-29
Re: [PATCH 10/12] fs/dax: Properly refcount fs dax pages · Alistair Popple <apopple@nvidia.com> · 2024-10-30
[PATCH 11/12] mm: Remove pXX_devmap callers · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 11/12] mm: Remove pXX_devmap callers · Alexander Gordeev <agordeev@linux.ibm.com> · 2024-09-27
Re: [PATCH 11/12] mm: Remove pXX_devmap callers · Alistair Popple <apopple@nvidia.com> · 2024-10-14
[PATCH 12/12] mm: Remove devmap related functions and page table bits · Alistair Popple <apopple@nvidia.com> · 2024-09-10
Re: [PATCH 12/12] mm: Remove devmap related functions and page table bits · Chunyan Zhang <zhang.lyra@gmail.com> · 2024-09-11
Re: [PATCH 12/12] mm: Remove devmap related functions and page table bits · kernel test robot <hidden> · 2024-09-12

Re: [PATCH 09/12] mm: Update vm_normal_page() callers to accept FS DAX pages

From: Dan Williams <hidden>
Date: 2024-09-27 07:16:03
Also in: linux-arm-kernel, linux-cxl, linux-doc, linux-ext4, linux-fsdevel, linux-mm, linux-xfs, lkml, nvdimm
Subsystem: control group - memory resource controller (memcg), memory management, the rest · Maintainers: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Andrew Morton, Linus Torvalds

Alistair Popple wrote:

Currently if a PTE points to a FS DAX page vm_normal_page() will
return NULL as these have their own special refcounting scheme. A
future change will allow FS DAX pages to be refcounted the same as any
other normal page.

Therefore vm_normal_page() will start returning FS DAX pages. To avoid
any change in behaviour callers that don't expect FS DAX pages will
need to explicitly check for this. As vm_normal_page() can already
return ZONE_DEVICE pages most callers already include a check for any
ZONE_DEVICE page.

However some callers don't, so add explicit checks where required.

I would expect justification for each of these conversions, and
hopefully with fsdax returning fully formed folios there is less need to
sprinkle these checks around.

At a minimum I think this patch needs to be broken up by file touched.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 arch/x86/mm/pat/memtype.c |  4 +++-
 fs/proc/task_mmu.c        | 16 ++++++++++++----
 mm/memcontrol-v1.c        |  2 +-
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 1fa0bf6..eb84593 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c

@@ -951,6 +951,7 @@ static void free_pfn_range(u64 paddr, unsigned long size)
 static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
 		resource_size_t *phys)
 {
+	struct folio *folio;
 	pte_t *ptep, pte;
 	spinlock_t *ptl;

@@ -960,7 +961,8 @@ static int follow_phys(struct vm_area_struct *vma, unsigned long *prot,
 	pte = ptep_get(ptep);
 
 	/* Never return PFNs of anon folios in COW mappings. */
-	if (vm_normal_folio(vma, vma->vm_start, pte)) {
+	folio = vm_normal_folio(vma, vma->vm_start, pte);
+	if (folio || (folio && !folio_is_device_dax(folio))) {

...for example, I do not immediately see why follow_phys() would need to
be careful with fsdax pages?

...but I do see why copy_page_range() (which calls follow_phys() through
track_pfn_copy()) might care. It just turns out that vma_needs_copy(),
afaics, bypasses dax MAP_SHARED mappings.

So this touch of memtype.c looks like it can be dropped.
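
A side note on the quoted condition itself: by Boolean absorption, `folio || (folio && !folio_is_device_dax(folio))` evaluates the same as plain `folio`, so as written the hunk would not change which PTEs are rejected. The userspace sketch below checks this over every input combination. `quoted_condition()`, `plain_condition()` and `conditions_equivalent()` are hypothetical stand-ins that operate on booleans; they are not kernel functions.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the condition in the quoted hunk:
 * has_folio models "folio != NULL", is_dax models folio_is_device_dax().
 */
bool quoted_condition(bool has_folio, bool is_dax)
{
	return has_folio || (has_folio && !is_dax);
}

/* Hypothetical model of the condition before the patch. */
bool plain_condition(bool has_folio)
{
	return has_folio;
}

/* Returns true when both conditions agree for every input combination. */
bool conditions_equivalent(void)
{
	for (int f = 0; f <= 1; f++)
		for (int d = 0; d <= 1; d++)
			if (quoted_condition(f, d) != plain_condition(f))
				return false;
	return true;
}
```

Under this model the two forms agree for all inputs, which is consistent with the conclusion that the memtype.c change can be dropped.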

quoted hunk ↗ jump to hunk

 		pte_unmap_unlock(ptep, ptl);
 		return -EINVAL;
 	}

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5f171ad..456b010 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c

@@ -816,6 +816,8 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 
 	if (pte_present(ptent)) {
 		page = vm_normal_page(vma, addr, ptent);
+		if (page && is_device_dax_page(page))
+			page = NULL;
 		young = pte_young(ptent);
 		dirty = pte_dirty(ptent);
 		present = true;

@@ -864,6 +866,8 @@ static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
 
 	if (pmd_present(*pmd)) {
 		page = vm_normal_page_pmd(vma, addr, *pmd);
+		if (page && is_device_dax_page(page))
+			page = NULL;
 		present = true;
 	} else if (unlikely(thp_migration_supported() && is_swap_pmd(*pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);

The above can be replaced with a catch like

   if (folio_test_device(folio))
	return;

...in smaps_account() since ZONE_DEVICE pages are not suitable to
account as they do not reflect any memory pressure on the system memory
pool.
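
A minimal userspace sketch of that idea, assuming simplified stand-in types: `struct mock_folio`, `struct mock_mem_size_stats` and `mock_smaps_account()` are hypothetical and only model the early return, not the real smaps_account() signature or its bookkeeping.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for kernel types; not the real layout. */
struct mock_folio {
	bool zone_device;	/* models a ZONE_DEVICE-backed folio */
	size_t nr_bytes;	/* size of the mapping backed by this folio */
};

struct mock_mem_size_stats {
	size_t resident;	/* bytes accounted against system memory */
};

/*
 * Models the proposed catch: device folios are skipped in one place,
 * before any accounting, instead of every caller nulling out the page.
 */
void mock_smaps_account(struct mock_mem_size_stats *mss,
			const struct mock_folio *folio)
{
	if (folio->zone_device)
		return;
	mss->resident += folio->nr_bytes;
}
```

With the check in the accounting helper, the per-caller `page = NULL` hunks in smaps_pte_entry() and smaps_pmd_entry() would no longer be needed in this model.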

@@ -1385,7 +1389,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
 	if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
 		return false;
 	folio = vm_normal_folio(vma, addr, pte);
-	if (!folio)
+	if (!folio || folio_is_device_dax(folio))
 		return false;
 	return folio_maybe_dma_pinned(folio);

The whole point of ZONE_DEVICE is to account for DMA so I see no reason
for pte_is_pinned() to special case dax. The caller of pte_is_pinned()
is doing it for soft_dirty reasons, and I believe soft_dirty is already
disabled for vma_is_dax(). I assume MEMORY_DEVICE_PRIVATE also does not
support soft-dirty, so I expect all ZONE_DEVICE already opt-out of this.

@@ -1710,6 +1714,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 			frame = pte_pfn(pte);
 		flags |= PM_PRESENT;
 		page = vm_normal_page(vma, addr, pte);
+		if (page && is_device_dax_page(page))
+			page = NULL;
 		if (pte_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
 		if (pte_uffd_wp(pte))
@@ -2096,7 +2102,8 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
 
 		if (p->masks_of_interest & PAGE_IS_FILE) {
 			page = vm_normal_page(vma, addr, pte);
-			if (page && !PageAnon(page))
+			if (page && !PageAnon(page) &&
+			    !is_device_dax_page(page))
 				categories |= PAGE_IS_FILE;
 		}
 
@@ -2158,7 +2165,8 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
 
 		if (p->masks_of_interest & PAGE_IS_FILE) {
 			page = vm_normal_page_pmd(vma, addr, pmd);
-			if (page && !PageAnon(page))
+			if (page && !PageAnon(page) &&
+			    !is_device_dax_page(page))
 				categories |= PAGE_IS_FILE;
 		}
 
@@ -2919,7 +2927,7 @@ static struct page *can_gather_numa_stats_pmd(pmd_t pmd,
 		return NULL;
 
 	page = vm_normal_page_pmd(vma, addr, pmd);
-	if (!page)
+	if (!page || is_device_dax_page(page))
 		return NULL;

I am not immediately seeing a reason to block pagemap_read() from
interrogating dax-backed virtual mappings. I think these protections can
be dropped.

 
 	if (PageReserved(page))

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index b37c0d8..e16053c 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c

@@ -667,7 +667,7 @@ static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 {
 	struct page *page = vm_normal_page(vma, addr, ptent);
 
-	if (!page)
+	if (!page || is_device_dax_page(page))
 		return NULL;
 	if (PageAnon(page)) {
 		if (!(mc.flags & MOVE_ANON))

I think this is better handled with something like the following, which
disables all memcg accounting for ZONE_DEVICE pages:

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index b37c0d870816..cfc43e8c59fe 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c

@@ -940,8 +940,7 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
                 */
                if (folio_memcg(folio) == mc.from) {
                        ret = MC_TARGET_PAGE;
-                       if (folio_is_device_private(folio) ||
-                           folio_is_device_coherent(folio))
+                       if (folio_is_device(folio))
                                ret = MC_TARGET_DEVICE;
                        if (target)
                                target->folio = folio;
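
The effect of widening that check can be sketched in userspace as follows. Everything here is hypothetical: `enum mock_kind` is only loosely modelled on the kernel's device memory types, and `classify_before()` / `classify_after()` stand in for the relevant branch of get_mctgt_type(), not the function itself.

```c
#include <stdbool.h>

/* Hypothetical folio kinds; MOCK_SYSTEM_RAM stands for ordinary memory. */
enum mock_kind {
	MOCK_SYSTEM_RAM,
	MOCK_DEVICE_PRIVATE,
	MOCK_DEVICE_COHERENT,
	MOCK_FS_DAX,
};

/* Hypothetical counterparts of MC_TARGET_PAGE and MC_TARGET_DEVICE. */
enum mock_target {
	MOCK_TARGET_PAGE,
	MOCK_TARGET_DEVICE,
};

/* Before: only private and coherent device memory are device targets. */
enum mock_target classify_before(enum mock_kind kind)
{
	if (kind == MOCK_DEVICE_PRIVATE || kind == MOCK_DEVICE_COHERENT)
		return MOCK_TARGET_DEVICE;
	return MOCK_TARGET_PAGE;
}

/* After: a single "is this any device folio" test covers every kind. */
enum mock_target classify_after(enum mock_kind kind)
{
	if (kind != MOCK_SYSTEM_RAM)
		return MOCK_TARGET_DEVICE;
	return MOCK_TARGET_PAGE;
}
```

In this model only the fsdax case changes classification; system RAM, device-private and device-coherent folios are treated as before.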
