Thread (78 messages) 78 messages, 7 authors, 2021-08-25

Re: [PATCH v3 12/14] device-dax: compound pagemap support

From: Joao Martins <hidden>
Date: 2021-07-15 12:01:14
Also in: linux-mm, nvdimm
Subsystem: memory hot(un)plug, memory management, memory management - page allocator, the rest · Maintainers: David Hildenbrand, Oscar Salvador, Andrew Morton, Vlastimil Babka, Linus Torvalds

On 7/15/21 12:36 AM, Dan Williams wrote:
On Wed, Jul 14, 2021 at 12:36 PM Joao Martins [off-list ref] wrote:
quoted
Use the newly added compound pagemap facility which maps the assigned dax
ranges as compound pages at a page size of @align. Currently, this means,
that region/namespace bootstrap would take considerably less, given that
you would initialize considerably less pages.

On setups with 128G NVDIMMs the initialization with DRAM stored struct
pages improves from ~268-358 ms to ~78-100 ms with 2M pages, and to less
than a 1msec with 1G pages.

dax devices are created with a fixed @align (huge page size) which is
enforced through as well at mmap() of the device. Faults, consequently
happen too at the specified @align specified at the creation, and those
don't change through out dax device lifetime. MCEs poisons a whole dax
huge page, as well as splits occurring at the configured page size.
Hi Joao,

With this patch I'm hitting the following with the 'device-dax' test [1].
Ugh, I can reproduce it too -- apologies for the oversight.

This patch is not the culprit, the flaw is early in the series, specifically the fourth patch.

It needs this chunk below change on the fourth patch due to the existing elevated page ref
count at zone device memmap init. put_page() called here in memunmap_pages():

for (i = 0; i < pgmap->nr_ranges; i++)
	for_each_device_pfn(pfn, pgmap, i)
		put_page(pfn_to_page(pfn));

... on a zone_device compound memmap would otherwise always decrease head page refcount by
@geometry pfn amount (leading to the aforementioned splat you reported).
diff --git a/mm/memremap.c b/mm/memremap.c
index b0e7b8cf3047..79a883af788e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -102,15 +102,15 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
        return (range->start + range_len(range)) >> PAGE_SHIFT;
 }

-static unsigned long pfn_next(unsigned long pfn)
+static unsigned long pfn_next(struct dev_pagemap *pgmap, unsigned long pfn)
 {
        if (pfn % 1024 == 0)
                cond_resched();
-       return pfn + 1;
+       return pfn + pgmap_pfn_geometry(pgmap);
 }

 #define for_each_device_pfn(pfn, map, i) \
-       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
+       for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(map, pfn))

 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {
It could also get this hunk below, but it is sort of redundant provided we won't touch
tail page refcount through out the devmap pages lifetime. This setting of tail pages
refcount to zero was in pre-v5.14 series, but it got removed under the assumption it comes
from the page allocator (where tail pages are already zeroed in refcount).
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 96975edac0a8..469a7aa5cf38 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6623,6 +6623,7 @@ static void __ref memmap_init_compound(struct page *page, unsigned
long pfn,
                __init_zone_device_page(page + i, pfn + i, zone_idx,
                                        nid, pgmap);
                prep_compound_tail(page, i);
+               set_page_count(page + i, 0);

                /*
                 * The first and second tail pages need to
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help