Thread (78 messages) 78 messages, 7 authors, 2021-08-25

Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages

From: Dan Williams <hidden>
Date: 2021-07-15 19:48:20
Also in: linux-doc, nvdimm

On Thu, Jul 15, 2021 at 5:52 AM Joao Martins [off-list ref] wrote:


On 7/15/21 2:08 AM, Dan Williams wrote:
quoted
On Wed, Jul 14, 2021 at 12:36 PM Joao Martins [off-list ref] wrote:
quoted
Add a new align property for struct dev_pagemap which specifies that a
s/align/@geometry/
Yeap, updated.
quoted
quoted
pagemap is composed of a set of compound pages of size @align,
s/@align/@geometry/
Yeap, updated.
quoted
quoted
instead of
base pages. When a compound page geometry is requested, all but the first
page are initialised as tail pages instead of order-0 pages.

For certain ZONE_DEVICE users like device-dax which have a fixed page size,
this creates an opportunity to optimize GUP and GUP-fast walkers, treating
it the same way as THP or hugetlb pages.

Signed-off-by: Joao Martins <redacted>
---
 include/linux/memremap.h | 17 +++++++++++++++++
 mm/memremap.c            |  8 ++++++--
 mm/page_alloc.c          | 34 +++++++++++++++++++++++++++++++++-
 3 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 119f130ef8f1..e5ab6d4525c1 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -99,6 +99,10 @@ struct dev_pagemap_ops {
  * @done: completion for @internal_ref
  * @type: memory type: see MEMORY_* in memory_hotplug.h
  * @flags: PGMAP_* flags to specify defailed behavior
+ * @geometry: structural definition of how the vmemmap metadata is populated.
+ *     A zero or PAGE_SIZE defaults to using base pages as the memmap metadata
+ *     representation. A bigger value but also multiple of PAGE_SIZE will set
+ *     up compound struct pages representative of the requested geometry size.
  * @ops: method table
  * @owner: an opaque pointer identifying the entity that manages this
  *     instance.  Used by various helpers to make sure that no
@@ -114,6 +118,7 @@ struct dev_pagemap {
        struct completion done;
        enum memory_type type;
        unsigned int flags;
+       unsigned long geometry;
        const struct dev_pagemap_ops *ops;
        void *owner;
        int nr_range;
@@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
        return NULL;
 }

+static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap)
+{
+       if (!pgmap || !pgmap->geometry)
+               return PAGE_SIZE;
+       return pgmap->geometry;
+}
+
+static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap)
+{
+       return PHYS_PFN(pgmap_geometry(pgmap));
+}
Are both needed? Maybe just have ->geometry natively be in nr_pages
units directly, because pgmap_pfn_geometry() makes it confusing
whether it's a geometry of the pfn or the geometry of the pgmap.
I use pgmap_geometry() largelly when we manipulate memmap in sparse-vmemmap code, as we
deal with addresses/offsets/subsection-size. While using pgmap_pfn_geometry for code that
deals with PFN initialization. For this patch I could remove the confusion.

And actually maybe I can just store the pgmap_geometry() value in bytes locally in
vmemmap_populate_compound_pages() and we can remove this extra helper.
quoted
quoted
+
 #ifdef CONFIG_ZONE_DEVICE
 bool pfn_zone_device_reserved(unsigned long pfn);
 void *memremap_pages(struct dev_pagemap *pgmap, int nid);
diff --git a/mm/memremap.c b/mm/memremap.c
index 805d761740c4..ffcb924eb6a5 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
        memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
                                PHYS_PFN(range->start),
                                PHYS_PFN(range_len(range)), pgmap);
-       percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
-                       - pfn_first(pgmap, range_id));
+       if (pgmap_geometry(pgmap) > PAGE_SIZE)
This would become

if (pgmap_geometry(pgmap) > 1)
quoted
+               percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id)
+                       - pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));
...and this would be pgmap_geometry()
quoted
+       else
+               percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+                               - pfn_first(pgmap, range_id));
        return 0;
Let me adjust accordingly.
quoted
quoted
 err_add_memory:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 79f3b38afeca..188cb5f8c308 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6597,6 +6597,31 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
        }
 }

+static void __ref memmap_init_compound(struct page *page, unsigned long pfn,
I'd feel better if @page was renamed @head... more below:
Oh yeah -- definitely more readable.
quoted
quoted
+                                       unsigned long zone_idx, int nid,
+                                       struct dev_pagemap *pgmap,
+                                       unsigned long nr_pages)
+{
+       unsigned int order_align = order_base_2(nr_pages);
+       unsigned long i;
+
+       __SetPageHead(page);
+
+       for (i = 1; i < nr_pages; i++) {
The switch of loop styles is jarring. I.e. the switch from
memmap_init_zone_device() that is using pfn, end_pfn, and a local
'struct page *' variable to this helper using pfn + i and a mix of
helpers (__init_zone_device_page,  prep_compound_tail) that have
different expectations of head page + tail_idx and current page.

I.e. this reads more obviously correct to me, but maybe I'm just in
the wrong headspace:

        for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) {
                struct page *page = pfn_to_page(pfn);

                __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
                prep_compound_tail(head, pfn - head_pfn);
Personally -- and I am dubious given I have been staring at this code -- I find that what
I wrote a little better as it follows more what compound page initialization does. Like
it's easier for me to read that I am initializing a number of tail pages and a head page
(for a known geometry size).

Additionally, it's unnecessary (and a tiny ineficient?) to keep doing pfn_to_page(pfn)
provided ZONE_DEVICE requires SPARSEMEM_VMEMMAP and so your page pointers are all
contiguous and so for any given PFN we can avoid having deref vmemmap vaddrs back and
forth. Which is the second reason I pass a page, and iterate over its tails based on a
head page pointer. But I was at too minds when writing this, so if the there's no added
inefficiency I can rewrite like the above.
I mainly just don't want 2 different styles between
memmap_init_zone_device() and this helper. So if the argument is that
"it's inefficient to use pfn_to_page() here" then why does the caller
use pfn_to_page()? I won't argue too much for one way or the other,
I'm still biased towards my rewrite, but whatever you pick just make
the style consistent.
quoted
quoted
+               __init_zone_device_page(page + i, pfn + i, zone_idx,
+                                       nid, pgmap);
+               prep_compound_tail(page, i);
+
+               /*
+                * The first and second tail pages need to
+                * initialized first, hence the head page is
+                * prepared last.
I'd change this comment to say why rather than restate what can be
gleaned from the code. It's actually not clear to me why this order is
necessary.
So the first tail page stores mapcount_ptr and compound order, and the
second tail page stores pincount_ptr. prep_compound_head() does this:

        set_compound_order(page, order);
        atomic_set(compound_mapcount_ptr(page), -1);
        if (hpage_pincount_available(page))
                atomic_set(compound_pincount_ptr(page), 0);

So we need those tail pages initialized first prior to initializing the head.

I can expand the comment above to make it clear why we need first and second tail pages.
Thanks!
quoted
quoted
+                */
+               if (i == 2)
+                       prep_compound_head(page, order_align);
+       }
+}
+
 void __ref memmap_init_zone_device(struct zone *zone,
                                   unsigned long start_pfn,
                                   unsigned long nr_pages,
@@ -6605,6 +6630,7 @@ void __ref memmap_init_zone_device(struct zone *zone,
        unsigned long pfn, end_pfn = start_pfn + nr_pages;
        struct pglist_data *pgdat = zone->zone_pgdat;
        struct vmem_altmap *altmap = pgmap_altmap(pgmap);
+       unsigned int pfns_per_compound = pgmap_pfn_geometry(pgmap);
        unsigned long zone_idx = zone_idx(zone);
        unsigned long start = jiffies;
        int nid = pgdat->node_id;
@@ -6622,10 +6648,16 @@ void __ref memmap_init_zone_device(struct zone *zone,
                nr_pages = end_pfn - start_pfn;
        }

-       for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+       for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) {
                struct page *page = pfn_to_page(pfn);

                __init_zone_device_page(page, pfn, zone_idx, nid, pgmap);
+
+               if (pfns_per_compound == 1)
+                       continue;
+
+               memmap_init_compound(page, pfn, zone_idx, nid, pgmap,
+                                    pfns_per_compound);
I otherwise don't see anything broken with this patch, so feel free to include:

Reviewed-by: Dan Williams <redacted>

...on the resend with the fixups.
Thanks.

I will wait whether you still want to retain the tag provided the implied changes
fixing the failure you reported.
Yeah, tag is still valid.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help