Re: [PATCH v3 04/14] mm/memremap: add ZONE_DEVICE support for compound pages
From: Dan Williams <hidden>
Date: 2021-07-15 19:48:20
Also in:
linux-doc, nvdimm
On Thu, Jul 15, 2021 at 5:52 AM Joao Martins [off-list ref] wrote:
On 7/15/21 2:08 AM, Dan Williams wrote:quoted
On Wed, Jul 14, 2021 at 12:36 PM Joao Martins [off-list ref] wrote:quoted
Add a new align property for struct dev_pagemap which specifies that as/align/@geometry/Yeap, updated.quoted
quoted
pagemap is composed of a set of compound pages of size @align,s/@align/@geometry/Yeap, updated.quoted
quoted
instead of base pages. When a compound page geometry is requested, all but the first page are initialised as tail pages instead of order-0 pages. For certain ZONE_DEVICE users like device-dax which have a fixed page size, this creates an opportunity to optimize GUP and GUP-fast walkers, treating it the same way as THP or hugetlb pages. Signed-off-by: Joao Martins <redacted> --- include/linux/memremap.h | 17 +++++++++++++++++ mm/memremap.c | 8 ++++++-- mm/page_alloc.c | 34 +++++++++++++++++++++++++++++++++- 3 files changed, 56 insertions(+), 3 deletions(-)diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 119f130ef8f1..e5ab6d4525c1 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h@@ -99,6 +99,10 @@ struct dev_pagemap_ops { * @done: completion for @internal_ref * @type: memory type: see MEMORY_* in memory_hotplug.h * @flags: PGMAP_* flags to specify defailed behavior + * @geometry: structural definition of how the vmemmap metadata is populated. + * A zero or PAGE_SIZE defaults to using base pages as the memmap metadata + * representation. A bigger value but also multiple of PAGE_SIZE will set + * up compound struct pages representative of the requested geometry size. * @ops: method table * @owner: an opaque pointer identifying the entity that manages this * instance. Used by various helpers to make sure that no@@ -114,6 +118,7 @@ struct dev_pagemap { struct completion done; enum memory_type type; unsigned int flags; + unsigned long geometry; const struct dev_pagemap_ops *ops; void *owner; int nr_range;@@ -130,6 +135,18 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap) return NULL; } +static inline unsigned long pgmap_geometry(struct dev_pagemap *pgmap) +{ + if (!pgmap || !pgmap->geometry) + return PAGE_SIZE; + return pgmap->geometry; +} + +static inline unsigned long pgmap_pfn_geometry(struct dev_pagemap *pgmap) +{ + return PHYS_PFN(pgmap_geometry(pgmap)); +}Are both needed? Maybe just have ->geometry natively be in nr_pages units directly, because pgmap_pfn_geometry() makes it confusing whether it's a geometry of the pfn or the geometry of the pgmap.I use pgmap_geometry() largelly when we manipulate memmap in sparse-vmemmap code, as we deal with addresses/offsets/subsection-size. While using pgmap_pfn_geometry for code that deals with PFN initialization. For this patch I could remove the confusion. And actually maybe I can just store the pgmap_geometry() value in bytes locally in vmemmap_populate_compound_pages() and we can remove this extra helper.quoted
quoted
+ #ifdef CONFIG_ZONE_DEVICE bool pfn_zone_device_reserved(unsigned long pfn); void *memremap_pages(struct dev_pagemap *pgmap, int nid);diff --git a/mm/memremap.c b/mm/memremap.c index 805d761740c4..ffcb924eb6a5 100644 --- a/mm/memremap.c +++ b/mm/memremap.c@@ -318,8 +318,12 @@ static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params, memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE], PHYS_PFN(range->start), PHYS_PFN(range_len(range)), pgmap); - percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) - - pfn_first(pgmap, range_id)); + if (pgmap_geometry(pgmap) > PAGE_SIZE)This would become if (pgmap_geometry(pgmap) > 1)quoted
+ percpu_ref_get_many(pgmap->ref, (pfn_end(pgmap, range_id) + - pfn_first(pgmap, range_id)) / pgmap_pfn_geometry(pgmap));...and this would be pgmap_geometry()quoted
+ else + percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id) + - pfn_first(pgmap, range_id)); return 0;Let me adjust accordingly.quoted
quoted
err_add_memory:diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 79f3b38afeca..188cb5f8c308 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c@@ -6597,6 +6597,31 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn, } } +static void __ref memmap_init_compound(struct page *page, unsigned long pfn,I'd feel better if @page was renamed @head... more below:Oh yeah -- definitely more readable.quoted
quoted
+ unsigned long zone_idx, int nid, + struct dev_pagemap *pgmap, + unsigned long nr_pages) +{ + unsigned int order_align = order_base_2(nr_pages); + unsigned long i; + + __SetPageHead(page); + + for (i = 1; i < nr_pages; i++) {The switch of loop styles is jarring. I.e. the switch from memmap_init_zone_device() that is using pfn, end_pfn, and a local 'struct page *' variable to this helper using pfn + i and a mix of helpers (__init_zone_device_page, prep_compound_tail) that have different expectations of head page + tail_idx and current page. I.e. this reads more obviously correct to me, but maybe I'm just in the wrong headspace: for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) { struct page *page = pfn_to_page(pfn); __init_zone_device_page(page, pfn, zone_idx, nid, pgmap); prep_compound_tail(head, pfn - head_pfn);Personally -- and I am dubious given I have been staring at this code -- I find that what I wrote a little better as it follows more what compound page initialization does. Like it's easier for me to read that I am initializing a number of tail pages and a head page (for a known geometry size). Additionally, it's unnecessary (and a tiny ineficient?) to keep doing pfn_to_page(pfn) provided ZONE_DEVICE requires SPARSEMEM_VMEMMAP and so your page pointers are all contiguous and so for any given PFN we can avoid having deref vmemmap vaddrs back and forth. Which is the second reason I pass a page, and iterate over its tails based on a head page pointer. But I was at too minds when writing this, so if the there's no added inefficiency I can rewrite like the above.
I mainly just don't want 2 different styles between memmap_init_zone_device() and this helper. So if the argument is that "it's inefficient to use pfn_to_page() here" then why does the caller use pfn_to_page()? I won't argue too much for one way or the other, I'm still biased towards my rewrite, but whatever you pick just make the style consistent.
quoted
quoted
+ __init_zone_device_page(page + i, pfn + i, zone_idx, + nid, pgmap); + prep_compound_tail(page, i); + + /* + * The first and second tail pages need to + * initialized first, hence the head page is + * prepared last.I'd change this comment to say why rather than restate what can be gleaned from the code. It's actually not clear to me why this order is necessary.So the first tail page stores mapcount_ptr and compound order, and the second tail page stores pincount_ptr. prep_compound_head() does this: set_compound_order(page, order); atomic_set(compound_mapcount_ptr(page), -1); if (hpage_pincount_available(page)) atomic_set(compound_pincount_ptr(page), 0); So we need those tail pages initialized first prior to initializing the head. I can expand the comment above to make it clear why we need first and second tail pages.
Thanks!
quoted
quoted
+ */ + if (i == 2) + prep_compound_head(page, order_align); + } +} + void __ref memmap_init_zone_device(struct zone *zone, unsigned long start_pfn, unsigned long nr_pages,@@ -6605,6 +6630,7 @@ void __ref memmap_init_zone_device(struct zone *zone, unsigned long pfn, end_pfn = start_pfn + nr_pages; struct pglist_data *pgdat = zone->zone_pgdat; struct vmem_altmap *altmap = pgmap_altmap(pgmap); + unsigned int pfns_per_compound = pgmap_pfn_geometry(pgmap); unsigned long zone_idx = zone_idx(zone); unsigned long start = jiffies; int nid = pgdat->node_id;@@ -6622,10 +6648,16 @@ void __ref memmap_init_zone_device(struct zone *zone, nr_pages = end_pfn - start_pfn; } - for (pfn = start_pfn; pfn < end_pfn; pfn++) { + for (pfn = start_pfn; pfn < end_pfn; pfn += pfns_per_compound) { struct page *page = pfn_to_page(pfn); __init_zone_device_page(page, pfn, zone_idx, nid, pgmap); + + if (pfns_per_compound == 1) + continue; + + memmap_init_compound(page, pfn, zone_idx, nid, pgmap, + pfns_per_compound);I otherwise don't see anything broken with this patch, so feel free to include: Reviewed-by: Dan Williams <redacted> ...on the resend with the fixups.Thanks. I will wait whether you still want to retain the tag provided the implied changes fixing the failure you reported.
Yeah, tag is still valid.