Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios
From: Alistair Popple <apopple@nvidia.com>
Date: 2026-01-19 06:00:05
Also in:
amd-gfx, dri-devel, intel-xe, kvm, linux-cxl, linux-mm, lkml, nouveau
On 2026-01-17 at 16:27 +1100, Matthew Brost [off-list ref] wrote...
On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:quoted
On 1/17/26 14:55, Matthew Brost wrote:quoted
On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:quoted
On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:quoted
quoted
I suppose we could be getting say an order-9 folio that was previously used as two order-8 folios? And each of them had their _nr_pages in their headYes, this is a good example. At this point we have idea what previous allocation(s) order(s) were - we could have multiple places in the loop where _nr_pages is populated, thus we have to clear this everywhere.Why? The fact you have to use such a crazy expression to even access _nr_pages strongly says nothing will read it as _nr_pages. Explain each thing: new_page->flags.f &= ~0xffUL; /* Clear possible order, page head */ OK, the tail page flags need to be set right, and prep_compound_page() called later depends on them being zero. ((struct folio *)(new_page - 1))->_nr_pages = 0; Can't see a reason, nothing reads _nr_pages from a random tail page. _nr_pages is the last 8 bytes of struct page so it overlaps memcg_data, which is also not supposed to be read from a tail page?
This is (or was) either a order-0 page, a head page or a tail page, who knows. So it doesn't really matter whether or not _nr_pages or memcg_data are supposed to be read from a tail page or not. What really matters is does any of vm_insert_page(), migrate_vma_*() or prep_compound_page() expect this to be a particular value when called on this page? AFAIK memcg_data is at least expected to be NULL for migrate_vma_*() when called on an order-0 page, which means it has to be cleared. Although I think it would be far less confusing if it was just written like that rather than the folio math but it achieves the same thing and is technically correct.
quoted
quoted
quoted
new_folio->mapping = NULL; Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;
Not pointless - vm_insert_page() for example expects folio_test_anon() which which won't be the case if p->mapping was previously set to TAIL_MAPPING so it needs to be cleared. migrate_vma_setup() has a similar issue.
quoted
quoted
quoted
new_folio->pgmap = pgmap; /* Also clear compound head */ Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);
No it isn't - we're not clearing tail pages here, we're initialising ZONE_DEVICE struct pages ready for use by the core-mm which means the pgmap needs to be correct.
quoted
quoted
quoted
new_folio->share = 0; /* fsdax only, unused for device private */ Not sure, certainly share isn't read from a tail page..
Yeah, not useful for now because FS DAX isn't using this function. Arguably it should though.
quoted
quoted
quoted
quoted
quoted
quoted
Why can't this use the normal helpers, like memmap_init_compound()?
Because that's not what this function is trying to do - eg. we might not be trying to create a compound page. Although something like memmap_init_zone_device() looks like it would be a good starting point, with the page order being a parameter instead of read from the pgmap.
quoted
quoted
quoted
quoted
quoted
quoted
struct folio *new_folio = page /* First 4 tail pages are part of struct folio */ for (i = 4; i < (1UL << order); i++) { prep_compound_tail(..) } prep_comound_head(page, order) new_folio->_nr_pages = 0 ??I've beat this to death with Alistair, normal helpers do not work here.What do you mean? It already calls prep_compound_page()! The issue seems to be that prep_compound_page() makes assumptions about what values are in flags already? So how about move that page flags mask logic into prep_compound_tail()? I think that would help Vlastimil's concern. That function is already touching most of the cache line so an extra word shouldn't make a performance difference.quoted
An order zero allocation could have _nr_pages set in its page, new_folio->_nr_pages is page + 1 memory.An order zero allocation does not have _nr_pages because it is in page +1 memory that doesn't exist. An order zero allocation might have memcg_data in the same slot, does it need zeroing? If so why not add that to prep_compound_head() ? Also, prep_compound_head() handles order 0 too: if (IS_ENABLED(CONFIG_64BIT) || order > 1) { atomic_set(&folio->_pincount, 0); atomic_set(&folio->_entire_mapcount, -1); } if (order > 1) INIT_LIST_HEAD(&folio->_deferred_list); So some of the problem here looks to be not calling it: if (order) prep_compound_page(page, order); So, remove that if ? Also shouldn't it be moved above the set_page_count/lock_page ?I'm not addressing each comment, some might be valid, others are not.
Hopefully some of my explainations above help.
quoted
quoted
Ok, can I rework this in a follow-up - I will commit to that? Anything we touch here is extremely sensitive to failures - Intel is the primary test vector for any modification to device pages for what I can tell. The fact is that large device pages do not really work without this patch, or prior revs. I’ve spent a lot of time getting large device pages stable — both here and in the initial series, commiting to help in follow on series touch SVM related things.Matthew, I feel your frustration and appreciate your help. For the current state of 6.19, your changes work for me, I added a Reviewed-by to the patch. It affects a small number of drivers and makes them work for zone device folios. I am happy to maintain the changes sent out as a part of zone_device_page_init()
No problem with the above, and FWIW it seems correct. Although I suspect just setting page->memcg_data = 0 would have been far less controversial ;)
+1quoted
We can rework the details in a follow up series, there are many ideas and ways of doing this (Jason, Alistair, Zi have good ideas as well).I agree we can rework this in a follow-up — the core MM is hard, and for valid reasons, but we can all work together on cleaning it up. Mattquoted
quoted
I’m going to miss my merge window with this (RB’d) patch blocked for large device pages. Expect my commitment to helping other vendors to drop if this happens. I’ll maybe just say: that doesn’t work in my CI, try again. Or perhaps we just revert large device pages in 6.19 if we can't get a consensus here as we shouldn't ship a non-functional kernel. Mattquoted
Jason