Re: [PATCH v6 1/5] mm/zone_device: Reinitialize large zone device private folios

From: Alistair Popple <apopple@nvidia.com>
Date: 2026-01-19 06:00:05
Also in: amd-gfx, dri-devel, intel-xe, kvm, linux-cxl, linux-mm, lkml, nouveau

On 2026-01-17 at 16:27 +1100, Matthew Brost [off-list ref] wrote...

On Sat, Jan 17, 2026 at 03:42:16PM +1100, Balbir Singh wrote:

quoted

On 1/17/26 14:55, Matthew Brost wrote:

quoted

On Fri, Jan 16, 2026 at 08:51:14PM -0400, Jason Gunthorpe wrote:

quoted

On Fri, Jan 16, 2026 at 12:31:25PM -0800, Matthew Brost wrote:

quoted

I suppose we could be getting say an order-9 folio that was previously used
as two order-8 folios? And each of them had their _nr_pages in their head

Yes, this is a good example. At this point we have idea what previous
allocation(s) order(s) were - we could have multiple places in the loop
where _nr_pages is populated, thus we have to clear this everywhere.

Why? The fact you have to use such a crazy expression to even access
_nr_pages strongly says nothing will read it as _nr_pages.

Explain each thing:

		new_page->flags.f &= ~0xffUL;	/* Clear possible order, page head */

OK, the tail page flags need to be set right, and prep_compound_page()
called later depends on them being zero.

		((struct folio *)(new_page - 1))->_nr_pages = 0;

Can't see a reason, nothing reads _nr_pages from a random tail
page. _nr_pages is the last 8 bytes of struct page so it overlaps
memcg_data, which is also not supposed to be read from a tail page?

This is (or was) either a order-0 page, a head page or a tail page, who
knows. So it doesn't really matter whether or not _nr_pages or memcg_data are
supposed to be read from a tail page or not. What really matters is does any of
vm_insert_page(), migrate_vma_*() or prep_compound_page() expect this to be a
particular value when called on this page?

AFAIK memcg_data is at least expected to be NULL for migrate_vma_*() when called
on an order-0 page, which means it has to be cleared.

Although I think it would be far less confusing if it was just written like that
rather than the folio math but it achieves the same thing and is technically
correct.

quoted

		new_folio->mapping = NULL;

Pointless, prep_compound_page() -> prep_compound_tail() -> p->mapping = TAIL_MAPPING;

Not pointless - vm_insert_page() for example expects folio_test_anon() which
which won't be the case if p->mapping was previously set to TAIL_MAPPING so it
needs to be cleared. migrate_vma_setup() has a similar issue.

quoted

		new_folio->pgmap = pgmap;	/* Also clear compound head */

Pointless, compound_head is set in prep_compound_tail(): set_compound_head(p, head);

No it isn't - we're not clearing tail pages here, we're initialising ZONE_DEVICE
struct pages ready for use by the core-mm which means the pgmap needs to be
correct.

quoted

		new_folio->share = 0;   /* fsdax only, unused for device private */

Not sure, certainly share isn't read from a tail page..

Yeah, not useful for now because FS DAX isn't using this function. Arguably it
should though.

quoted
quoted
quoted
quoted
quoted
quoted
Why can't this use the normal helpers, like memmap_init_compound()?

Because that's not what this function is trying to do - eg. we might not be
trying to create a compound page. Although something like
memmap_init_zone_device() looks like it would be a good starting point, with the
page order being a parameter instead of read from the pgmap.

quoted

 struct folio *new_folio = page

 /* First 4 tail pages are part of struct folio */
 for (i = 4; i < (1UL << order); i++) {
     prep_compound_tail(..)
 }

 prep_comound_head(page, order)
 new_folio->_nr_pages = 0

??

I've beat this to death with Alistair, normal helpers do not work here.

What do you mean? It already calls prep_compound_page()! The issue
seems to be that prep_compound_page() makes assumptions about what
values are in flags already?

So how about move that page flags mask logic into
prep_compound_tail()? I think that would help Vlastimil's
concern. That function is already touching most of the cache line so
an extra word shouldn't make a performance difference.

quoted

An order zero allocation could have _nr_pages set in its page,
new_folio->_nr_pages is page + 1 memory.

An order zero allocation does not have _nr_pages because it is in page
+1 memory that doesn't exist.

An order zero allocation might have memcg_data in the same slot, does
it need zeroing? If so why not add that to prep_compound_head() ?

Also, prep_compound_head() handles order 0 too:

	if (IS_ENABLED(CONFIG_64BIT) || order > 1) {
		atomic_set(&folio->_pincount, 0);
		atomic_set(&folio->_entire_mapcount, -1);
	}
	if (order > 1)
		INIT_LIST_HEAD(&folio->_deferred_list);

So some of the problem here looks to be not calling it:

	if (order)
		prep_compound_page(page, order);

So, remove that if ? Also shouldn't it be moved above the
set_page_count/lock_page ?

I'm not addressing each comment, some might be valid, others are not.

Hopefully some of my explainations above help.

quoted

Ok, can I rework this in a follow-up - I will commit to that? Anything
we touch here is extremely sensitive to failures - Intel is the primary
test vector for any modification to device pages for what I can tell.

The fact is that large device pages do not really work without this
patch, or prior revs. I’ve spent a lot of time getting large device
pages stable — both here and in the initial series, commiting to help in
follow on series touch SVM related things.

Matthew, I feel your frustration and appreciate your help.
For the current state of 6.19, your changes work for me, I added a
Reviewed-by to the patch. It affects a small number of drivers and makes
them work for zone device folios. I am happy to maintain the changes
sent out as a part of zone_device_page_init()

No problem with the above, and FWIW it seems correct. Although I suspect just
setting page->memcg_data = 0 would have been far less controversial ;)

+1

quoted

We can rework the details in a follow up series, there are many ideas
and ways of doing this (Jason, Alistair, Zi have good ideas as well).

I agree we can rework this in a follow-up — the core MM is hard, and for
valid reasons, but we can all work together on cleaning it up.

Matt

quoted

I’m going to miss my merge window with this (RB’d) patch blocked for
large device pages. Expect my commitment to helping other vendors to
drop if this happens. I’ll maybe just say: that doesn’t work in my CI,
try again.

Or perhaps we just revert large device pages in 6.19 if we can't get a
consensus here as we shouldn't ship a non-functional kernel.

Matt

quoted

Jason

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help