Thread (89 messages) 89 messages, 12 authors, 2025-09-11

Re: [PATCH v10 00/13] khugepaged: mTHP support

From: Liam R. Howlett <hidden>
Date: 2025-08-21 16:39:25
Also in: linux-doc, linux-mm, lkml

* Dev Jain [off-list ref] [250821 11:14]:
On 21/08/25 8:31 pm, Lorenzo Stoakes wrote:
quoted
OK so I noticed in patch 13/13 (!) where you change the documentation that you
essentially state that the whole method used to determine the ratio of PTEs to
collapse to mTHP is broken:

	khugepaged uses max_ptes_none scaled to the order of the enabled
	mTHP size to determine collapses. When using mTHPs it's recommended
	to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255
	on 4k page size). This will prevent undesired "creep" behavior that
	leads to continuously collapsing to the largest mTHP size; when we
	collapse, we are bringing in new non-zero pages that will, on a
	subsequent scan, cause the max_ptes_none check of the +1 order to
	always be satisfied. By limiting this to less than half the current
	order, we make sure we don't cause this feedback
	loop. max_ptes_shared and max_ptes_swap have no effect when
	collapsing to a mTHP, and mTHP collapse will fail on shared or
	swapped out pages.

This seems to me to suggest that using
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means
of establishing a 'ratio' to do this calculation is fundamentally flawed.

So surely we ought to introduce a new sysfs tunable for this? Perhaps

/sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio

Or something like this?

It's already questionable that we are taking a value that is expressed
essentially in terms of PTE entries per PMD and then use it implicitly to
determine the ratio for mTHP, but to then say 'oh but the default value is
known-broken' is just a blocker for the series in my opinion.

This really has to be done a different way I think.

Cheers, Lorenzo
FWIW this was my version of the documentation patch:
https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/ (local)

The discussion about the creep problem started here:
https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/ (local)

and the discussion continuing here:
https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/ (local)

ending with a summary I gave here:
https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/ (local)

This should help you with the context.
Thanks for hunting this down, the context should be referenced in the
change log so we can find it easier in the future (and now).  Or at
least in the cover letter.

The way the change log in the cover letter is written makes it
exceedingly long.  Could you switch to listing the changes from v9 and
links to v1-8 (+RFCs if there are any)?  Well, I guess include v10
changes and v1-9 urls..

At the length it is now, it's most likely a tl;dr for most.  If you're
starting to review this at v10, then you'd probably appreciate not
rehashing discussions and if you're going from v9 then you already have
an idea of what v10 should have changed.

Said another way, the changelog is more useful with context and context
is difficult to find without a lore link.

I am having issues tracking down the contexts of many items of what has
been generated here and it'll only get worse as time moves on.  We do
our best to keep change logs with the necessary details, but having
bread crumbs to follow is extremely helpful for review and in the long
run.

Thanks,
Liam
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help