Thread: 91 messages, 11 authors, 2025-11-26

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

From: Nico Pache <npache@redhat.com>
Date: 2025-10-29 21:10:52
Also in: linux-doc, linux-mm, lkml

On Wed, Oct 29, 2025 at 12:42 PM Lorenzo Stoakes wrote:
On Wed, Oct 29, 2025 at 04:04:06PM +0100, David Hildenbrand wrote:
No creep, because you'll always collapse.
OK so in the 511 scenario, do we simply immediately collapse to the largest
possible _mTHP_ page size if based on adjacent none/zero page entries in the
PTE, and _never_ collapse to PMD on this basis even if we do have sufficient
none/zero PTE entries to do so?
Right. And if we fail to allocate a PMD, we would collapse to smaller sizes,
and later, once a PMD is possible, collapse to a PMD.

But there is no creep, as we would have collapsed a PMD right from the start
either way.
Hmm, would this mean at 511 mTHP collapse _across zero entries_ would only
ever collapse to PMD, except in cases where, for instance, PTE entries
belong to distinct VMAs and so you have to collapse to mTHP as a result?
There are a few failure cases, like exceeding thresholds or
allocation failures, but yes, your assessment is correct.

At 511, the PMD collapse will be satisfied by a single PTE. If the
collapse fails we will try both halves of the PMD (1024kB, 1024kB).
The one that contains the non-none PTE will collapse.

This is where the (HPAGE_PMD_ORDER - order) shift comes from.
Imagine the 511 case above:
511 >> (HPAGE_PMD_ORDER - 9) == 511 >> 0 == 511 max_ptes_none
511 >> (HPAGE_PMD_ORDER - 8) == 511 >> 1 == 255 max_ptes_none (1024kB)

Both of these equal the order's size in PTEs minus 1.
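
The shift above can be tried out with a small sketch. It assumes 4 KiB base pages with HPAGE_PMD_ORDER == 9, and `scaled_max_ptes_none` is a hypothetical name used only for illustration; it mirrors the arithmetic in this message and is not the helper from the patch.

```python
# Assumption: 4 KiB base pages and a 2 MiB PMD, so HPAGE_PMD_ORDER == 9.
HPAGE_PMD_ORDER = 9


def scaled_max_ptes_none(max_ptes_none: int, order: int) -> int:
    """Scale the PMD-level max_ptes_none down to a smaller collapse order.

    Hypothetical helper: it only reproduces the right shift by
    (HPAGE_PMD_ORDER - order) described in the message.
    """
    if not 0 <= order <= HPAGE_PMD_ORDER:
        raise ValueError("order must be between 0 and HPAGE_PMD_ORDER")
    return max_ptes_none >> (HPAGE_PMD_ORDER - order)


for order in (9, 8, 7):
    limit = scaled_max_ptes_none(511, order)
    # At 511 the scaled limit is one less than the PTE count of that order.
    print(f"order {order}: limit {limit}, PTEs in order minus 1: {(1 << order) - 1}")
```

With max_ptes_none = 511 the two printed numbers match for every order, which is the "size minus 1" observation.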
Or IOW 'always collapse to the largest size you can I don't care if it
takes up more memory'

And at 0, we'd never collapse anything across zero entries, and only when
adjacent present entries can be collapsed to mTHP/PMD do we do so?
Yep!

max_ptes_none = 0 + all mTHP sizes enabled gives you a really good
distribution of mTHP sizes in the system, as zero memory will be
wasted and the most optimal size (space-wise) will be found, at least
for the memory allocated through khugepaged. The Defer patchset I had
on top of this series was exactly for that purpose: allow khugepaged
to determine all the THP usage in the system (other than madvise), and
allow granular control of memory waste.
And only collapse to PMD size if we have sufficient adjacent PTE entries that
are populated?

Let's really nail this down actually so we can be super clear what the issue is
here.
I hope what I wrote above made sense.
Asking some q's still, probably more a me thing :)
Creep only happens if you wouldn't collapse a PMD without prior mTHP
collapse, but suddenly would in the same scenario simply because you had
prior mTHP collapse.

At least that's my understanding.
OK, that makes sense, is the logic (this may be part of the bit I haven't
reviewed yet tbh) then that for khugepaged mTHP we have the system where we
always require prior mTHP collapse _first_?
So I would describe creep as

"we would not collapse a PMD THP because max_ptes_none is violated, but
because we collapsed smaller mTHP THPs before, we essentially suddenly have
more PTEs that are not none-or-zero, making us suddenly collapse a PMD THP
at the same place".
Yeah that makes sense.
Assume the following: max_ptes_none = 256

This means we would only collapse if at most half (256/512) of the PTEs are
none-or-zero.

But imagine the (simplified) PTE layout with PMD = 8 entries to simplify:

[ P Z P Z P Z Z Z ]

3 Present vs. 5 Zero -> do not collapse a PMD (8)
OK I'm thinking this is more about /ratio/ than anything else.

PMD - <=50% - ok 5/8 = 62.5% no collapse.
                < 50%*.

At 50% it's 256 which is actually the worst case scenario. But I read
further, and it seems like you grasped the issue.
But assume we collapse smaller mTHP (2 entries) first

[ P P P P P P Z Z ]
...512 KB mTHP (2 entries) - <= 50% means we can do...
We collapsed 3x "P Z" into "P P" because the ratio allowed for it.
Yes so that's:

[ P Z P Z P Z Z Z ]

->

[ P P P P P P Z Z ]

Right?
Suddenly we have

6 Present vs 2 Zero and we collapse a PMD (8)

[ P P P P P P P P ]

That's the "creep" problem.
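
The simplified layout above can be replayed in a few lines. This is a toy model under stated assumptions, not khugepaged's scan logic: the PMD is 8 entries (order 3), the 50% setting becomes 4 in place of 256, the per-order limit uses the same right shift discussed earlier in the thread, and `collapse_pass` and `limit_for` are hypothetical names.

```python
PMD_ORDER = 3        # toy PMD of 8 PTEs, as in the example
MAX_PTES_NONE = 4    # toy analogue of 256: at most half may be none/zero


def limit_for(order, max_ptes_none=MAX_PTES_NONE):
    """Per-order none/zero limit, scaled by the (PMD_ORDER - order) shift."""
    return max_ptes_none >> (PMD_ORDER - order)


def collapse_pass(ptes, order, max_none):
    """Collapse each aligned 2**order chunk that has a present entry and
    no more than max_none none/zero entries ('Z')."""
    size = 1 << order
    out = list(ptes)
    for start in range(0, len(out), size):
        chunk = out[start:start + size]
        if "P" in chunk and chunk.count("Z") <= max_none:
            out[start:start + size] = ["P"] * size
    return out


layout = list("PZPZPZZZ")

# PMD attempt on the original layout: 5 zero entries exceed the limit of 4.
pmd_first = collapse_pass(layout, PMD_ORDER, limit_for(PMD_ORDER))
# 2-entry mTHP pass: each "P Z" pair has 1 zero entry, within the limit of 1.
after_mthp = collapse_pass(layout, 1, limit_for(1))
# PMD attempt afterwards: only 2 zero entries remain, so it now succeeds.
pmd_later = collapse_pass(after_mthp, PMD_ORDER, limit_for(PMD_ORDER))

print("PMD first :", " ".join(pmd_first))
print("after mTHP:", " ".join(after_mthp))
print("PMD later :", " ".join(pmd_later))
```

The first PMD attempt leaves the layout unchanged, while the same attempt after the 2-entry pass fills all 8 entries.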
I guess we try PMD collapse first then mTHP, but the worry is another pass
will collapse to PMD right?


Whereas < 50% ratio means we never end up 'propagating' or 'creeping' like
this because each collapse never provides enough reduction in zero entries
to allow for higher order collapse.

Hence the idea of capping at 255
Yep! We've discussed other solutions, like tracking collapsed pages,
or the solutions brought up by David. But this seemed like the most
logical to me, as it keeps some of the tunability. I now understand
the concern wasn't so much the capping, but rather the silent nature of
it, and the uAPI expectations surrounding enforcing such a limit (for
both past and future behavioral expectations).
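
To compare the capped and uncapped settings on David's layout, here is a toy sketch under the same assumptions as the example (PMD = 8 entries, so 4 stands in for 256 and 3 for the 255 cap). `collapse_pass` and `settle` are hypothetical names, and trying the largest order first on each pass is an assumption for illustration. It checks only this one layout and is not a proof about all layouts.

```python
PMD_ORDER = 3  # toy PMD of 8 PTEs


def collapse_pass(ptes, order, max_none):
    """Collapse each aligned 2**order chunk that has a present entry and
    no more than max_none none/zero entries ('Z')."""
    size = 1 << order
    out = list(ptes)
    for start in range(0, len(out), size):
        chunk = out[start:start + size]
        if "P" in chunk and chunk.count("Z") <= max_none:
            out[start:start + size] = ["P"] * size
    return out


def settle(ptes, max_ptes_none):
    """Repeat passes, largest order first, until nothing changes."""
    layout = list(ptes)
    changed = True
    while changed:
        changed = False
        for order in range(PMD_ORDER, 0, -1):
            limit = max_ptes_none >> (PMD_ORDER - order)
            new = collapse_pass(layout, order, limit)
            if new != layout:
                # Something collapsed: restart from the largest order.
                layout, changed = new, True
                break
    return "".join(layout)


print("limit 4 (50%):   ", settle("PZPZPZZZ", 4))
print("limit 3 (capped):", settle("PZPZPZZZ", 3))
```

In this model the 50% setting ends with all 8 entries present, while the capped setting leaves the layout as it started.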
max_ptes_none == 0 -> collapse mTHP only if all non-none/zero

And for the intermediate values

(1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
supported yet with other values
It feels a bit much to issue a kernel warning every time somebody twiddles that
value, and it's kind of against user expectation a bit.
pr_warn_once() is what I meant.
Right, but even then it feels a bit extreme, warnings are pretty serious
things. Then again there's precedent for this, and it may be the least bad
solution.

I just picture a cloud provider turning this on with mTHP then getting their
monitoring team reporting some urgent communication about warnings in dmesg :)
I mean, one could make the states mutually exclusive, maybe?

Disallow enabling mTHP with max_ptes_none set to unsupported values and the
other way around.

That would probably be cleanest, although the implementation might get a bit
more involved (but it's solvable).

But the concern could be that there are configs that could suddenly break:
someone that set max_ptes_none and enabled mTHP.
Yeah we could always return an error on setting to an unsupported value.

I mean pr_warn() is nasty but maybe necessary.
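
As a sketch of the "return an error" option only: the function below is hypothetical and not from the series. It assumes HPAGE_PMD_NR == 512, that only 0 and 511 would be supported while mTHP collapse is enabled, and that rejection is reported as -EINVAL in the kernel's return style.

```python
import errno

HPAGE_PMD_NR = 512  # assumption: 4 KiB base pages, 2 MiB PMD
SUPPORTED_WITH_MTHP = (0, HPAGE_PMD_NR - 1)  # assumption: only 0 and 511


def check_max_ptes_none(value: int, mthp_enabled: bool) -> int:
    """Return 0 if the value would be accepted, else -EINVAL.

    Hypothetical validation, for illustration of the mutually exclusive
    approach discussed in this thread.
    """
    if not 0 <= value <= HPAGE_PMD_NR - 1:
        return -errno.EINVAL
    if mthp_enabled and value not in SUPPORTED_WITH_MTHP:
        return -errno.EINVAL
    return 0


for value in (0, 256, 511):
    print(value, check_max_ptes_none(value, mthp_enabled=True))
```

The same check would have to run in the other direction as well (when enabling mTHP with an intermediate value already set), which is where the existing-config breakage concern comes in.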

I'll note that we could also consider only supporting "max_ptes_none = 511"
(default) to start with.

The nice thing about that value is that it is fully supported with the
underused shrinker, because max_ptes_none=511 -> never shrink.
It feels like = 0 would be useful though?
I personally think the default of 511 is wrong and should be on the
lower end of the scale. The exception being thp=always, where I
believe the kernel should treat it as 511.

But the second part of that would also violate the user's max_ptes_none
setting, so it's probably much harder in practice, and also not really
part of this series, just my opinion.

Cheers.
-- Nico
--
Cheers

David / dhildenb
Thanks, Lorenzo
  