Thread: 79 messages, 8 authors, 2025-09-15

Re: [PATCH v11 00/15] khugepaged: mTHP support

From: Nico Pache <npache@redhat.com>
Date: 2025-09-13 00:29:24
Also in: linux-doc, linux-mm, lkml

On Fri, Sep 12, 2025 at 12:22 PM Lorenzo Stoakes
[off-list ref] wrote:
On Fri, Sep 12, 2025 at 07:53:22PM +0200, David Hildenbrand wrote:
quoted
On 12.09.25 17:51, Lorenzo Stoakes wrote:
quoted
With all this stuff said, do we have an actual plan for what we intend to do
_now_?
Oh no, no I have to use my brain and it's Friday evening.
I apologise :)
quoted
quoted
As Nico has implemented a basic solution here that we all seem to agree is not
what we want.

Without needing special new hardware or major reworks, what would this parameter
look like?

What would the heuristics be? What about the eagerness scales?

I'm but a simple kernel developer,
:)

and interested in simple pragmatic stuff :)
quoted
do you have a plan right now David?
Ehm, if you ask me that way ...
quoted
Maybe we can start with something simple like a rough percentage per eagerness
entry that then gets scaled based on utilisation?
... I think we should probably:

1) Start with something very simple for mTHP that doesn't lock us into any particular direction.
Yes.
quoted
2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well
Yes, I think we're all pretty much on board with that, it seems!
quoted
3) Improve that "eagerness" algorithm using a dynamic scale or #whatever
Right, I feel like we could start with some very simple linear thing here and
later maybe refine it?
I agree, something like 0,32,64,128,255,511 seems to map well, and is
not too different from what I'm doing with the scaling by
(HPAGE_PMD_ORDER - order).
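
To make the scaling concrete, here is a rough userspace sketch of what I
mean. It is not the code from the series: the helper name is made up, the
right shift is my shorthand for "scaling by (HPAGE_PMD_ORDER - order)", and
HPAGE_PMD_ORDER = 9 assumes 4K base pages with 2M PMDs.

```c
#include <assert.h>

/* Assumed values for 4K base pages and 2M PMDs; illustration only. */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR (1U << HPAGE_PMD_ORDER) /* 512 PTEs per PMD */

/*
 * Hypothetical helper: scale the PMD-sized max_ptes_none value down to a
 * smaller collapse order by shifting right by (HPAGE_PMD_ORDER - order).
 */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	assert(order <= HPAGE_PMD_ORDER);
	assert(max_ptes_none < HPAGE_PMD_NR);
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}
```

Under those assumptions, 255 at order 4 (16 PTEs) scales to 7 empty PTEs
allowed, and 64 scales to 2.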
quoted
4) Solve world peace and world hunger
Yes! That would be pretty great ;)
This should probably be a larger priority
quoted
5) Connect it all to memory pressure / reclaim / shrinker / heuristics / hw hotness / #whatever
I think these are TODOs :)
quoted

I maintain my initial position that just using

max_ptes_none == 511 -> collapse mTHP always
max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero

As a starting point is probably simple and best, and likely leaves room for any
changes later.
Yes.
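
For reference, the two-case rule above could be sketched roughly like this.
This is only an illustration under assumptions: the function name and the
nr_none parameter (the number of none/zero PTEs found in the candidate
range) are hypothetical, and 511 is HPAGE_PMD_NR - 1 on 4K base pages.

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PMD_NR 512U /* assumed: 4K base pages, 2M PMDs */

/*
 * Hypothetical check for the simple starting point:
 *   max_ptes_none == 511 -> always collapse to mTHP
 *   max_ptes_none != 511 -> collapse only if no PTE is none/zero
 */
static bool mthp_collapse_allowed(unsigned int max_ptes_none,
				  unsigned int nr_none)
{
	assert(max_ptes_none < HPAGE_PMD_NR);
	if (max_ptes_none == HPAGE_PMD_NR - 1)
		return true;
	return nr_none == 0;
}
```

Any value between 0 and 510 behaves the same for mTHP here, which is what
leaves room to change the semantics later.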
quoted

Of course, we could do what Nico is proposing here, as 1) and change it all later.
Right.

But that does mean for mTHP we're limited to 256 (or 255 was it?) but I guess
given the 'creep' issue that's sensible.
I don't think that's much different from what David is proposing,
given that eagerness=9 would be 50%.
At 10 or 511, no matter what, you will only ever collapse to the
largest enabled order.
The difference in my approach is that technically, with PMD disabled
and 511, you would still need 50% utilization to collapse, which is
not ideal if you always want to collapse to some mTHP size even with 1
page occupied. With David's solution this is solved by never allowing
anything between 255 and 511.
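
A small sketch of the corner case I mean, assuming the cap from the series
(max_ptes_none at or above HPAGE_PMD_NR/2 is treated as HPAGE_PMD_NR/2 - 1
for mTHP) followed by a right shift by (HPAGE_PMD_ORDER - order). The helper
names, the "nr_none <= limit" comparison and the 4K page size values are
assumptions for illustration, not the actual patch code.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for 4K base pages and 2M PMDs; illustration only. */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR (1U << HPAGE_PMD_ORDER)

/* Hypothetical: effective limit of empty PTEs for a given collapse order. */
static unsigned int mthp_max_ptes_none(unsigned int max_ptes_none,
				       unsigned int order)
{
	assert(order <= HPAGE_PMD_ORDER);
	/* Cap only applies below PMD order, to avoid the "creep" issue. */
	if (order < HPAGE_PMD_ORDER && max_ptes_none >= HPAGE_PMD_NR / 2)
		max_ptes_none = HPAGE_PMD_NR / 2 - 1;
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

/* Hypothetical: nr_none is the count of none/zero PTEs in the range. */
static bool capped_collapse_allowed(unsigned int max_ptes_none,
				    unsigned int order, unsigned int nr_none)
{
	return nr_none <= mthp_max_ptes_none(max_ptes_none, order);
}
```

With max_ptes_none = 511 and order 4, the limit ends up at 7 empty PTEs out
of 16, so a range with a single page occupied (15 empty) is not collapsed.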
quoted
It's just when it comes to documenting all that stuff in patch #15 that I feel like
"alright, we shouldn't be doing it longterm like that, so let's not make anybody
depend on any weird behavior here by over-documenting it".

I mean

"
+To prevent "creeping" behavior where collapses continuously promote to larger
+orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
+capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
+that introducing more than half of the pages to be non-zero it will always
+satisfy the eligibility check on the next scan and the region will be collapse.
"

Is just way, way too detailed.

I would just say "The kernel might decide to use a more conservative approach
when collapsing smaller THPs" etc.


Thoughts?
Well I've sort of reviewed oppositely there :) well at least that it needs to be
a hell of a lot clearer (I find that comment really compressed and I just don't
really understand it).
I think your review is still valid to improve the internal code
comment. I think David is suggesting to not be so specific in the
actual admin-guide docs as we move towards a more opaque tunable.
I guess I didn't think about people reading that and relying on it, so maybe we
could alternatively make that succinct.

But I think it'd be better to say something like "mTHP collapse cannot currently
correctly function with half or more of the PTE entries empty, so we cap at just
below this level" in this case.
Some middle ground might be the best answer: not too specific, but
also alluding to the inner workings a little.

Cheers,
-- Nico
quoted
--
Cheers

David / dhildenb
Cheers, Lorenzo
  