Re: [PATCH v11 00/15] khugepaged: mTHP support
From: Nico Pache <npache@redhat.com>
Date: 2025-09-13 00:29:24
Also in: linux-doc, linux-mm, lkml

On Fri, Sep 12, 2025 at 12:22 PM Lorenzo Stoakes wrote:
> On Fri, Sep 12, 2025 at 07:53:22PM +0200, David Hildenbrand wrote:
> > On 12.09.25 17:51, Lorenzo Stoakes wrote:
> > > With all this stuff said, do we have an actual plan for what we intend to do _now_?
> >
> > Oh no, no I have to use my brain and it's Friday evening.
>
> I apologise :)
>
> > > As Nico has implemented a basic solution here that we all seem to agree is not what we want. Without needing special new hardware or major reworks, what would this parameter look like? What would the heuristics be? What about the eagerness scales? I'm but a simple kernel developer,
> >
> > :)
> >
> > > and interested in simple pragmatic stuff :)
> > > do you have a plan right now David?
> >
> > Ehm, if you ask me that way ...
> >
> > > Maybe we can start with something simple like a rough percentage per eagerness entry that then gets scaled based on utilisation?
> >
> > ... I think we should probably:
> >
> > 1) Start with something very simple for mTHP that doesn't lock us into any particular direction.
>
> Yes.
>
> > 2) Add an "eagerness" parameter with fixed scale and use that for mTHP as well
>
> Yes I think we're all pretty onboard with that it seems!
>
> > 3) Improve that "eagerness" algorithm using a dynamic scale or #whatever
>
> Right, I feel like we could start with some very simple linear thing here and later maybe refine it?

I agree, something like 0, 32, 64, 128, 255, 511 seems to map well, and is not too different from what I'm doing with the scaling by (HPAGE_PMD_ORDER - order).
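
To make the scale above concrete, here is a minimal userspace sketch. It assumes 4K pages (HPAGE_PMD_ORDER of 9) and assumes that "scaling by (HPAGE_PMD_ORDER - order)" means a right shift of max_ptes_none. The table `eagerness_steps` and the helpers `scaled_max_ptes_none()` and `step_limit()` are hypothetical names for illustration only, and how the six values would map onto actual eagerness levels is not settled in this thread.

```c
#include <assert.h>

/* Values mirror x86-64 with 4K pages; redefined so the sketch builds in userspace. */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR    (1u << HPAGE_PMD_ORDER)   /* 512 PTEs per PMD */

/* Hypothetical table: the candidate steps floated in the thread.
 * The index is NOT an agreed eagerness level, just a position in the list. */
static const unsigned int eagerness_steps[] = { 0, 32, 64, 128, 255, 511 };
#define NR_STEPS (sizeof(eagerness_steps) / sizeof(eagerness_steps[0]))

/* Assumption: scale a PMD-level max_ptes_none down to a smaller collapse
 * order by shifting right by (HPAGE_PMD_ORDER - order). */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
                                         unsigned int order)
{
    assert(order <= HPAGE_PMD_ORDER);
    assert(max_ptes_none < HPAGE_PMD_NR);
    return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

/* Number of empty PTEs tolerated for a given table position and order. */
static unsigned int step_limit(unsigned int step, unsigned int order)
{
    assert(step < NR_STEPS);
    return scaled_max_ptes_none(eagerness_steps[step], order);
}
```

Under these assumptions, the 255 step applied to an order-4 region (16 PTEs) tolerates 255 >> 5 = 7 empty PTEs, while the 32 step tolerates only 1.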

> > 4) Solve world peace and world hunger
>
> Yes! That would be pretty great ;)

This should probably be a larger priority

> > 5) Connect it all to memory pressure / reclaim / shrinker / heuristics / hw hotness / #whatever
>
> I think these are TODOs :)
>
> > I maintain my initial position that just using
> >
> > max_ptes_none == 511 -> collapse mTHP always
> > max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
> >
> > As a starting point is probably simple and best, and likely leaves room for any changes later.
>
> Yes.
>
> > Of course, we could do what Nico is proposing here, as 1) and change it all later.
>
> Right. But that does mean for mTHP we're limited to 256 (or 255 was it?) but I guess given the 'creep' issue that's sensible.

I don't think that's much different from what David is proposing, given that eagerness=9 would be 50%. At 10 (or 511), no matter what, you will only ever collapse to the largest enabled order. The difference in my approach is that technically, with PMD disabled and max_ptes_none at 511, you would still need 50% utilization to collapse, which is not ideal if you always want to collapse to some mTHP size even with 1 page occupied. David's solution solves this by never allowing anything between 255 and 511.
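
The difference between the two approaches can be sketched side by side. This is a userspace illustration under stated assumptions, not the code from the series: it assumes 4K pages, assumes the capped value is scaled by a right shift, and reads "collapse mTHP always" as "one populated PTE is enough". Both function names are hypothetical.

```c
#include <assert.h>

#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR    (1u << HPAGE_PMD_ORDER)   /* 512 on 4K pages */

/* Assumed shape of the capped-scaling approach: for mTHP orders, cap
 * max_ptes_none at HPAGE_PMD_NR/2 - 1 (255), then scale it by order. */
static unsigned int limit_capped_scaling(unsigned int max_ptes_none,
                                         unsigned int order)
{
    assert(order <= HPAGE_PMD_ORDER && max_ptes_none < HPAGE_PMD_NR);
    if (order == HPAGE_PMD_ORDER)
        return max_ptes_none;            /* PMD collapse: tunable unchanged */

    unsigned int cap = HPAGE_PMD_NR / 2 - 1;
    if (max_ptes_none > cap)
        max_ptes_none = cap;
    return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

/* Assumed shape of the all-or-nothing rule: 511 means "always collapse"
 * (interpreted here as: every PTE but one may be empty), anything else
 * means no empty PTEs are tolerated for mTHP. */
static unsigned int limit_all_or_nothing(unsigned int max_ptes_none,
                                         unsigned int order)
{
    assert(order <= HPAGE_PMD_ORDER && max_ptes_none < HPAGE_PMD_NR);
    if (order == HPAGE_PMD_ORDER)
        return max_ptes_none;
    if (max_ptes_none == HPAGE_PMD_NR - 1)
        return (1u << order) - 1;
    return 0;
}
```

With max_ptes_none at 511 and an order-4 region (16 PTEs), the capped-scaling sketch tolerates 7 empty PTEs, so 9 of 16 must be populated, whereas the all-or-nothing sketch tolerates 15.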

> > It's just when it comes to documenting all that stuff in patch #15 that I feel like "alright, we shouldn't be doing it longterm like that, so let's not make anybody depend on any weird behavior here by over-documenting it".
> >
> > I mean
> >
> > "
> > +To prevent "creeping" behavior where collapses continuously promote to larger
> > +orders, if max_ptes_none >= HPAGE_PMD_NR/2 (255 on 4K page size), it is
> > +capped to HPAGE_PMD_NR/2 - 1 for mTHP collapses. This is due to the fact
> > +that introducing more than half of the pages to be non-zero it will always
> > +satisfy the eligibility check on the next scan and the region will be collapse.
> > "
> >
> > Is just way, way too detailed.
> >
> > I would just say "The kernel might decide to use a more conservative approach when collapsing smaller THPs" etc.
> >
> > Thoughts?
>
> Well I've sort of reviewed oppositely there :) well at least that it needs to be a hell of a lot clearer (I find that comment really compressed and I just don't really understand it).

I think your review is still valid to improve the internal code comment. I think David is suggesting to not be so specific in the actual admin-guide docs as we move towards a more opaque tunable.

> I guess I didn't think about people reading that and relying on it, so maybe we could alternatively make that succinct. But I think it'd be better to say something like "mTHP collapse cannot currently correctly function with half or more of the PTE entries empty, so we cap at just below this level" in this case.

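
The "creep" that the cap guards against can be illustrated with a toy model. This is only a sketch under simplifying assumptions: 4K pages, a right-shift scaling of max_ptes_none, and the same threshold applied at every order including PMD. The function name `creep_final_order()` is hypothetical.

```c
#include <assert.h>

#define HPAGE_PMD_ORDER 9

/* Toy model: start with one fully populated region of start_order and
 * repeatedly test the next larger order, where that region fills exactly
 * half of the range and the other half is empty. Each successful collapse
 * populates the whole larger region. Returns the largest order reached. */
static unsigned int creep_final_order(unsigned int max_ptes_none,
                                      unsigned int start_order)
{
    unsigned int order = start_order;

    assert(start_order <= HPAGE_PMD_ORDER);
    while (order < HPAGE_PMD_ORDER) {
        unsigned int next = order + 1;
        unsigned int nr_none = 1u << order;   /* empty half of the next order */
        unsigned int limit = max_ptes_none >> (HPAGE_PMD_ORDER - next);

        if (nr_none > limit)
            break;                            /* eligibility check fails */
        order = next;                         /* collapsed, now fully populated */
    }
    return order;
}
```

In this model a threshold of 256 lets an order-3 region climb all the way to order 9 over successive scans, while 255 stops it at order 3, which is the reasoning behind capping just below half.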
Some middle ground might be the best answer: not too specific, but still alluding to the inner workings a little.

Cheers,
-- Nico

> > --
> > Cheers
> >
> > David / dhildenb
>
> Cheers, Lorenzo