Re: [PATCH v10 00/13] khugepaged: mTHP support
From: Nico Pache <npache@redhat.com>
Date: 2025-08-21 15:27:49
Also in: linux-doc, linux-mm, lkml
On Thu, Aug 21, 2025 at 9:25 AM Nico Pache [off-list ref] wrote:
> On Thu, Aug 21, 2025 at 9:20 AM Lorenzo Stoakes [off-list ref] wrote:
> > On Thu, Aug 21, 2025 at 08:43:18PM +0530, Dev Jain wrote:
> > > On 21/08/25 8:31 pm, Lorenzo Stoakes wrote:
> > > > OK so I noticed in patch 13/13 (!) where you change the documentation that you essentially state that the whole method used to determine the ratio of PTEs to collapse to mTHP is broken:
> > > >
> > > > > khugepaged uses max_ptes_none scaled to the order of the enabled mTHP size to determine collapses. When using mTHPs it's recommended to set max_ptes_none low-- ideally less than HPAGE_PMD_NR / 2 (255 on 4k page size). This will prevent undesired "creep" behavior that leads to continuously collapsing to the largest mTHP size; when we collapse, we are bringing in new non-zero pages that will, on a subsequent scan, cause the max_ptes_none check of the +1 order to always be satisfied. By limiting this to less than half the current order, we make sure we don't cause this feedback loop.
> > > > >
> > > > > max_ptes_shared and max_ptes_swap have no effect when collapsing to a mTHP, and mTHP collapse will fail on shared or swapped out pages.
> > > >
> > > > This seems to me to suggest that using /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none as some means of establishing a 'ratio' to do this calculation is fundamentally flawed.
> > > >
> > > > So surely we ought to introduce a new sysfs tunable for this? Perhaps
> > > >
> > > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_max_ptes_none_ratio
> > > >
> > > > Or something like this?
> > > >
> > > > It's already questionable that we are taking a value that is expressed essentially in terms of PTE entries per PMD and then use it implicitly to determine the ratio for mTHP, but to then say 'oh but the default value is known-broken' is just a blocker for the series in my opinion.
> > > >
> > > > This really has to be done a different way I think.
> > > >
> > > > Cheers, Lorenzo
> > >
> > > FWIW this was my version of the documentation patch:
> > > https://lore.kernel.org/all/20250211111326.14295-18-dev.jain@arm.com/
> > >
> > > The discussion about the creep problem started here:
> > > https://lore.kernel.org/all/7098654a-776d-413b-8aca-28f811620df7@arm.com/
> > >
> > > and the discussion continuing here:
> > > https://lore.kernel.org/all/37375ace-5601-4d6c-9dac-d1c8268698e9@redhat.com/
> > >
> > > ending with a summary I gave here:
> > > https://lore.kernel.org/all/8114d47b-b383-4d6e-ab65-a0e88b99c873@arm.com/
> > >
> > > This should help you with the context.
> >
> > Thanks and I'll have a look, but this series is unmergeable with a broken default in
> >
> > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none
> >
> > sorry.
> >
> > We need to have a new tunable as far as I can tell. I also find the use of this PMD-specific value as an arbitrary way of expressing a ratio pretty gross.
>
> The first thing that comes to mind is that we can pin max_ptes_none to 255 if it exceeds 255. It's worth noting that the issue occurs only for adjacently enabled mTHP sizes.
>
> i.e.:
>     if (order != HPAGE_PMD_ORDER && khugepaged_max_ptes_none > 255)
>             temp_max_ptes_none = 255;

Oh and my second point, introducing a new tunable to control mTHP collapse may become exceedingly complex from a tuning and code management standpoint.
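
To put numbers on the cap idea, here is a rough userspace sketch. It is not the code from the series: it assumes 4K pages (HPAGE_PMD_ORDER of 9) and that the scaling is a plain right shift of max_ptes_none by (HPAGE_PMD_ORDER - order). The helper names scaled_max_ptes_none() and collapse_creeps() are made up for illustration only.

```c
#include <stdbool.h>

#define HPAGE_PMD_ORDER 9                      /* assumes 4K pages: 512 PTEs per PMD */
#define HPAGE_PMD_NR    (1 << HPAGE_PMD_ORDER)

/*
 * Hypothetical helper, not from the series: scale the PMD-based
 * max_ptes_none down to a given collapse order (assumed to be a shift).
 * With cap == true, mTHP orders use at most HPAGE_PMD_NR / 2 - 1 (255).
 */
static int scaled_max_ptes_none(int order, int max_ptes_none, bool cap)
{
	int limit = HPAGE_PMD_NR / 2 - 1;

	if (cap && order != HPAGE_PMD_ORDER && max_ptes_none > limit)
		max_ptes_none = limit;

	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

/*
 * After collapsing to 'order', the enclosing order + 1 range has at most
 * 1 << order empty PTEs (the other half). If that alone passes the
 * order + 1 check, the next scan collapses again: that is the creep.
 * Only meaningful while order + 1 is still an mTHP (non-PMD) order.
 */
static bool collapse_creeps(int order, int max_ptes_none, bool cap)
{
	int none_after = 1 << order;

	return none_after <= scaled_max_ptes_none(order + 1, max_ptes_none, cap);
}
```

With max_ptes_none = 511 and no cap, the order-5 limit under this assumption is 31, so a range that was just collapsed to order 4 (at most 16 empty PTEs left in the enclosing order-5 range) passes again on the next scan. With the cap the order-5 limit drops to 15 and it does not.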
> > Thanks, Lorenzo