Thread (79 messages) 79 messages, 8 authors, 2025-09-15

Re: [PATCH v11 06/15] khugepaged: introduce collapse_max_ptes_none helper function

From: Lorenzo Stoakes <hidden>
Date: 2025-09-15 10:31:12
Also in: linux-doc, linux-mm, lkml

On Fri, Sep 12, 2025 at 05:26:03PM -0600, Nico Pache wrote:
On Fri, Sep 12, 2025 at 7:36 AM Lorenzo Stoakes
[off-list ref] wrote:
quoted
On Thu, Sep 11, 2025 at 09:28:01PM -0600, Nico Pache wrote:
quoted
The current mechanism for determining mTHP collapse scales the
khugepaged_max_ptes_none value based on the target order. This
introduces an undesirable feedback loop, or "creep", when max_ptes_none
is set to a value greater than HPAGE_PMD_NR / 2.

With this configuration, a successful collapse to order N will populate
enough pages to satisfy the collapse condition on order N+1 on the next
scan. This leads to unnecessary work and memory churn.

To fix this issue introduce a helper function that caps the max_ptes_none
to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
the max_ptes_none number by the (PMD_ORDER - target collapse order).
I would say very clearly that this is only in the mTHP case.
ack, I stole most of the verbiage here from other notes I've
previously written, but it can be improved.
Thanks.
quoted
quoted
Signed-off-by: Nico Pache <npache@redhat.com>
Hmm I thought we were going to wait for David to investigate different
approaches to this?

This is another issue with quickly going to another iteration. Though I do think
David explicitly said he'd come back with a solution?
Sorry I thought that was being done in lockstep. The last version was
about a month ago and I had a lot of changes queued up. Now that we
have collapse_max_pte_none() David has a much easier entry point to
work off :)
It'd be less problematic if this was an RFC but better to ensure all discussion
is really complete before next revision for everybody's sanity.

The ideal solution here would be to just ask David if he minded you respinning
before that was resovled I think.

But I do think generally, as I said in [0] that it'd make our lives easier if
you left perhaps a day or two after the conversation has settled just to be sure
that's that, and obviously directly ask about things like this.

I can only politely ask for this, obviously you're free to do whatever... :)

[0]: https://lore.kernel.org/all/2d5270e4-0de3-4ea3-87a4-96254eb6d446@lucifer.local/ (local)
I think he will still need this groundwork for the solution he is
working on with "eagerness". if 10 -> 511, and 9 ->255, ..., 0 -> 0.
It will still have to do the scaling. Although I believe 0-10 should
be more like 0-5 mapping to 0,32,64,128,255,511
Yeah, let's leave that discussion to the subthread on that.
quoted
So I'm not sure why we're seeing this solution here? Unless I'm missing
something?
quoted
---
 mm/khugepaged.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index b0ae0b63fc9b..4587f2def5c1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -468,6 +468,26 @@ void __khugepaged_enter(struct mm_struct *mm)
              wake_up_interruptible(&khugepaged_wait);
 }

+/* Returns the scaled max_ptes_none for a given order.
We don't start comments at the /*, please use a normal comment format like:
ack
Thanks
quoted
/*
 * xxxx
 */
quoted
+ * Caps the value to HPAGE_PMD_NR/2 - 1 in the case of mTHP collapse to prevent
This is super unclear.

It start with 'caps the xxx' which seems like you're talking generally.

You should say very clearly 'For PMD allocations we apply the
khugepaged_max_ptes_none parameter as normal. For mTHP ... [details about mTHP].
ack I will clean this up.
Thanks.
quoted
quoted
+ * a feedback loop. If max_ptes_none is greater than HPAGE_PMD_NR/2, the value
+ * would lead to collapses that introduces 2x more pages than the original
+ * number of pages. On subsequent scans, the max_ptes_none check would be
+ * satisfied and the collapses would continue until the largest order is reached
+ */
This is a super vauge explanation. Please describe the issue with creep more
clearly.
ok I will try to come up with something clearer.
Thanks.
quoted
Also aren't we saying that 511 or 0 are the sensible choices? But now somehow
that's not the case?
Oh I stated I wanted to propose this, and although there was some
pushback I still thought it deserved another attempt. This still
allows for some configurability, and with David's eagerness toggle
this still seems to fit nicely.
quoted
You're also not giving a kdoc info on what this returns.
Ok I'll add a kdoc here, why this function in particular, I'm trying
to understand why we dont add kdocs on other functions?
quoted
quoted
+static int collapse_max_ptes_none(unsigned int order)
It's a problem that existed already, but khugepaged_max_ptes_none is an unsigned
int and this returns int.

Maybe we should fix this while we're at it...
ack
Thanks
quoted
quoted
+{
+     int max_ptes_none;
+
+     if (order != HPAGE_PMD_ORDER &&
+         khugepaged_max_ptes_none >= HPAGE_PMD_NR/2)
+             max_ptes_none = HPAGE_PMD_NR/2 - 1;
+     else
+             max_ptes_none = khugepaged_max_ptes_none;
+     return max_ptes_none >> (HPAGE_PMD_ORDER - order);
+
+}
+
I really don't like this formulation, you're making it unnecessarily unclear and
now, for the super common case of PMD size, you have to figure out 'oh it's this
second branch and we're subtracting HPAGE_PMD_ORDER from HPAGE_PMD_ORDER so just
return khugepaged_max_ptes_none'. When we could... just return it no?

So something like:

#define MAX_PTES_NONE_MTHP_CAP (HPAGE_PMD_NR / 2 - 1)

static unsigned int collapse_max_ptes_none(unsigned int order)
{
        unsigned int max_ptes_none_pmd;

        /* PMD-sized THPs behave precisely the same as before. */
        if (order == HPAGE_PMD_ORDER)
                return khugepaged_max_ptes_none;

        /*
        * Bizarrely, this is expressed in terms of PTEs were this PMD-sized.
        * For the reasons stated above, we cap this value in the case of mTHP.
        */
        max_ptes_none_pmd = MIN(MAX_PTES_NONE_MTHP_CAP,
                khugepaged_max_ptes_none);

        /* Apply PMD -> mTHP scaling. */
        return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}
yeah that's much cleaner thanks!
:) Cool thanks!
quoted
quoted
 void khugepaged_enter_vma(struct vm_area_struct *vma,
                        vm_flags_t vm_flags)
 {
@@ -554,7 +574,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
      struct folio *folio = NULL;
      pte_t *_pte;
      int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
-     int scaled_max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
+     int scaled_max_ptes_none = collapse_max_ptes_none(order);
      const unsigned long nr_pages = 1UL << order;

      for (_pte = pte; _pte < pte + nr_pages;
--
2.51.0
Thanks, Lorenzo
Cheers, Lorenzo
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help