Thread (50 messages) 50 messages, 6 authors, 2025-07-02

Re: [PATCH v7 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse

From: Nico Pache <npache@redhat.com>
Date: 2025-07-02 00:00:37
Also in: linux-doc, linux-mm, lkml

On Fri, May 16, 2025 at 11:13 AM Liam R. Howlett
[off-list ref] wrote:
* Nico Pache [off-list ref] [250514 23:23]:
quoted
The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing.

Create khugepaged_collapse_single_pmd to increase code
reuse and create an entry point for future khugepaged changes.

Refactor madvise_collapse and khugepaged_scan_mm_slot to use
the new khugepaged_collapse_single_pmd function.

Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------
 1 file changed, 49 insertions(+), 47 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 806bcd8c5185..5457571d505a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2353,6 +2353,48 @@ static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr,
      return result;
 }

+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static int khugepaged_collapse_single_pmd(unsigned long addr,
+                                struct vm_area_struct *vma, bool *mmap_locked,
+                                struct collapse_control *cc)
+{
+     int result = SCAN_FAIL;
+     struct mm_struct *mm = vma->vm_mm;
+
+     if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {
why IS_ENABLED(CONFIG_SHMEM) here, it seems new?
Fixed in the next version. It was a mishandled rebase conflict.
quoted
+             struct file *file = get_file(vma->vm_file);
+             pgoff_t pgoff = linear_page_index(vma, addr);
+
+             mmap_read_unlock(mm);
+             *mmap_locked = false;
+             result = khugepaged_scan_file(mm, addr, file, pgoff, cc);
+             fput(file);
+             if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
+                     mmap_read_lock(mm);
+                     *mmap_locked = true;
+                     if (khugepaged_test_exit_or_disable(mm)) {
+                             result = SCAN_ANY_PROCESS;
+                             goto end;
+                     }
+                     result = collapse_pte_mapped_thp(mm, addr,
+                                                      !cc->is_khugepaged);
+                     if (result == SCAN_PMD_MAPPED)
+                             result = SCAN_SUCCEED;
+                     mmap_read_unlock(mm);
+                     *mmap_locked = false;
+             }
+     } else {
+             result = khugepaged_scan_pmd(mm, vma, addr, mmap_locked, cc);
+     }
+     if (cc->is_khugepaged && result == SCAN_SUCCEED)
+             ++khugepaged_pages_collapsed;
+end:
+     return result;
This function can return with mmap_read_locked or unlocked..
quoted
+}
+
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
                                          struct collapse_control *cc)
      __releases(&khugepaged_mm_lock)
@@ -2427,34 +2469,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result,
                      VM_BUG_ON(khugepaged_scan.address < hstart ||
                                khugepaged_scan.address + HPAGE_PMD_SIZE >
                                hend);
-                     if (!vma_is_anonymous(vma)) {
-                             struct file *file = get_file(vma->vm_file);
-                             pgoff_t pgoff = linear_page_index(vma,
-                                             khugepaged_scan.address);
-
-                             mmap_read_unlock(mm);
-                             mmap_locked = false;
-                             *result = hpage_collapse_scan_file(mm,
-                                     khugepaged_scan.address, file, pgoff, cc);
-                             fput(file);
-                             if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
-                                     mmap_read_lock(mm);
-                                     if (hpage_collapse_test_exit_or_disable(mm))
-                                             goto breakouterloop;
-                                     *result = collapse_pte_mapped_thp(mm,
-                                             khugepaged_scan.address, false);
-                                     if (*result == SCAN_PMD_MAPPED)
-                                             *result = SCAN_SUCCEED;
-                                     mmap_read_unlock(mm);
-                             }
-                     } else {
-                             *result = hpage_collapse_scan_pmd(mm, vma,
-                                     khugepaged_scan.address, &mmap_locked, cc);
-                     }
-
-                     if (*result == SCAN_SUCCEED)
-                             ++khugepaged_pages_collapsed;

+                     *ngle_pmd(khugepaged_scan.address,
+                                             vma, &mmap_locked, cc);
+                     /* If we return SCAN_ANY_PROCESS we are holding the mmap_lock */
But this comment makes it obvious that you know that..
quoted
+                     if (*result == SCAN_ANY_PROCESS)
+                             goto breakouterloop;
But later..

breakouterloop:
        mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
breakouterloop_mmap_lock:


So if you return with SCAN_ANY_PROCESS, we are holding the lock and go
immediately and drop it.  This seems unnecessarily complicated and
involves a lock.
SCAN_ANY_PROCESS indicates that the process we are working on has
either exited, or THPs have been disabled mid-scan. So we have to drop
the lock regardless.
That would leave just the khugepaged_scan_pmd() path with the
unfortunate locking mess - which is a static function and called in one
location..

Looking at what happens after the return seems to indicate we could
clean that up as well, sometime later.
I see your point, all other instances handle the unlock within their
own function and this one should too. instead of handling the unlock
in the parent function I should just return with it unlocked and have
the already established if(!mmap_locked) do the cleanup.
quoted
                      /* move to next address */
                      khugepaged_scan.address += HPAGE_PMD_SIZE;
                      progress += HPAGE_PMD_NR;
@@ -2773,36 +2793,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
              mmap_assert_locked(mm);
              memset(cc->node_load, 0, sizeof(cc->node_load));
              nodes_clear(cc->alloc_nmask);
-             if (!vma_is_anonymous(vma)) {
-                     struct file *file = get_file(vma->vm_file);
-                     pgoff_t pgoff = linear_page_index(vma, addr);

-                     mmap_read_unlock(mm);
-                     mmap_locked = false;
-                     result = hpage_collapse_scan_file(mm, addr, file, pgoff,
-                                                       cc);
-                     fput(file);
-             } else {
-                     result = hpage_collapse_scan_pmd(mm, vma, addr,
-                                                      &mmap_locked, cc);
-             }
+             result = khugepaged_collapse_single_pmd(addr, vma, &mmap_locked, cc);
+
              if (!mmap_locked)
                      *prev = NULL;  /* Tell caller we dropped mmap_lock */

-handle_result:
              switch (result) {
              case SCAN_SUCCEED:
              case SCAN_PMD_MAPPED:
                      ++thps;
                      break;
              case SCAN_PTE_MAPPED_HUGEPAGE:
-                     BUG_ON(mmap_locked);
-                     BUG_ON(*prev);
-                     mmap_read_lock(mm);
-                     result = collapse_pte_mapped_thp(mm, addr, true);
-                     mmap_read_unlock(mm);
-                     goto handle_result;
All of the above should probably be replaced with a BUG_ON(1) since it's
not expected now?  Or at least WARN_ON_ONCE(), but it should be safe to
continue if that's the case.
I dont think we should warn as this is the return value for indicating
that we are trying to collapse to a mTHP that is smaller than the
already established folio (see __collapse_huge_page_isolate), but
continuing should be ok.
It looks like the mmap_locked boolean is used to ensure that *prev is
safe, but we are now dropping the lock and re-acquiring it (and
potentially returning here) with it set to true, so perv will not be set
to NULL like it should.
Luckily Lorenzo just cleaned this up with the madvise code changes he
made, but yes you are correct.
I think you can handle this by ensuring that
khugepaged_collapse_single_pmd() returns with mmap_locked false in the
SCAN_ANY_PROCESS case.
quoted
-             /* Whitelisted set of results where continuing OK */
This seems worth keeping?
I'll add that back, thanks.
quoted
              case SCAN_PMD_NULL:
              case SCAN_PTE_NON_PRESENT:
              case SCAN_PTE_UFFD_WP:
I guess SCAN_ANY_PROCESS should be handled by the default case
statement?  It should probably be added to the switch?
I believe it should be handled by the default case, since we dont want
to continue, so we break out as intended.
That is to say, before your change the result would come from either
hpage_collapse_scan_file(), then lead to collapse_pte_mapped_thp()
above.
In the khugepaged case we do the following check
(khugepaged_test_exit_or_disable) before calling pte_mapped_thp, but
we weren't doing it in the madvise_collapse case. seems like we had a
bug lingering or unnecessary code in the original implementation (its
been that way since day 1). I can note the slight difference in the
commit log. I believe having the same check for both is wise, although
now I have to ask why we arent using the revalidate function like all
other callers do when they drop the lock. I will note this small
difference in the commit log, and will invest some time in the future
into cleaning up this madness. I think unifying these two callers into
one, as I'm trying to do here, will make these behavioral deviations
harder in the future, and we can have sanity knowing there is *mostly*
one way to call the collapse.
Now, you can have khugepaged_test_exit_or_disable() happen to return
SCAN_ANY_PROCESS and it will fall through to the default in this switch
statement, which seems like new behaviour?

At the very least, this information should be added to the git log on
what this patch does - if it's expected?
Will do, thanks for the thought provoking review, I had to do some
digging to verify this one :)

-- Nico
Thanks,
Liam
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help