Re: [PATCH v7 02/12] introduce khugepaged_collapse_single_pmd to unify khugepaged and madvise_collapse
From: Nico Pache <npache@redhat.com>
Date: 2025-07-02 00:00:37
Also in:
linux-doc, linux-mm, lkml
On Fri, May 16, 2025 at 11:13 AM Liam R. Howlett [off-list ref] wrote:
* Nico Pache [off-list ref] [250514 23:23]:quoted
The khugepaged daemon and madvise_collapse have two different implementations that do almost the same thing. Create khugepaged_collapse_single_pmd to increase code reuse and create an entry point for future khugepaged changes. Refactor madvise_collapse and khugepaged_scan_mm_slot to use the new khugepaged_collapse_single_pmd function. Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Signed-off-by: Nico Pache <npache@redhat.com> --- mm/khugepaged.c | 96 +++++++++++++++++++++++++------------------------ 1 file changed, 49 insertions(+), 47 deletions(-)diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 806bcd8c5185..5457571d505a 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c@@ -2353,6 +2353,48 @@ static int khugepaged_scan_file(struct mm_struct *mm, unsigned long addr, return result; } +/* + * Try to collapse a single PMD starting at a PMD aligned addr, and return + * the results. + */ +static int khugepaged_collapse_single_pmd(unsigned long addr, + struct vm_area_struct *vma, bool *mmap_locked, + struct collapse_control *cc) +{ + int result = SCAN_FAIL; + struct mm_struct *mm = vma->vm_mm; + + if (IS_ENABLED(CONFIG_SHMEM) && !vma_is_anonymous(vma)) {why IS_ENABLED(CONFIG_SHMEM) here, it seems new?
Fixed in the next version. It was a mishandled rebase conflict.
quoted
+ struct file *file = get_file(vma->vm_file); + pgoff_t pgoff = linear_page_index(vma, addr); + + mmap_read_unlock(mm); + *mmap_locked = false; + result = khugepaged_scan_file(mm, addr, file, pgoff, cc); + fput(file); + if (result == SCAN_PTE_MAPPED_HUGEPAGE) { + mmap_read_lock(mm); + *mmap_locked = true; + if (khugepaged_test_exit_or_disable(mm)) { + result = SCAN_ANY_PROCESS; + goto end; + } + result = collapse_pte_mapped_thp(mm, addr, + !cc->is_khugepaged); + if (result == SCAN_PMD_MAPPED) + result = SCAN_SUCCEED; + mmap_read_unlock(mm); + *mmap_locked = false; + } + } else { + result = khugepaged_scan_pmd(mm, vma, addr, mmap_locked, cc); + } + if (cc->is_khugepaged && result == SCAN_SUCCEED) + ++khugepaged_pages_collapsed; +end: + return result;This function can return with mmap_read_locked or unlocked..quoted
+} + static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, struct collapse_control *cc) __releases(&khugepaged_mm_lock)@@ -2427,34 +2469,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, int *result, VM_BUG_ON(khugepaged_scan.address < hstart || khugepaged_scan.address + HPAGE_PMD_SIZE > hend); - if (!vma_is_anonymous(vma)) { - struct file *file = get_file(vma->vm_file); - pgoff_t pgoff = linear_page_index(vma, - khugepaged_scan.address); - - mmap_read_unlock(mm); - mmap_locked = false; - *result = hpage_collapse_scan_file(mm, - khugepaged_scan.address, file, pgoff, cc); - fput(file); - if (*result == SCAN_PTE_MAPPED_HUGEPAGE) { - mmap_read_lock(mm); - if (hpage_collapse_test_exit_or_disable(mm)) - goto breakouterloop; - *result = collapse_pte_mapped_thp(mm, - khugepaged_scan.address, false); - if (*result == SCAN_PMD_MAPPED) - *result = SCAN_SUCCEED; - mmap_read_unlock(mm); - } - } else { - *result = hpage_collapse_scan_pmd(mm, vma, - khugepaged_scan.address, &mmap_locked, cc); - } - - if (*result == SCAN_SUCCEED) - ++khugepaged_pages_collapsed; + *ngle_pmd(khugepaged_scan.address, + vma, &mmap_locked, cc); + /* If we return SCAN_ANY_PROCESS we are holding the mmap_lock */But this comment makes it obvious that you know that..quoted
+ if (*result == SCAN_ANY_PROCESS) + goto breakouterloop;But later.. breakouterloop: mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */ breakouterloop_mmap_lock: So if you return with SCAN_ANY_PROCESS, we are holding the lock and go immediately and drop it. This seems unnecessarily complicated and involves a lock.
SCAN_ANY_PROCESS indicates that the process we are working on has either exited, or THPs have been disabled mid-scan. So we have to drop the lock regardless.
That would leave just the khugepaged_scan_pmd() path with the unfortunate locking mess - which is a static function and called in one location.. Looking at what happens after the return seems to indicate we could clean that up as well, sometime later.
I see your point, all other instances handle the unlock within their own function and this one should too. instead of handling the unlock in the parent function I should just return with it unlocked and have the already established if(!mmap_locked) do the cleanup.
quoted
/* move to next address */ khugepaged_scan.address += HPAGE_PMD_SIZE; progress += HPAGE_PMD_NR;@@ -2773,36 +2793,18 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev, mmap_assert_locked(mm); memset(cc->node_load, 0, sizeof(cc->node_load)); nodes_clear(cc->alloc_nmask); - if (!vma_is_anonymous(vma)) { - struct file *file = get_file(vma->vm_file); - pgoff_t pgoff = linear_page_index(vma, addr); - mmap_read_unlock(mm); - mmap_locked = false; - result = hpage_collapse_scan_file(mm, addr, file, pgoff, - cc); - fput(file); - } else { - result = hpage_collapse_scan_pmd(mm, vma, addr, - &mmap_locked, cc); - } + result = khugepaged_collapse_single_pmd(addr, vma, &mmap_locked, cc); + if (!mmap_locked) *prev = NULL; /* Tell caller we dropped mmap_lock */ -handle_result: switch (result) { case SCAN_SUCCEED: case SCAN_PMD_MAPPED: ++thps; break; case SCAN_PTE_MAPPED_HUGEPAGE: - BUG_ON(mmap_locked); - BUG_ON(*prev); - mmap_read_lock(mm); - result = collapse_pte_mapped_thp(mm, addr, true); - mmap_read_unlock(mm); - goto handle_result;All of the above should probably be replaced with a BUG_ON(1) since it's not expected now? Or at least WARN_ON_ONCE(), but it should be safe to continue if that's the case.
I dont think we should warn as this is the return value for indicating that we are trying to collapse to a mTHP that is smaller than the already established folio (see __collapse_huge_page_isolate), but continuing should be ok.
It looks like the mmap_locked boolean is used to ensure that *prev is safe, but we are now dropping the lock and re-acquiring it (and potentially returning here) with it set to true, so perv will not be set to NULL like it should.
Luckily Lorenzo just cleaned this up with the madvise code changes he made, but yes you are correct.
I think you can handle this by ensuring that khugepaged_collapse_single_pmd() returns with mmap_locked false in the SCAN_ANY_PROCESS case.quoted
- /* Whitelisted set of results where continuing OK */This seems worth keeping?
I'll add that back, thanks.
quoted
case SCAN_PMD_NULL: case SCAN_PTE_NON_PRESENT: case SCAN_PTE_UFFD_WP:I guess SCAN_ANY_PROCESS should be handled by the default case statement? It should probably be added to the switch?
I believe it should be handled by the default case, since we dont want to continue, so we break out as intended.
That is to say, before your change the result would come from either hpage_collapse_scan_file(), then lead to collapse_pte_mapped_thp() above.
In the khugepaged case we do the following check (khugepaged_test_exit_or_disable) before calling pte_mapped_thp, but we weren't doing it in the madvise_collapse case. seems like we had a bug lingering or unnecessary code in the original implementation (its been that way since day 1). I can note the slight difference in the commit log. I believe having the same check for both is wise, although now I have to ask why we arent using the revalidate function like all other callers do when they drop the lock. I will note this small difference in the commit log, and will invest some time in the future into cleaning up this madness. I think unifying these two callers into one, as I'm trying to do here, will make these behavioral deviations harder in the future, and we can have sanity knowing there is *mostly* one way to call the collapse.
Now, you can have khugepaged_test_exit_or_disable() happen to return SCAN_ANY_PROCESS and it will fall through to the default in this switch statement, which seems like new behaviour? At the very least, this information should be added to the git log on what this patch does - if it's expected?
Will do, thanks for the thought provoking review, I had to do some digging to verify this one :) -- Nico
Thanks, Liam