Thread (61 messages) 61 messages, 7 authors, 2025-09-25

Re: [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged

From: Yafang Shao <hidden>
Date: 2025-09-14 02:20:31
Also in: bpf, linux-mm

On Fri, Sep 12, 2025 at 9:48 PM Lorenzo Stoakes
[off-list ref] wrote:
On Fri, Sep 12, 2025 at 02:17:01PM +0800, Yafang Shao wrote:
quoted
On Thu, Sep 11, 2025 at 11:58 PM Lorenzo Stoakes
[off-list ref] wrote:
quoted
On Wed, Sep 10, 2025 at 10:44:41AM +0800, Yafang Shao wrote:
quoted
Currently, THP allocation cannot be restricted to khugepaged alone while
being disabled in the page fault path. This limitation exists because
disabling THP allocation during page faults also prevents the execution of
khugepaged_enter_vma() in that path.
This is quite confusing, I see what you mean - you want to be able to disable
page fault THP but not khugepaged THP _at the point of possibly faulting in a
THP aligned VMA_.

It seems this patch makes khugepaged_enter_vma() unconditional for an anonymous
VMA, rather than depending on the return value specified by
thp_vma_allowable_order().
The functions thp_vma_allowable_order(TVA_PAGEFAULT) and
thp_vma_allowable_order(TVA_KHUGEPAGED) are functionally equivalent
within the page fault handler; they always yield the same result.
Consequently, their execution order is irrelevant.
It seems hard to definitely demonstrate that by checking !in_pf vs not in this
situation :) but it seems broadly true afaict.

So they differ only in that one starts khugepaged, the other tries to
establish a THP on fault via create_huge_pmd().
right
quoted
The change reorders these two calls and, in doing so, also moves the
call to vmf_anon_prepare(vmf). This alters the control flow:
- before this change:  The logic checked the return value of
vmf_anon_prepare() between the two thp_vma_allowable_order() calls.

    thp_vma_allowable_order(TVA_PAGEFAULT);
    ret = vmf_anon_prepare(vmf);
    if (ret)
        return ret;
    thp_vma_allowable_order(TVA_KHUGEPAGED);
I mean it's also _only if_ the TVA_PAGEFAULT invocation succeeds that the
TVA_KHUGEPAGED one happens.
quoted
 - after this change: The logic now executes both
thp_vma_allowable_order() calls first and does not check the return
value of vmf_anon_prepare().

    thp_vma_allowable_order(TVA_KHUGEPAGED);
    thp_vma_allowable_order(TVA_PAGEFAULT);
    ret = vmf_anon_prepare(vmf); // Return value 'ret' is ignored.
Hm this is confusing, your code does:

+       if (pmd_none(*vmf.pmd)) {
+               if (vma_is_anonymous(vma))
+                       khugepaged_enter_vma(vma, vm_flags);
+               if (thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) {
+                       ret = create_huge_pmd(&vmf);
+                       if (!(ret & VM_FAULT_FALLBACK))
+                               return ret;
+               }

So the ret is absolutely not ignored, but whether it succeeds or not, we still
invoke khugepaged_enter_vma().

Previously we would not have one this had vmf_anon_prepare() failed in
do_huge_pmd_anonymous_page().

Which I guess is what you mean?
quoted
This change is safe because the return value of vmf_anon_prepare() can
be safely ignored. This function checks for transient system-level
conditions (e.g., memory pressure, THP availability) that might
prevent an immediate THP allocation. It does not guarantee that a
subsequent allocation will succeed.

This behavior is consistent with the policy in hugepage_madvise(),
where a VMA is queued for khugepaged before a definitive allocation
check. If the system is under pressure, khugepaged will simply retry
the allocation at a more opportune time.
OK. I do note though that the khugepaged being kicked off is at mm_struct level.
The unit of operation for khugepaged is the mm_struct itself. It
processes the entire mm even when only a single VMA within it is a
candidate for a THP.
So us trying to invoke khugepaged on the mm again is about.. something having
changed that would previously have prevented us but now doesn't?

That is, a product of thp_vma_allowable_order() right?

So probably a sysfs change or similar?

But I guess it makes sense to hook in BPF whenever this is the case because this
_could_ be the point at which khugepaged enters the mm, and we want to select
the allowable order at this time.

So on basis of the two checks being effectively equivalent (on assumption this
is always the case) then the change is fairly reasonable.
Yes, that is exactly what I mean.
Though I would put this information, that the checks are equivalent, in the
commit message so it's really clear.
will add it.

-- 
Regards
Yafang
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help