Re: [PATCH v7 mm-new 04/10] mm: thp: enable THP allocation exclusively through khugepaged
From: Yafang Shao <hidden>
Date: 2025-09-14 02:20:31
Also in:
bpf, linux-mm
On Fri, Sep 12, 2025 at 9:48 PM Lorenzo Stoakes [off-list ref] wrote:
On Fri, Sep 12, 2025 at 02:17:01PM +0800, Yafang Shao wrote:quoted
On Thu, Sep 11, 2025 at 11:58 PM Lorenzo Stoakes [off-list ref] wrote:quoted
On Wed, Sep 10, 2025 at 10:44:41AM +0800, Yafang Shao wrote:quoted
Currently, THP allocation cannot be restricted to khugepaged alone while being disabled in the page fault path. This limitation exists because disabling THP allocation during page faults also prevents the execution of khugepaged_enter_vma() in that path.This is quite confusing, I see what you mean - you want to be able to disable page fault THP but not khugepaged THP _at the point of possibly faulting in a THP aligned VMA_. It seems this patch makes khugepaged_enter_vma() unconditional for an anonymous VMA, rather than depending on the return value specified by thp_vma_allowable_order().The functions thp_vma_allowable_order(TVA_PAGEFAULT) and thp_vma_allowable_order(TVA_KHUGEPAGED) are functionally equivalent within the page fault handler; they always yield the same result. Consequently, their execution order is irrelevant.It seems hard to definitely demonstrate that by checking !in_pf vs not in this situation :) but it seems broadly true afaict. So they differ only in that one starts khugepaged, the other tries to establish a THP on fault via create_huge_pmd().
right
quoted
The change reorders these two calls and, in doing so, also moves the call to vmf_anon_prepare(vmf). This alters the control flow: - before this change: The logic checked the return value of vmf_anon_prepare() between the two thp_vma_allowable_order() calls. thp_vma_allowable_order(TVA_PAGEFAULT); ret = vmf_anon_prepare(vmf); if (ret) return ret; thp_vma_allowable_order(TVA_KHUGEPAGED);I mean it's also _only if_ the TVA_PAGEFAULT invocation succeeds that the TVA_KHUGEPAGED one happens.quoted
- after this change: The logic now executes both thp_vma_allowable_order() calls first and does not check the return value of vmf_anon_prepare(). thp_vma_allowable_order(TVA_KHUGEPAGED); thp_vma_allowable_order(TVA_PAGEFAULT); ret = vmf_anon_prepare(vmf); // Return value 'ret' is ignored.Hm this is confusing, your code does: + if (pmd_none(*vmf.pmd)) { + if (vma_is_anonymous(vma)) + khugepaged_enter_vma(vma, vm_flags); + if (thp_vma_allowable_order(vma, vm_flags, TVA_PAGEFAULT, PMD_ORDER)) { + ret = create_huge_pmd(&vmf); + if (!(ret & VM_FAULT_FALLBACK)) + return ret; + } So the ret is absolutely not ignored, but whether it succeeds or not, we still invoke khugepaged_enter_vma(). Previously we would not have one this had vmf_anon_prepare() failed in do_huge_pmd_anonymous_page(). Which I guess is what you mean?quoted
This change is safe because the return value of vmf_anon_prepare() can be safely ignored. This function checks for transient system-level conditions (e.g., memory pressure, THP availability) that might prevent an immediate THP allocation. It does not guarantee that a subsequent allocation will succeed. This behavior is consistent with the policy in hugepage_madvise(), where a VMA is queued for khugepaged before a definitive allocation check. If the system is under pressure, khugepaged will simply retry the allocation at a more opportune time.OK. I do note though that the khugepaged being kicked off is at mm_struct level.
The unit of operation for khugepaged is the mm_struct itself. It processes the entire mm even when only a single VMA within it is a candidate for a THP.
So us trying to invoke khugepaged on the mm again is about.. something having changed that would previously have prevented us but now doesn't? That is, a product of thp_vma_allowable_order() right? So probably a sysfs change or similar? But I guess it makes sense to hook in BPF whenever this is the case because this _could_ be the point at which khugepaged enters the mm, and we want to select the allowable order at this time. So on basis of the two checks being effectively equivalent (on assumption this is always the case) then the change is fairly reasonable.
Yes, that is exactly what I mean.
Though I would put this information, that the checks are equivalent, in the commit message so it's really clear.
will add it. -- Regards Yafang