Re: [PATCH v7 mm-new 02/10] mm: thp: add support for BPF based THP order selection

From: Yafang Shao <hidden>
Date: 2025-09-14 02:22:41
Also in: bpf, linux-mm

On Fri, Sep 12, 2025 at 7:53 PM Lorenzo Stoakes
[off-list ref] wrote:

On Fri, Sep 12, 2025 at 04:28:46PM +0800, Yafang Shao wrote:

quoted

On Thu, Sep 11, 2025 at 10:34 PM Lorenzo Stoakes
[off-list ref] wrote:

quoted

On Wed, Sep 10, 2025 at 10:44:39AM +0800, Yafang Shao wrote:

quoted

This patch introduces a new BPF struct_ops called bpf_thp_ops for dynamic
THP tuning. It includes a hook bpf_hook_thp_get_order(), allowing BPF
programs to influence THP order selection based on factors such as:
- Workload identity
  For example, workloads running in specific containers or cgroups.
- Allocation context
  Whether the allocation occurs during a page fault, khugepaged, swap or
  other paths.
- VMA's memory advice settings
  MADV_HUGEPAGE or MADV_NOHUGEPAGE
- Memory pressure
  PSI system data or associated cgroup PSI metrics

The kernel API of this new BPF hook is as follows:

/**
 * @thp_order_fn_t: Get the suggested THP orders from a BPF program for allocation
 * @vma: vm_area_struct associated with the THP allocation
 * @vma_type: The VMA type, such as BPF_THP_VM_HUGEPAGE if VM_HUGEPAGE is set,
 *            BPF_THP_VM_NOHUGEPAGE if VM_NOHUGEPAGE is set, or BPF_THP_VM_NONE if
 *            neither is set.
 * @tva_type: TVA type for current @vma
 * @orders: Bitmask of requested THP orders for this allocation
 *          - PMD-mapped allocation if PMD_ORDER is set
 *          - mTHP allocation otherwise
 *
 * Return: The suggested THP order from the BPF program for allocation. It will
 *         not exceed the highest requested order in @orders. Return -1 to
 *         indicate that the original requested @orders should remain unchanged.
 */
typedef int thp_order_fn_t(struct vm_area_struct *vma,
                         enum bpf_thp_vma_type vma_type,
                         enum tva_type tva_type,
                         unsigned long orders);
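
The return contract documented above (a single suggested order, capped at the highest requested order, with -1 meaning "leave @orders alone") can be sketched in plain userspace C. This is only an illustration under stated assumptions, not the code from the patch: highest_order() and apply_suggested_order() are hypothetical names, and treating an out-of-range suggestion as "unchanged" is an assumed policy.

```c
#include <assert.h>

/* Hypothetical helper: index of the highest set bit, or -1 for an empty mask. */
static int highest_order(unsigned long orders)
{
	int order = -1;

	while (orders) {
		orders >>= 1;
		order++;
	}
	return order;
}

/*
 * Hypothetical translation of the hook's return value into an order bitmask.
 * Assumption: a suggestion above the highest requested order is ignored
 * rather than clamped.
 */
static unsigned long apply_suggested_order(int suggested, unsigned long orders)
{
	if (suggested < 0)			/* -1: keep @orders unchanged */
		return orders;
	if (suggested > highest_order(orders))	/* outside the documented contract */
		return orders;
	return 1UL << suggested;		/* a single order was selected */
}
```

With orders 9 and 4 requested (mask 0x210), a return of -1 keeps 0x210, a return of 9 yields 0x200, and a return of 10 is ignored. Note that a return of 2 yields 0x4 here even though order 2 was never requested; whether that should be permitted is exactly the masking question raised later in this thread.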

Only a single BPF program can be attached at any given time, though it can
be dynamically updated to adjust the policy. The implementation supports
anonymous THP, shmem THP, and mTHP, with future extensions planned for
file-backed THP.

This functionality is only active when system-wide THP is configured to
madvise or always mode. It remains disabled in never mode. Additionally,
if THP is explicitly disabled for a specific task via prctl(), this BPF
functionality will also be unavailable for that task.

This feature requires CONFIG_BPF_GET_THP_ORDER (marked EXPERIMENTAL) to be
enabled. Note that this capability is currently unstable and may undergo
significant changes—including potential removal—in future kernel versions.

Thanks for highlighting.

quoted

Suggested-by: David Hildenbrand <redacted>
Suggested-by: Lorenzo Stoakes <redacted>
Signed-off-by: Yafang Shao <redacted>
---
 MAINTAINERS             |   1 +
 include/linux/huge_mm.h |  26 ++++-
 mm/Kconfig              |  12 ++
 mm/Makefile             |   1 +
 mm/huge_memory_bpf.c    | 243 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 280 insertions(+), 3 deletions(-)
 create mode 100644 mm/huge_memory_bpf.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 8fef05bc2224..d055a3c95300 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS

@@ -16252,6 +16252,7 @@ F:    include/linux/huge_mm.h
 F:   include/linux/khugepaged.h
 F:   include/trace/events/huge_memory.h
 F:   mm/huge_memory.c
+F:   mm/huge_memory_bpf.c

Thanks!

quoted

 F:   mm/khugepaged.c
 F:   mm/mm_slot.h
 F:   tools/testing/selftests/mm/khugepaged.c

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 23f124493c47..f72a5fd04e4f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h

@@ -56,6 +56,7 @@ enum transparent_hugepage_flag {
      TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
      TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
      TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
+     TRANSPARENT_HUGEPAGE_BPF_ATTACHED,      /* BPF prog is attached */
 };

 struct kobject;

@@ -270,6 +271,19 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
                                       enum tva_type type,
                                       unsigned long orders);

+#ifdef CONFIG_BPF_GET_THP_ORDER
+unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
+                     enum tva_type type, unsigned long orders);

Thanks for renaming!

quoted

+#else
+static inline unsigned long
+bpf_hook_thp_get_orders(struct vm_area_struct *vma, vm_flags_t vma_flags,
+                     enum tva_type tva_flags, unsigned long orders)
+{
+     return orders;
+}
+#endif
+
 /**
  * thp_vma_allowable_orders - determine hugepage orders that are allowed for vma
  * @vma:  the vm area to check

@@ -291,6 +305,12 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
                                     enum tva_type type,
                                     unsigned long orders)
 {
+     unsigned long bpf_orders;
+
+     bpf_orders = bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
+     if (!bpf_orders)
+             return 0;

I think it'd be easier to just do:

        /* The BPF-specified order overrides which order is selected. */
        orders &= bpf_hook_thp_get_orders(vma, vm_flags, type, orders);
        if (!orders)
                return 0;

good suggestion!

Thanks, though this does come back to 'are we masking on orders' or not.

Obviously this is predicated on that being the case.
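
To make the masking variant concrete, here is a minimal userspace sketch, assuming the hook returns a bitmask of orders. The names order_hook_fn, allowable_orders() and the three example hooks are hypothetical, and order 9 as the PMD order assumes x86-64 with 4K pages.

```c
#include <assert.h>

/* Hypothetical hook type: takes the requested orders, returns a bitmask. */
typedef unsigned long (*order_hook_fn)(unsigned long orders);

/* Example hooks, purely illustrative. */
static unsigned long hook_passthrough(unsigned long orders)
{
	return orders;			/* no policy: allow what was requested */
}

static unsigned long hook_pmd_only(unsigned long orders)
{
	(void)orders;
	return 1UL << 9;		/* assumption: PMD order is 9 */
}

static unsigned long hook_none(unsigned long orders)
{
	(void)orders;
	return 0;			/* policy denies THP entirely */
}

/*
 * Masking semantics: the hook can only narrow the requested set. Any bit it
 * returns that was not requested is dropped by the AND.
 */
static unsigned long allowable_orders(unsigned long orders, order_hook_fn hook)
{
	orders &= hook(orders);
	if (!orders)
		return 0;
	/* The remaining eligibility checks would run on the narrowed mask. */
	return orders;
}
```

Under this reading, a hook that names an order outside the requested set ends up with an empty mask: requesting only order 4 (0x10) with hook_pmd_only gives 0, so no THP is allowed for that allocation.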

quoted

 struct thpsize {

diff --git a/mm/Kconfig b/mm/Kconfig
index d1ed839ca710..4d89d2158f10 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig

@@ -896,6 +896,18 @@ config NO_PAGE_MAPCOUNT

        EXPERIMENTAL because the impact of some changes is still unclear.

+config BPF_GET_THP_ORDER

Yeah, I think we maybe need to sledgehammer this, as Lance was already confused
as to the permanence of this, and I feel that users might be too, even with the
'(EXPERIMENTAL)' bit.

So maybe

config BPF_GET_THP_ORDER_EXPERIMENTAL

Just to hammer it home?

ack

Thanks!

quoted

+     bool "BPF-based THP order selection (EXPERIMENTAL)"
+     depends on TRANSPARENT_HUGEPAGE && BPF_SYSCALL
+
+     help
+       Enable dynamic THP order selection using BPF programs. This
+       experimental feature allows custom BPF logic to determine optimal
+       transparent hugepage allocation sizes at runtime.
+
+       WARNING: This feature is unstable and may change in future kernel
+       versions.
+
 endif # TRANSPARENT_HUGEPAGE

 # simple helper to make the code a bit easier to read

diff --git a/mm/Makefile b/mm/Makefile
index 21abb3353550..f180332f2ad0 100644
--- a/mm/Makefile
+++ b/mm/Makefile

@@ -99,6 +99,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
+obj-$(CONFIG_BPF_GET_THP_ORDER) += huge_memory_bpf.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o

diff --git a/mm/huge_memory_bpf.c b/mm/huge_memory_bpf.c
new file mode 100644
index 000000000000..525ee22ab598
--- /dev/null
+++ b/mm/huge_memory_bpf.c

@@ -0,0 +1,243 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF-based THP policy management
+ *
+ * Author: Yafang Shao <laoar.shao@gmail.com>
+ */
+
+#include <linux/bpf.h>
+#include <linux/btf.h>
+#include <linux/huge_mm.h>
+#include <linux/khugepaged.h>
+
+enum bpf_thp_vma_type {
+     BPF_THP_VM_NONE = 0,
+     BPF_THP_VM_HUGEPAGE,    /* VM_HUGEPAGE */
+     BPF_THP_VM_NOHUGEPAGE,  /* VM_NOHUGEPAGE */
+};

I'm really not so sure how useful this is - can't a user just ascertain this
from the VMA flags themselves?

I assume you are referring to checking flags from vma->vm_flags.
There is an exception where we cannot use vma->vm_flags: in
hugepage_madvise(), which calls khugepaged_enter_vma(vma, *vm_flags).

At this point, the VM_HUGEPAGE flag has not been set in vma->vm_flags
yet. Therefore, we must pass the separate *vm_flags variable.
Perhaps we can simplify the logic with the following change?

Ugh god.

I guess this is the workaround for the vm_flags thing, right?

quoted

diff --git a/mm/madvise.c b/mm/madvise.c
index 35ed4ab0d7c5..5755de80a4d7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c

@@ -1425,6 +1425,8 @@ static int madvise_vma_behavior(struct madvise_behavior *madv_behavior)
        VM_WARN_ON_ONCE(madv_behavior->lock_mode != MADVISE_MMAP_WRITE_LOCK);

        error = madvise_update_vma(new_flags, madv_behavior);
+       if (new_flags & VM_HUGEPAGE)
+               khugepaged_enter_vma(vma);

Hm ok, that's not such a bad idea, though ofc this should be something like:

        if (!error && (new_flags & VM_HUGEPAGE))
                khugepaged_enter_vma(vma);

ack

And obviously dropping this khugepaged_enter_vma() from hugepage_madvise().

Thanks for the reminder.
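
The ordering agreed on above (commit the new flags first, then register with khugepaged only if that succeeded) can be modelled in userspace as follows. This is a sketch with mock types, not kernel code: struct mock_vma, MOCK_VM_HUGEPAGE and the mock_* functions are all hypothetical stand-ins, and the "fail" parameter exists only to exercise the error path.

```c
#include <assert.h>

#define MOCK_VM_HUGEPAGE 0x1UL	/* hypothetical flag value */

struct mock_vma {
	unsigned long vm_flags;
	int khugepaged_registered;
};

/* Stand-in for the flag update: commits new_flags unless told to fail. */
static int mock_update_vma(struct mock_vma *vma, unsigned long new_flags,
			   int fail)
{
	if (fail)
		return -1;
	vma->vm_flags = new_flags;
	return 0;
}

/*
 * Stand-in for a khugepaged_enter_vma() that takes only the vma and reads
 * vma->vm_flags, so it must run after the flags have been committed.
 */
static void mock_khugepaged_enter(struct mock_vma *vma)
{
	if (vma->vm_flags & MOCK_VM_HUGEPAGE)
		vma->khugepaged_registered = 1;
}

static int mock_madvise_behavior(struct mock_vma *vma, unsigned long new_flags,
				 int fail)
{
	int error = mock_update_vma(vma, new_flags, fail);

	/* Register only when the update succeeded and the flag was requested. */
	if (!error && (new_flags & MOCK_VM_HUGEPAGE))
		mock_khugepaged_enter(vma);
	return error;
}
```

In this model a failed update leaves both vm_flags and the registration state untouched, which is the point of the added !error check.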

-- 
Regards
Yafang
