Thread (24 messages) 24 messages, 4 authors, 2025-08-01

Re: [PATCH v2 3/5] mm/huge_memory: treat MADV_COLLAPSE as an advise with PR_THP_DISABLE_EXCEPT_ADVISED

From: Usama Arif <hidden>
Date: 2025-08-01 11:26:07
Also in: linux-fsdevel, linux-mm, lkml
Subsystem: documentation, memory management - misc, memory management - thp (transparent huge page), the rest · Maintainers: Jonathan Corbet, Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Linus Torvalds


On 31/07/2025 15:54, David Hildenbrand wrote:
On 31.07.25 16:38, Lorenzo Stoakes wrote:
quoted
Nits on subject:

- It's >75 chars
No big deal. If we cna come up with something shorter, good.
Changed it to "mm/huge_memory: respect MADV_COLLAPSE with PR_THP_DISABLE_EXCEPT_ADVISED"
for the next revision. That would be 73 chars :)
quoted
- advise is the verb, advice is the noun.
Yeah.
quoted
On Thu, Jul 31, 2025 at 01:27:20PM +0100, Usama Arif wrote:
quoted
From: David Hildenbrand <redacted>

Let's allow for making MADV_COLLAPSE succeed on areas that neither have
VM_HUGEPAGE nor VM_NOHUGEPAGE when we have THP disabled
unless explicitly advised (PR_THP_DISABLE_EXCEPT_ADVISED).
Hmm, I'm not sure about this.

So far this prctl() has been the only way to override MADV_COLLAPSE
behaviour, but now we're allowing for this one case to not.
This is not an override really. prctl() disallowed MADV_COLLAPSE, but in the new mode we don't want that anymore.
quoted
quoted
I suppose the precedent is that MADV_COLLAPSE overrides 'madvise' sysfs
behaviour.
quoted
I suppose what saves us here is 'advised' can be read to mean either
MADV_HUGEPAGE or MADV_COLLAPSE.
quoted
And yes, MADV_COLLAPSE is clearly the user requesting this behaviour.
Exactly.
quoted
I think the vagueness here is one that already existed, because one could
perfectly one have expected MADV_COLLAPSE to obey sysfs and require
MADV_HUGEPAGE to have been applied, but of course this is not the case.
Yes.
quoted
OK so fine.

BUT.

I think the MADV_COLLAPSE man page will need to be updated to mention this.
Yes.
Thanks, yes will do and send this along with changes to prctl man page after this
makes into mm-stable.
quoted
And I REALLY think we should update the THP doc too to mention all these
prctl() modes.

I'm not sure we cover that right now _at all_ and obviously we should
describe the new flags.

Usama - can you add a patch to this series to do that?
Good point, let's document the interaction with prctl().
I have added the following patch for the next revision. I know that a lot
of this will be in the man page as well, but I have gone the way of being very
very explicit of what are all the possible calls that can be made (hopefully thats
the right approach :))


commit 5f290d29741a514d0861d0f99c8b860ba6af9c37
Author: Usama Arif [off-list ref]
Date:   Fri Aug 1 12:05:49 2025 +0100

    docs: transhuge: document process level THP controls
    
    This includes the PR_SET_THP_DISABLE/PR_GET_THP_DISABLE pair of
    prctl calls as well the newly introduced PR_THP_DISABLE_EXCEPT_ADVISED
    flag for the PR_SET_THP_DISABLE prctl call.
    
    Signed-off-by: Usama Arif [off-list ref]
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 370fba113460..cce0a99beac8 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -225,6 +225,45 @@ to "always" or "madvise"), and it'll be automatically shutdown when
 PMD-sized THP is disabled (when both the per-size anon control and the
 top-level control are "never")
 
+process THP controls
+--------------------
+
+A process can control its own THP behaviour using the ``PR_SET_THP_DISABLE``
+and ``PR_GET_THP_DISABLE`` pair of prctl(2) calls. These calls support the
+following arguments::
+
+       prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0):
+               This will set the MMF_DISABLE_THP_COMPLETELY mm flag which will
+               result in no THPs being faulted in or collapsed, irrespective
+               of global THP controls. This flag and hence the behaviour is
+               inherited across fork(2) and execve(2).
+
+       prctl(PR_SET_THP_DISABLE, 1, PR_THP_DISABLE_EXCEPT_ADVISED, 0, 0):
+               This will set the MMF_DISABLE_THP_EXCEPT_ADVISED mm flag which
+               will result in THPs being faulted in or collapsed only for
+               the following cases:
+               - Global THP controls are set to "always" or "madvise" and
+                 the process has madvised the region with either MADV_HUGEPAGE
+                 or MADV_COLLAPSE.
+               - Global THP controls is set to "never" and the process has
+                 madvised the region with MADV_COLLAPSE.
+               This flag and hence the behaviour is inherited across fork(2)
+               and execve(2).
+
+       prctl(PR_SET_THP_DISABLE, 0, 0, 0, 0):
+               This will clear the MMF_DISABLE_THP_COMPLETELY and
+               MMF_DISABLE_THP_EXCEPT_ADVISED mm flags. The process will
+               behave according to the global THP controls. This behaviour
+               will be inherited across fork(2) and execve(2).
+
+       prctl(PR_GET_THP_DISABLE, 0, 0, 0, 0):
+               This will return the THP disable mm flag status of the process
+               that was set by prctl(PR_SET_THP_DISABLE, ...).
+               i.e.
+               - 1 if MMF_DISABLE_THP_COMPLETELY flag is set
+               - 3 if MMF_DISABLE_THP_EXCEPT_ADVISED flag is set
+               - 0 otherwise.
+
 Khugepaged controls
 -------------------

quoted
quoted
MADV_COLLAPSE is a clear advise that we want to collapse.
advise -> advice.
quoted
Note that we still respect the VM_NOHUGEPAGE flag, just like
MADV_COLLAPSE always does. So consequently, MADV_COLLAPSE is now only
refused on VM_NOHUGEPAGE with PR_THP_DISABLE_EXCEPT_ADVISED.
You also need to mention the shmem change you've made I think.
Yes.
quoted
quoted
quoted
Co-developed-by: Usama Arif <redacted>
Signed-off-by: Usama Arif <redacted>
Signed-off-by: David Hildenbrand <redacted>
---
  include/linux/huge_mm.h    | 8 +++++++-
  include/uapi/linux/prctl.h | 2 +-
  mm/huge_memory.c           | 5 +++--
  mm/memory.c                | 6 ++++--
  mm/shmem.c                 | 2 +-
  5 files changed, 16 insertions(+), 7 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b0ff54eee81c..aeaf93f8ac2e 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -329,7 +329,7 @@ struct thpsize {
   * through madvise or prctl.
   */
  static inline bool vma_thp_disabled(struct vm_area_struct *vma,
-        vm_flags_t vm_flags)
+        vm_flags_t vm_flags, bool forced_collapse)
  {
      /* Are THPs disabled for this VMA? */
      if (vm_flags & VM_NOHUGEPAGE)
@@ -343,6 +343,12 @@ static inline bool vma_thp_disabled(struct vm_area_struct *vma,
       */
      if (vm_flags & VM_HUGEPAGE)
          return false;
+    /*
+     * Forcing a collapse (e.g., madv_collapse), is a clear advise to
advise -> advice.
quoted
+     * use THPs.
+     */
+    if (forced_collapse)
+        return false;
      return test_bit(MMF_DISABLE_THP_EXCEPT_ADVISED, &vma->vm_mm->flags);
  }
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 9c1d6e49b8a9..ee4165738779 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -185,7 +185,7 @@ struct prctl_mm_map {
  #define PR_SET_THP_DISABLE    41
  /*
   * Don't disable THPs when explicitly advised (e.g., MADV_HUGEPAGE /
- * VM_HUGEPAGE).
+ * VM_HUGEPAGE / MADV_COLLAPSE).
This is confusing you're mixing VMA flags with MADV ones... maybe just
stick to madvise ones, or add extra context around VM_HUGEPAGE bit?
I don't see anything confusing here, really.

But if it helps you, we can do
    (e.g., MADV_HUGEPAGE / VM_HUGEPAGE, MADV_COLLAPSE).

(reason VM_HUGEPAGE is spelled out is that there might be code where we set VM_HUGEPAGE implicitly in the kernel)
quoted
Would need to be fixed up in a prior commit obviously.
quoted
   */
  # define PR_THP_DISABLE_EXCEPT_ADVISED    (1 << 1)
  #define PR_GET_THP_DISABLE    42
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 85252b468f80..ef5ccb0ec5d5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -104,7 +104,8 @@ unsigned long __thp_vma_allowable_orders(struct vm_area_struct *vma,
  {
      const bool smaps = type == TVA_SMAPS;
      const bool in_pf = type == TVA_PAGEFAULT;
-    const bool enforce_sysfs = type != TVA_FORCED_COLLAPSE;
+    const bool forced_collapse = type == TVA_FORCED_COLLAPSE;
+    const bool enforce_sysfs = !forced_collapse;
Can we just get rid of this enforce_sysfs altogether in patch 2/5 and use
forced_collapse?
Let's do that as a separate cleanup on top. I want least churn in that patch.

(had the same idea while writing that patch, but I have other things to focus on than cleaning up all this mess)
I am happy to send this cleanup once this series makes it to mm-new and a new revision is not expected.

Thanks!
Usama
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help