Re: [RFC 0/2] mm: introduce THP deferred setting

From: Nico Pache <npache@redhat.com>
Date: 2024-08-26 21:15:15
Also in: linux-mm, lkml

On Mon, Aug 26, 2024 at 10:47 AM Usama Arif [off-list ref] wrote:



On 26/08/2024 11:40, Nico Pache wrote:

quoted

On Tue, Jul 30, 2024 at 4:37 PM Nico Pache [off-list ref] wrote:

quoted

Hi Zi Yan,
On Mon, Jul 29, 2024 at 7:26 PM Zi Yan [off-list ref] wrote:

quoted

+Kirill

On 29 Jul 2024, at 18:27, Nico Pache wrote:

quoted

We've seen cases where customers switching from RHEL7 to RHEL8 see a
significant increase in the memory footprint for the same workloads.

Through our investigations we found that a large contributing factor to
the increase in RSS was an increase in THP usage.

Was any knob changed from RHEL7 to RHEL8 that would cause more THP usage?

IIRC, most of the system tuning is the same. We attributed the
increase in THP usage to a combination of improvements in the kernel
and improvements in the libraries (better alignment), which allowed
THP allocations to succeed at a higher rate. I can go back and confirm
this tomorrow, though.

quoted

For workloads like MySQL, or when using allocators like jemalloc, it is
often recommended to set /transparent_hugepage/enabled=never. This is
in part due to performance degradations and increased memory waste.

This series introduces enabled=defer. This setting acts as a middle
ground between always and madvise. If the mapping is MADV_HUGEPAGE, the
page fault handler will act normally, making a hugepage if possible. If
the allocation is not MADV_HUGEPAGE, then the page fault handler will
default to the base size allocation. The caveat is that khugepaged can
still operate on pages that are not MADV_HUGEPAGE.
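
For readers unfamiliar with how a mapping becomes MADV_HUGEPAGE, here is
a minimal Python sketch. It is not part of this series. It assumes
Linux, Python 3.8 or later (for mmap.madvise), and a 2 MiB PMD size; the
helper name map_with_hugepage_hint is hypothetical.

```python
import mmap

# Assumption: 2 MiB PMD-sized THP (e.g. x86-64 with 4 KiB base pages).
PMD_SIZE = 2 * 1024 * 1024


def map_with_hugepage_hint(length=4 * PMD_SIZE):
    """Create a private anonymous mapping and try to mark it MADV_HUGEPAGE.

    Returns (mapping, hinted). hinted is False when the constant or the
    madvise method is unavailable, or when the kernel rejects the hint
    (for example, a kernel built without THP support).
    """
    buf = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    hinted = False
    if hasattr(mmap, "MADV_HUGEPAGE") and hasattr(buf, "madvise"):
        try:
            buf.madvise(mmap.MADV_HUGEPAGE)
            hinted = True
        except OSError:
            pass  # hint refused; the mapping is still usable
    return buf, hinted
```

Under the proposed "defer" setting, a mapping hinted this way would still
be eligible for a THP at fault time, while an unhinted one would wait for
khugepaged.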

Why? If the user does not explicitly want huge pages, why bother providing
them? Wouldn't it increase the memory footprint?

So we have "always", which will always try to allocate a THP when it
can. This setting gives good performance in a lot of conditions, but
tends to waste memory. Additionally, applications DON'T need to be
modified to take advantage of THPs.

We have "madvise", which will only satisfy allocations that are
MADV_HUGEPAGE. This gives you granular control, and a lot of times
these madvises come from libraries. Unlike "always", you DO need to
modify your application if you want to use THPs.

Then we have "never", which of course, never allocates THPs.

OK, back to your question. Like "madvise", "defer" gives you the
benefits of THPs when you specifically know you want them
(MADV_HUGEPAGE), but it also benefits applications that don't
specifically ask for them (or can't be modified to ask for them), like
"always" does. The applications that don't ask for THPs must wait for
khugepaged to get them (avoiding insertions at PF time). This curbs a
lot of memory waste and gives increased tunability over "always".
Another added benefit is that khugepaged will most likely not operate
on short-lived allocations, meaning that only long-standing memory will
be collapsed to THPs.
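
The four settings described above can be summarized as a small decision
table. The following is a simplified Python model of the behaviour as
described in this thread, not the kernel implementation. The function
names are made up for illustration, and the model ignores
MADV_NOHUGEPAGE, alignment constraints, and allocation failures.

```python
MODES = ("always", "madvise", "defer", "never")


def fault_allocates_thp(mode, vma_has_madv_hugepage):
    """Would the page-fault path try to install a THP for this mapping?"""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "always":
        return True
    if mode in ("madvise", "defer"):
        # "defer" behaves like "madvise" at fault time.
        return vma_has_madv_hugepage
    return False  # "never"


def khugepaged_may_collapse(mode, vma_has_madv_hugepage):
    """May khugepaged later collapse this mapping into a THP?"""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode in ("always", "defer"):
        # "defer" behaves like "always" for khugepaged.
        return True
    if mode == "madvise":
        return vma_has_madv_hugepage
    return False  # "never"
```

The only row that differs between the two functions for an unhinted
mapping is "defer": no THP at fault time, but still a candidate for
khugepaged.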

The memory waste can be tuned with max_ptes_none. Let's say you want
~90% of your PMD to be full before collapsing into a huge page: simply
set max_ptes_none=64. For no waste, set max_ptes_none=0, which requires
all 512 pages to be present before being collapsed.
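
To make the arithmetic concrete, here is a small Python sketch of how
max_ptes_none maps to the minimum fill level of a PMD. It assumes 512
base pages per PMD, as stated above, and that the knob accepts values
from 0 to 511; the helper names are hypothetical. The knob itself lives
under /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none.

```python
HPAGE_PMD_NR = 512  # base pages per PMD-sized THP, per the thread


def min_present_ptes(max_ptes_none):
    """Fewest populated PTEs a PMD range needs before collapse is allowed."""
    if not 0 <= max_ptes_none <= HPAGE_PMD_NR - 1:
        raise ValueError("max_ptes_none must be in the range 0..511")
    return HPAGE_PMD_NR - max_ptes_none


def min_fill_fraction(max_ptes_none):
    """Same threshold expressed as a fraction of the PMD."""
    return min_present_ptes(max_ptes_none) / HPAGE_PMD_NR


# max_ptes_none=64  -> 448 of 512 PTEs present (87.5% full)
# max_ptes_none=0   -> all 512 PTEs present (100% full)
# max_ptes_none=511 -> a single present PTE is enough
```

So max_ptes_none=64 corresponds to 87.5%, which is the "~90%" figure
mentioned above.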

quoted

This allows for two things... one, applications specifically designed to
use hugepages will get them, and two, applications that don't use
hugepages can still benefit from them without aggressively inserting
THPs at every possible chance. This curbs the memory waste, and defers
the use of hugepages to khugepaged. khugepaged can then scan the memory
for regions eligible for collapse.

khugepaged would replace application memory with huge pages without a
specific goal. Why not use a user space agent with process_madvise() to
collapse huge pages? An admin might have more knobs to tweak than with
khugepaged.

The benefits of "always" are that no userspace agent is needed, and
applications don't have to be modified to use madvise(MADV_HUGEPAGE) to
benefit from THPs. This setting hopes to gain some of the same
benefits without the significant waste of memory, and with increased
tunability.

Future changes I have in the works are to make khugepaged "smarter",
moving it away from the round-robin fashion it currently operates in
and toward making informed decisions about what memory to collapse
(and potentially split).

Hopefully that helped explain the motivation for this new setting!

Any last comments before I resend this?

I've been made aware of
https://lore.kernel.org/all/20240730125346.1580150-1-usamaarif642@gmail.com/T/#u
which introduces THP splitting. These are both trying to achieve the
same thing through different means. Our approach leverages khugepaged
to promote pages, while Usama's uses the reclaim path to demote
hugepages and shrink the underlying memory.

I will leave it up to reviewers to determine which is better; however,
we can't have both, as we'd be introducing thrashing conditions.

Hi,

Just inserting this here from my cover letter:

Waiting for khugepaged to scan memory and
collapse pages into THP can be slow and unpredictable in terms of performance

Obviously not part of my patchset here, but I have been testing some
changes to khugepaged to make it more aware of what processes are hot.
Ideally then it can make better choices of what to operate on.

(i.e. you don't know when the collapse will happen), while production
environments require predictable performance. If there is enough memory
available, it's better for both performance and predictability to have
a THP from fault time, i.e. THP=always rather than wait for khugepaged
to collapse it, and deal with sparsely populated THPs when the system is
running out of memory.

I just went through your patches, and am not sure why we can't have both?

Fair point, we can. I've been playing around with splitting hugepages
via khugepaged and was thinking of the thrashing conditions there,
but your implementation takes a different approach.
I've been working on performance testing my "defer" changes. Once I
find the appropriate workloads, I'll try adding your changes to the
mix. I have a feeling my approach is better for latency-sensitive
workloads, while yours is better for throughput, but let me find a way
to confirm that.

Both use max_ptes_none as the tunable. If the number of zero-filled pages
is above max_ptes_none, the shrinker will split them, and khugepaged will not collapse
them (SCAN_EXCEED_NONE_PTE), so I don't see how it causes thrashing?
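
The argument above can be checked with a tiny Python model. This is a
sketch under stated assumptions, not code from either patchset: it
treats the empty PTEs seen by khugepaged before a collapse and the
zero-filled pages seen by the shrinker after one as the same count, and
both function names are hypothetical.

```python
HPAGE_PMD_NR = 512  # base pages per PMD-sized THP


def khugepaged_would_collapse(empty_pages, max_ptes_none):
    """Collapse is refused (SCAN_EXCEED_NONE_PTE) above the threshold."""
    return empty_pages <= max_ptes_none


def shrinker_would_split(empty_pages, max_ptes_none):
    """The shrinker splits a THP only when it is above the threshold."""
    return empty_pages > max_ptes_none


def thrashing_possible(max_ptes_none):
    """True if some fill level is both collapsible and splittable."""
    return any(
        khugepaged_would_collapse(n, max_ptes_none)
        and shrinker_would_split(n, max_ptes_none)
        for n in range(HPAGE_PMD_NR + 1)
    )
```

Because the two predicates are complements of each other for any given
threshold, thrashing_possible() is False for every max_ptes_none value
in this model. It says nothing about pages whose contents change between
the two checks.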

quoted

Cheers,
-- Nico

quoted

Cheers!
-- Nico

quoted

Admins may want to lower max_ptes_none; otherwise, khugepaged may
aggressively collapse single allocations into hugepages.

RFC note
==========
I'm not sure if I'm missing anything related to the mTHP
changes. I think now that we have hugepage_pmd_enabled in
commit 00f58104202c ("mm: fix khugepaged activation policy") everything
should work as expected.

Nico Pache (2):
  mm: defer THP insertion to khugepaged
  mm: document transparent_hugepage=defer usage

 Documentation/admin-guide/mm/transhuge.rst | 18 ++++++++++---
 include/linux/huge_mm.h                    | 15 +++++++++--
 mm/huge_memory.c                           | 31 +++++++++++++++++++---
 3 files changed, 55 insertions(+), 9 deletions(-)

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <redacted>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Lance Yang <redacted>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Rafael Aquini <redacted>
Cc: Andrea Arcangeli <redacted>
Cc: Jonathan Corbet <corbet@lwn.net>
--
2.45.2

--
Best Regards,
Yan, Zi
