Re: [PATCH] mm: split thp synchronously on MADV_DONTNEED

From: Yang Shi <hidden>
Date: 2021-11-29 23:03:56
Also in: lkml

On Fri, Nov 26, 2021 at 1:04 AM David Hildenbrand [off-list ref] wrote:

On 26.11.21 03:52, Peter Xu wrote:

quoted

On Thu, Nov 25, 2021 at 11:32:08AM +0100, David Hildenbrand wrote:

quoted

On 25.11.21 11:24, Peter Xu wrote:

quoted

On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote:

quoted

Do we have a performance evaluation how much overhead is added e.g., for
a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that
covers the whole THP?

I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
x86 for both settings you suggested. I don't see any statistically
significant difference with and without the patch. Let me know if you
want me to try something else.

I'm a bit surprised that sync split thp didn't bring any extra overhead.

"unmap whole thp" is understandable from that pov, because afaict that won't
even trigger any thp split anyway even delayed, if this is the simplest case
that only this process mapped this thp, and it mapped once.

For "unmap 4k upon thp" IIUC that's the worst case and zapping 4k should be
fast; while what I don't understand since thp split requires all hand-made work
for copying thp flags into small pages and so on, so I thought there should at
least be some overhead measured.  Shakeel, could there be something overlooked
in the test, or maybe it's me that overlooked?

I had the same concern as what Kirill/Matthew raised in the other thread - I'm
worried proactively splitting simply because any 4k page is zapped might
quickly free up 2m thps in the system and I'm not sure whether it'll exaggerate
the defragmentation of the system memory in general.  I'm also not sure whether
that's ideal for some very common workload that frequently uses DONTNEED to
proactively drop some pages.

The pageblock corresponding to the THP is movable. So (unless we start
spilling unmovable allocations into movable pageblocks) we'd only place
movable allocations in there. Compaction will be able to migrate to
re-create a free THP.

In contrast I think, compaction will happily skip over the THP and
ignore it, because it has no clue that the THP could be repurposed by
split+migrate (at least I am not aware of code that does it).

Unless I am missing something, with the above in mind it could make
sense to split as soon as possible, even before we're under memory
pressure -- for example, for proactive compaction.

[proactive compaction could try splitting first as well I think]

But we can't rely on proactive compaction for rapid operations, because it's
still adding overhead to the overall system by split+merge, right?

Yes, but there is also direct compaction that can be triggered without
the shrinker getting involved. I think we can summarize as "there might
not be a right or wrong when to split". An application that
MADV_DONTNEEDs/munmap sub-THP memory told us that it doesn't want to
consume memory, yet it looks like it's still consuming that memory.

I do wonder how THP on the deferred split queue behave in respect to
page migration -- memory offlining, alloc_contig_range(). I saw reports

IIUC the old page that is on deferred split queue would be freed and
off the list once the migration is done. But the new page is *NOT*
added to the deferred split list at all. This would not cause any
severe bugs but the partial unmapped pages can no longer be shrunk
under memory pressure.

that there are some cases where THP can be problematic when
stress-testing THP:
https://lkml.kernel.org/r/20210903162102.GA10039@mam-gdavis-dt

But not sure if that's related to deferred splitting. Most probably not.

quoted

+compaction_proactiveness
+========================
+ ...
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.

Delaying split makes sense to me because after all the kernel is not aware of
the userspace's preference, so the best thing is to do nothing until necessary.

Proactively split thps in dontneed/unmap added an assumption that the userspace
wants to break the pages by default.  It's 100% true for Shakeel's use case,
but I'm not sure whether it may always be true.  That's why I thought maybe a
new interface is more proper, so we at least won't break anyone by accident.

Well, we already broke the PMD into PTEs. So the performance gain at
least for that user is really gone until we "fix that" again via
khugepaged -- which might just be configured to not "fix" if there are
empty PTEs.

It for sure is interesting if you have a COW huge page and only one
party zaps/unmaps some part.


--
Thanks,

David / dhildenb

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help