Thread (39 messages), 4 authors, 2023-02-23

Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

From: Yu Zhao <hidden>
Date: 2023-02-23 20:49:58
Also in: kvm, kvmarm, linux-arm-kernel, linux-mm, lkml

On Thu, Feb 23, 2023 at 1:29 PM Sean Christopherson [off-list ref] wrote:
On Thu, Feb 23, 2023, Yu Zhao wrote:
quoted
On Thu, Feb 23, 2023 at 12:58 PM Sean Christopherson [off-list ref] wrote:
quoted
On Thu, Feb 23, 2023, Yu Zhao wrote:
quoted
On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson [off-list ref] wrote:
quoted
On Thu, Feb 23, 2023, Yu Zhao wrote:
quoted
quoted
As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK.  Or
if we want to avoid batching when there are no mmu_notifier listeners, probe
mmu_notifiers.  But don't call into KVM directly.
I'm not sure I fully understand. Let's present the problem on the MM
side: assuming KVM supports lockless walks, batching can still be
worse (very unlikely), because GFNs can exhibit no memory locality at
all. So this option allows userspace to disable batching.
I'm asking the opposite.  Is there a scenario where batching+lock is worse than
!batching+lock?  If not, then don't make batching depend on lockless walks.
Yes, absolutely. batching+lock means we take/release mmu_lock for
every single PTE in the entire VA space -- each small batch contains
64 PTEs but the entire batch is the whole KVM.
Who is "we"?
Oops -- shouldn't have used "we".
quoted
I don't see anything in the kernel that triggers walking the whole
VMA, e.g. lru_gen_look_around() limits the walk to a single PMD.  I feel like I'm
missing something...
walk_mm() -> walk_pud_range() -> walk_pmd_range() -> walk_pte_range()
-> test_spte_young() -> mmu_notifier_test_clear_young().

MGLRU takes two passes: during the first pass, it sweeps the entire VA
space of each MM (per MM/KVM); during the second pass, it uses the rmap on each
folio (per folio).
Ah.  IIUC, userspace can use LRU_GEN_SPTE_WALK to control whether or not to walk
secondary MMUs, and the kernel further restricts LRU_GEN_SPTE_WALK to secondary
MMUs that implement a lockless walk.  And if the answer is "no", secondary MMUs
are simply not consulted.
Sorry for the bad naming -- probably LRU_GEN_SPTE_BATCH_WALK would be
less confusing.

MGLRU always consults the secondary MMU for each page it's going to
reclaim (during the second pass), i.e., it checks the A-bit in the
SPTE mapping a page (by the rmap) it plans to reclaim so that it won't
take a hot page away from KVM.

If the lockless walk is supported, MGLRU doesn't need to work at page
granularity: (physical) pages on the LRU list may have nothing in
common (e.g., from different processes), so checking their PTEs/SPTEs one
by one is expensive. Instead, it sweeps the entire KVM address space in the
first pass and checks the *adjacent SPTEs* of a page it plans to
reclaim in the second pass. Both rely on the *assumption* that there would
be some spatial locality to exploit. This assumption can be wrong, and
LRU_GEN_SPTE_WALK lets userspace disable the batching that relies on it.
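
To make the batched sweep concrete, here is a minimal userspace sketch of
test-and-clear over 64-entry batches. It is not kernel code: struct toy_mmu,
toy_test_clear_young_batch() and toy_sweep() are hypothetical names, and the
"secondary MMU" is assumed to be a flat array with one accessed bit per GFN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_SIZE 64 /* the thread mentions 64 PTEs per small batch */

/* Hypothetical stand-in for a secondary MMU: one accessed bit per GFN. */
struct toy_mmu {
	bool *young;    /* young[gfn] is true if the A-bit is set */
	size_t nr_gfns; /* number of entries in young[] */
};

/*
 * Test and clear the accessed bits of up to BATCH_SIZE GFNs starting at
 * 'start'. Bit i of the result is set if GFN (start + i) was young.
 */
static uint64_t toy_test_clear_young_batch(struct toy_mmu *mmu, size_t start)
{
	uint64_t bitmap = 0;

	assert(mmu && mmu->young);

	for (size_t i = 0; i < BATCH_SIZE && start + i < mmu->nr_gfns; i++) {
		if (mmu->young[start + i]) {
			bitmap |= (uint64_t)1 << i;
			mmu->young[start + i] = false;
		}
	}
	return bitmap;
}

/*
 * First-pass analogue: sweep the whole toy address space in batches and
 * return how many GFNs were young. All accessed bits are cleared.
 */
static size_t toy_sweep(struct toy_mmu *mmu)
{
	size_t nr_young = 0;

	for (size_t start = 0; start < mmu->nr_gfns; start += BATCH_SIZE) {
		uint64_t bitmap = toy_test_clear_young_batch(mmu, start);

		/* Count the set bits in this batch. */
		while (bitmap) {
			bitmap &= bitmap - 1;
			nr_young++;
		}
	}
	return nr_young;
}
```

The sweep only pays off if young entries cluster within a batch; with no
locality, most batches come back nearly empty, which is the case the option
is meant to cover.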
If that's correct, then the proper way to handle this is by extending mmu_notifier_ops
to query (a) if there's at least one registered listener that implements
test_clear_young() and (b) if all registered listeners that implement test_clear_young()
support lockless walks.  That avoids direct dependencies on KVM, and avoids making
assumptions that may not always hold true, e.g. that KVM is the only mmu_notifier
user that supports the young APIs.
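
A rough userspace sketch of the (a)/(b) query described above, for
illustration only. struct toy_notifier_ops, its lockless_walk flag and
toy_can_batch_walk() are hypothetical; nothing here is the actual
mmu_notifier_ops layout or an existing kernel helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one registered listener's ops. */
struct toy_notifier_ops {
	/* NULL if the listener does not implement the young API. */
	int (*test_clear_young)(unsigned long start, unsigned long end);
	/* Hypothetical capability flag: the callback can walk locklessly. */
	bool lockless_walk;
};

/* Placeholder callback so that a listener can claim to implement the API. */
static int toy_young_stub(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
	return 0;
}

/*
 * Return true only if (a) at least one listener implements
 * test_clear_young() and (b) every listener that implements it also
 * supports lockless walks. Listeners without the callback are ignored.
 */
static bool toy_can_batch_walk(const struct toy_notifier_ops **ops, size_t nr)
{
	bool found = false;

	assert(ops || nr == 0);

	for (size_t i = 0; i < nr; i++) {
		if (!ops[i] || !ops[i]->test_clear_young)
			continue;
		if (!ops[i]->lockless_walk)
			return false; /* condition (b) fails */
		found = true;         /* condition (a) holds so far */
	}
	return found;
}
```

The point of the shape is that the MM side asks the notifier layer a generic
question and never names KVM, so a second listener that implements the young
API is handled without further changes.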

P.S. all of this info absolutely belongs in documentation and/or changelogs.
Will do.