Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()

From: Yu Zhao <hidden>
Date: 2023-02-23 19:38:08
Also in: kvm, kvmarm, linux-arm-kernel, linux-mm, lkml

On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson [off-list ref] wrote:

On Thu, Feb 23, 2023, Yu Zhao wrote:

On Thu, Feb 23, 2023 at 10:43 AM Sean Christopherson [off-list ref] wrote:

On Thu, Feb 16, 2023, Yu Zhao wrote:

  kswapd (MGLRU before)
    100.00%  balance_pgdat
      100.00%  shrink_node
        100.00%  shrink_one
          99.97%  try_to_shrink_lruvec
            99.06%  evict_folios
              97.41%  shrink_folio_list
                31.33%  folio_referenced
                  31.06%  rmap_walk_file
                    30.89%  folio_referenced_one
                      20.83%  __mmu_notifier_clear_flush_young
                        20.54%  kvm_mmu_notifier_clear_flush_young
  =>                      19.34%  _raw_write_lock

  kswapd (MGLRU after)
    100.00%  balance_pgdat
      100.00%  shrink_node
        100.00%  shrink_one
          99.97%  try_to_shrink_lruvec
            99.51%  evict_folios
              71.70%  shrink_folio_list
                7.08%  folio_referenced
                  6.78%  rmap_walk_file
                    6.72%  folio_referenced_one
                      5.60%  lru_gen_look_around
  =>                    1.53%  __mmu_notifier_test_clear_young

Do you happen to know how much of the improvement is due to batching, and how
much is due to using a walkless walk?

No. I have three benchmarks running at the moment:
1. Windows SQL server guest on x86 host,
2. Apache Spark guest on arm64 host, and
3. Memcached guest on ppc64 host.

If you are really interested in that, I can reprioritize -- I need to
stop 1) and use that machine to get the number for you.

After looking at the "MGLRU before" stack again, it's definitely worth getting
those numbers.  The "before" isn't just taking mmu_lock, it's taking mmu_lock for
write _and_ flushing remote TLBs on _every_ PTE.

Correct.

I suspect the batching is a
tiny percentage of the overall win (might be larger with RETPOLINE and friends),

Same here.

and that the bulk of the improvement comes from avoiding the insanity of
kvm_mmu_notifier_clear_flush_young().

Speaking of which, what would it take to drop mmu_notifier_clear_flush_young()
entirely?

That's not my call :)

Adding Johannes.

I.e. why can MGLRU tolerate stale information but !MGLRU cannot?

Good question. The native clear API doesn't flush:

  int ptep_clear_flush_young(struct vm_area_struct *vma,
                             unsigned long address, pte_t *ptep)
  {
          /*
           * On x86 CPUs, clearing the accessed bit without a TLB flush
           * doesn't cause data corruption. [ It could cause incorrect
           * page aging and the (mistaken) reclaim of hot pages, but the
           * chance of that should be relatively low. ]
           *
           * So as a performance optimization don't flush the TLB when
           * clearing the accessed bit, it will eventually be flushed by
           * a context switch or a VM operation anyway. [ In the rare
           * event of it not getting flushed for a long time the delay
           * shouldn't really matter because there's no real memory
           * pressure for swapout to react to. ]
           */
          return ptep_test_and_clear_young(vma, address, ptep);
  }
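
To make the semantics above concrete, here is a minimal userspace sketch of "test and clear the accessed bit without flushing". It is a toy model, not kernel code: `toy_pte_t`, `TOY_PTE_ACCESSED`, and `toy_test_and_clear_young()` are hypothetical names invented for illustration, and the only detail borrowed from real hardware is that x86 keeps the accessed bit at bit 5 of a PTE.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy PTE: a 64-bit word whose bit 5 models the x86 accessed bit. */
#define TOY_PTE_ACCESSED (UINT64_C(1) << 5)

typedef _Atomic uint64_t toy_pte_t;

/*
 * Atomically clear the accessed bit and report whether it was set.
 * There is deliberately no TLB flush here: a cached translation could
 * keep the page looking "not young" for a while even though it is in
 * use, which is the staleness the quoted comment describes.
 */
static bool toy_test_and_clear_young(toy_pte_t *pte)
{
        uint64_t old = atomic_fetch_and(pte, ~TOY_PTE_ACCESSED);

        return (old & TOY_PTE_ACCESSED) != 0;
}
```

Calling it twice on a PTE with the bit set returns true and then false, and the other bits of the entry are left untouched. In this model the cost of skipping the flush is only that aging information can go stale, never that data is corrupted.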

If
we simply deleted mmu_notifier_clear_flush_young() and used mmu_notifier_clear_young()
instead, would anyone notice, let alone care?

I tend to agree.

@@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
      if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
              caps |= BIT(LRU_GEN_NONLEAF_YOUNG);

+     if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
+             caps |= BIT(LRU_GEN_SPTE_WALK);

As alluded to in patch 1, unless batching the walks even if KVM does _not_ support
a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(),
I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK.  Or
if we want to avoid batching when there are no mmu_notifier listeners, probe
mmu_notifiers.  But don't call into KVM directly.

I'm not sure I fully understand. Let's present the problem on the MM
side: assuming KVM supports lockless walks, batching can still be
worse (very unlikely), because GFNs can exhibit no memory locality at
all. So this option allows userspace to disable batching.

I'm asking the opposite.  Is there a scenario where batching+lock is worse than
!batching+lock?  If not, then don't make batching depend on lockless walks.

Yes, absolutely. batching+lock means we take/release mmu_lock for
every single PTE in the entire VA space -- each small batch contains
64 PTEs but the entire batch is the whole KVM.

I fully understand why you don't want MM to call into KVM directly. Are
there no acceptable ways to set up a clear interface between MM and KVM
other than the MMU notifier?

There are several options I can think of, but before we go spend time designing
the best API, I'd rather figure out if we care in the first place.

This is self-serving -- MGLRU would be the only user in the near
future. But I never assume there will be no common ground; at least it
doesn't hurt to check.
