Re: [PATCH mm-unstable v1 5/5] mm: multi-gen LRU: use mmu_notifier_test_clear_young()
From: Yu Zhao <hidden>
Date: 2023-02-23 19:38:08
Also in:
kvm, kvmarm, linux-arm-kernel, linux-mm, lkml
On Thu, Feb 23, 2023 at 12:11 PM Sean Christopherson [off-list ref] wrote:
On Thu, Feb 23, 2023, Yu Zhao wrote:
On Thu, Feb 23, 2023 at 10:43 AM Sean Christopherson [off-list ref] wrote:
On Thu, Feb 16, 2023, Yu Zhao wrote:
kswapd (MGLRU before)
    100.00%  balance_pgdat
    100.00%  shrink_node
    100.00%  shrink_one
     99.97%  try_to_shrink_lruvec
     99.06%  evict_folios
     97.41%  shrink_folio_list
     31.33%  folio_referenced
     31.06%  rmap_walk_file
     30.89%  folio_referenced_one
     20.83%  __mmu_notifier_clear_flush_young
     20.54%  kvm_mmu_notifier_clear_flush_young
  => 19.34%  _raw_write_lock

kswapd (MGLRU after)
    100.00%  balance_pgdat
    100.00%  shrink_node
    100.00%  shrink_one
     99.97%  try_to_shrink_lruvec
     99.51%  evict_folios
     71.70%  shrink_folio_list
      7.08%  folio_referenced
      6.78%  rmap_walk_file
      6.72%  folio_referenced_one
      5.60%  lru_gen_look_around
  =>  1.53%  __mmu_notifier_test_clear_young

Do you happen to know how much of the improvement is due to batching, and how much is due to using a walkless walk?

No. I have three benchmarks running at the moment:
1. Windows SQL server guest on x86 host,
2. Apache Spark guest on arm64 host, and
3. Memcached guest on ppc64 host.

If you are really interested in that, I can reprioritize -- I need to stop 1) and use that machine to get the number for you.

After looking at the "MGLRU before" stack again, it's definitely worth getting those numbers. The "before" isn't just taking mmu_lock, it's taking mmu_lock for write _and_ flushing remote TLBs on _every_ PTE.
Correct.
I suspect the batching is a tiny percentage of the overall win (might be larger with RETPOLINE and friends),
Same here.
and that the bulk of the improvement comes from avoiding the insanity of kvm_mmu_notifier_clear_flush_young(). Speaking of which, what would it take to drop mmu_notifier_clear_flush_young() entirely?
That's not my call :) Adding Johannes.
I.e. why can MGLRU tolerate stale information but !MGLRU cannot?
Good question. The native clear API doesn't flush:
int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	/*
	 * On x86 CPUs, clearing the accessed bit without a TLB flush
	 * doesn't cause data corruption. [ It could cause incorrect
	 * page aging and the (mistaken) reclaim of hot pages, but the
	 * chance of that should be relatively low. ]
	 *
	 * So as a performance optimization don't flush the TLB when
	 * clearing the accessed bit, it will eventually be flushed by
	 * a context switch or a VM operation anyway. [ In the rare
	 * event of it not getting flushed for a long time the delay
	 * shouldn't really matter because there's no real memory
	 * pressure for swapout to react to. ]
	 */
	return ptep_test_and_clear_young(vma, address, ptep);
}
If we simply deleted mmu_notifier_clear_flush_young() and used mmu_notifier_clear_young() instead, would anyone notice, let alone care?
I tend to agree.
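To make the "incorrect page aging" caveat in the comment above concrete, here is a minimal userspace sketch. It assumes a deliberately simplified model in which the CPU sets the accessed bit only when it walks the page table on a TLB miss. All `toy_*` names are hypothetical and exist only for this illustration; nothing here is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names exist in the kernel. */
#define TOY_PTE_ACCESSED (UINT64_C(1) << 5)	/* accessed bit position on x86 */

struct toy_mapping {
	uint64_t pte;		/* in-memory page table entry */
	bool tlb_cached;	/* translation currently cached in the TLB */
};

/* A CPU access sets the accessed bit only when it has to walk the page table. */
static void toy_cpu_access(struct toy_mapping *m)
{
	assert(m != NULL);
	if (!m->tlb_cached) {
		m->pte |= TOY_PTE_ACCESSED;
		m->tlb_cached = true;
	}
}

/* Flushless clear: report and clear the accessed bit, leave the TLB alone. */
static bool toy_test_and_clear_young(struct toy_mapping *m)
{
	bool young = (m->pte & TOY_PTE_ACCESSED) != 0;

	m->pte &= ~TOY_PTE_ACCESSED;
	return young;
}

/* Stand-in for the eventual flush (context switch, VM operation, ...). */
static void toy_tlb_flush(struct toy_mapping *m)
{
	m->tlb_cached = false;
}
```

In this model, an access followed by `toy_test_and_clear_young()` reports the page as young. A second access hits the cached translation, so the next clear reports the page as old even though it was just used. Only after `toy_tlb_flush()` does a further access mark it young again, which is the stale window the comment describes as tolerable.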
@@ -5699,6 +5797,9 @@ static ssize_t show_enabled(struct kobject *kobj, struct kobj_attribute *attr, c
 	if (arch_has_hw_nonleaf_pmd_young() && get_cap(LRU_GEN_NONLEAF_YOUNG))
 		caps |= BIT(LRU_GEN_NONLEAF_YOUNG);
+	if (kvm_arch_has_test_clear_young() && get_cap(LRU_GEN_SPTE_WALK))
+		caps |= BIT(LRU_GEN_SPTE_WALK);

As alluded to in patch 1, unless batching the walks even if KVM does _not_ support a lockless walk is somehow _worse_ than using the existing mmu_notifier_clear_flush_young(), I think batching the calls should be conditional only on LRU_GEN_SPTE_WALK. Or if we want to avoid batching when there are no mmu_notifier listeners, probe mmu_notifiers. But don't call into KVM directly.

I'm not sure I fully understand. Let's present the problem on the MM side: assuming KVM supports lockless walks, batching can still be worse (very unlikely), because GFNs can exhibit no memory locality at all. So this option allows userspace to disable batching.

I'm asking the opposite. Is there a scenario where batching+lock is worse than !batching+lock? If not, then don't make batching depend on lockless walks.
Yes, absolutely. batching+lock means we take/release mmu_lock for every single PTE in the entire VA space -- each small batch contains 64 PTEs but the entire batch is the whole KVM.
I fully understand why you don't want MM to call into KVM directly. No acceptable ways to set up a clear interface between MM and KVM other than the MMU notifier?

There are several options I can think of, but before we go spend time designing the best API, I'd rather figure out if we care in the first place.
This is self-serving -- MGLRU would be the only user in the near future. But I never assume there will be no common ground; at least it doesn't hurt to check.
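For reference, the two ways of gating the LRU_GEN_SPTE_WALK capability discussed around the show_enabled() hunk can be contrasted in a small userspace sketch. This is only one reading of the review suggestion. The `toy_*` names and the bit positions are assumptions made for illustration, not values taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; bit positions are assumed, not taken from the patch. */
enum toy_cap {
	TOY_CAP_CORE,
	TOY_CAP_MM_WALK,
	TOY_CAP_NONLEAF_YOUNG,
	TOY_CAP_SPTE_WALK,
};

#define TOY_BIT(n) (1u << (n))

static_assert(TOY_CAP_SPTE_WALK < 32, "capability must fit in the mask");

/* As posted: the bit is reported only if KVM also offers a lockless test/clear. */
static unsigned int toy_caps_as_posted(unsigned int enabled, bool kvm_lockless)
{
	unsigned int caps = 0;

	if (kvm_lockless && (enabled & TOY_BIT(TOY_CAP_SPTE_WALK)))
		caps |= TOY_BIT(TOY_CAP_SPTE_WALK);
	return caps;
}

/* As suggested in review: depend on the knob alone, with no call into KVM. */
static unsigned int toy_caps_as_suggested(unsigned int enabled)
{
	return enabled & TOY_BIT(TOY_CAP_SPTE_WALK);
}
```

With the knob enabled and no lockless support in KVM, the first variant reports the capability as off while the second still reports it as on. That difference is the case the exchange above is arguing about.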