Thread (44 messages) 44 messages, 4 authors, 2021-03-31

Re: [PATCH 10/18] KVM: Move x86's MMU notifier memslot walkers to generic code

From: Sean Christopherson <seanjc@google.com>
Date: 2021-03-31 16:21:31
Also in: kvm, kvmarm, linux-arm-kernel, lkml

On Wed, Mar 31, 2021, Paolo Bonzini wrote:
On 26/03/21 03:19, Sean Christopherson wrote:
quoted
+#ifdef KVM_ARCH_WANT_NEW_MMU_NOTIFIER_APIS
+	kvm_handle_hva_range(mn, address, address + 1, pte, kvm_set_spte_gfn);
+#else
  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
  	int idx;
 	trace_kvm_set_spte_hva(address);
	idx = srcu_read_lock(&kvm->srcu);

	KVM_MMU_LOCK(kvm);

	kvm->mmu_notifier_seq++;

	if (kvm_set_spte_hva(kvm, address, pte))
		kvm_flush_remote_tlbs(kvm);

  	KVM_MMU_UNLOCK(kvm);
  	srcu_read_unlock(&kvm->srcu, idx);
+#endif
The kvm->mmu_notifier_seq is missing in the new API side.  I guess you can
add an argument to __kvm_handle_hva_range and handle it also in patch 15
("KVM: Take mmu_lock when handling MMU notifier iff the hva hits a
memslot").
Yikes.  Superb eyes!

That does bring up an oddity I discovered when digging into this.  Every call
to .change_pte() is bookended by .invalidate_range_{start,end}(), i.e. the above
missing kvm->mmu_notifier_seq++ is benign because kvm->mmu_notifier_count is
guaranteed to be non-zero.

I'm also fairly certain it means kvm_set_spte_gfn() is effectively dead code on
_all_ architectures.  x86 and MIPS are clearcut nops if the old SPTE is
not-present, and that's guaranteed due to the prior invalidation.  PPC simply
unmaps the SPTE, which again should be a nop due to the invalidation.  arm64 is
a bit murky, but if I'm reading the code correctly, it's also a nop because
kvm_pgtable_stage2_map() is called without a cache pointer, which I think means
it will map an entry if and only if an existing PTE was found.

I haven't actually tested the above analysis, e.g. by asserting that
kvm->mmu_notifier_count is indeed non-zero.  I'll do that sooner than later.
But, given the shortlog of commit:

  6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify with invalidate_range_start
                 and invalidate_range_end")

I'm fairly confident my analysis is correct.  And if so, it also means that the
whole point of adding .change_pte() in the first place (for KSM, commit
828502d30073, "ksm: add mmu_notifier set_pte_at_notify()"), has since been lost.

When it was originally added, .change_pte() was a pure alternative to
invalidating the entry.

  void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
                               pte_t pte)
  {
        struct mmu_notifier *mn;
        struct hlist_node *n;

        rcu_read_lock();
        hlist_for_each_entry_rcu(mn, n, &mm->mmu_notifier_mm->list, hlist) {
                if (mn->ops->change_pte)
                        mn->ops->change_pte(mn, mm, address, pte);
                /*
                 * Some drivers don't have change_pte,
                 * so we must call invalidate_page in that case.
                 */
                else if (mn->ops->invalidate_page)
                        mn->ops->invalidate_page(mn, mm, address);
        }
        rcu_read_unlock();
  }

The aforementioned commit 6bdb913f0a70 wrapped set_pte_at_notify() with
invalidate_range_{start,end}() so that .invalidate_page() implementations could
sleep.  But, no one noticed that in doing so, .change_pte() was completely
neutered.

Assuming all of the above is correct, I'm very tempted to rip out .change_pte()
entirely.  It's been dead weight for 8+ years and no one has complained about
KSM+KVM performance (I'd also be curious to know how much performance was gained
by shaving VM-Exits).  As KVM is the only user of .change_pte(), dropping it in
KVM would mean the entire MMU notifier could also go away.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help