[PATCH 1/2] kvm: Fix mmu_notifier release race

From: Suzuki.Poulose@arm.com (Suzuki K Poulose)
Date: 2017-05-03 13:13:11
Also in: kvm, kvmarm, lkml

On 28/04/17 18:20, Suzuki K Poulose wrote:

On 26/04/17 17:03, Suzuki K Poulose wrote:

On 25/04/17 19:49, Radim Krčmář wrote:

2017-04-24 11:10+0100, Suzuki K Poulose:

KVM uses mmu_notifier (wherever available) to keep track
of changes to the mm of the guest. The guest shadow page
tables are released when the VM exits via mmu_notifier->ops.release().
There is a rare chance that mmu_notifier->release could be
called more than once via two different paths, which could end
up in a use-after-free of the kvm instance (such as [0]).

e.g.:

thread A                                        thread B
-------                                         --------------

 get_signal->                                   kvm_destroy_vm()->
 do_exit->                                        mmu_notifier_unregister->
 exit_mm->                                        kvm_arch_flush_shadow_all()->
 exit_mmap->                                      spin_lock(&kvm->mmu_lock)
 mmu_notifier_release->                           ....
  kvm_arch_flush_shadow_all()->                   .....
  ... spin_lock(&kvm->mmu_lock)                   .....
                                                  spin_unlock(&kvm->mmu_lock)
                                                kvm_arch_free_kvm()
   *** use after free of kvm ***

I don't understand this race ...
A piece of code in mmu_notifier_unregister() says:

      /*
       * Wait for any running method to finish, of course including
       * ->release if it was run by mmu_notifier_release instead of us.
       */
      synchronize_srcu(&srcu);

and code before that removes the notifier from the list, so it cannot be
called after we pass this point.  mmu_notifier_release() does roughly
the same and explains it as:

      /*
       * synchronize_srcu here prevents mmu_notifier_release from returning to
       * exit_mmap (which would proceed with freeing all pages in the mm)
       * until the ->release method returns, if it was invoked by
       * mmu_notifier_unregister.
       *
       * The mmu_notifier_mm can't go away from under us because one mm_count
       * is held by exit_mmap.
       */
      synchronize_srcu(&srcu);

The call of mmu_notifier->release is protected by srcu in both cases and
while it seems possible that mmu_notifier->release would be called
twice, I don't see a combination that could result in use-after-free
from mmu_notifier_release after mmu_notifier_unregister() has returned.

Thanks for bringing it up. Even I am wondering why this is triggered! (But it
does get triggered for sure!!)

The only difference I can spot between the _unregister and _release paths is the
way we use srcu_read_lock across the deletion of the entry from the list.

In mmu_notifier_unregister() we do:

                id = srcu_read_lock(&srcu);
                /*
                 * exit_mmap will block in mmu_notifier_release to guarantee
                 * that ->release is called before freeing the pages.
                 */
                if (mn->ops->release)
                        mn->ops->release(mn, mm);
                srcu_read_unlock(&srcu, id);

## Releases the srcu lock here and then goes on to grab the spin_lock.

                spin_lock(&mm->mmu_notifier_mm->lock);
                /*
                 * Can not use list_del_rcu() since __mmu_notifier_release
                 * can delete it before we hold the lock.
                 */
                hlist_del_init_rcu(&mn->hlist);
                spin_unlock(&mm->mmu_notifier_mm->lock);

While in mmu_notifier_release() we hold it until the node(s) are deleted from the
list:
        /*
         * SRCU here will block mmu_notifier_unregister until
         * ->release returns.
         */
        id = srcu_read_lock(&srcu);
        hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
                /*
                 * If ->release runs before mmu_notifier_unregister it must be
                 * handled, as it's the only way for the driver to flush all
                 * existing sptes and stop the driver from establishing any more
                 * sptes before all the pages in the mm are freed.
                 */
                if (mn->ops->release)
                        mn->ops->release(mn, mm);

        spin_lock(&mm->mmu_notifier_mm->lock);
        while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
                mn = hlist_entry(mm->mmu_notifier_mm->list.first,
                                 struct mmu_notifier,
                                 hlist);
                /*
                 * We arrived before mmu_notifier_unregister so
                 * mmu_notifier_unregister will do nothing other than to wait
                 * for ->release to finish and for mmu_notifier_unregister to
                 * return.
                 */
                hlist_del_init_rcu(&mn->hlist);
        }
        spin_unlock(&mm->mmu_notifier_mm->lock);
        srcu_read_unlock(&srcu, id);

## The lock is released only after the deletion of the node.

Both are followed by a synchronize_srcu(). Now, I am wondering if the unregister path
could potentially miss the SRCU read lock held in the _release() path and go on to finish
the synchronize_srcu() before the item is deleted? Maybe we should do the read_unlock
after the deletion of the node in _unregister (like we do in _release())?
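
To make the difference in ordering concrete, here is a minimal userspace sketch
of the two sequences. It is not kernel code and not a tested patch: the model_*
helpers are hypothetical stand-ins that only record the order in which the
primitives named above would be called. unregister_current() mirrors the
snippet quoted from mmu_notifier_unregister(), and unregister_proposed() shows
the suggested variant, where the SRCU read side is held until the node has been
deleted.

```c
#include <assert.h>
#include <string.h>

/*
 * Userspace model only: these helpers are hypothetical stand-ins for the
 * kernel primitives named in the thread. Each one appends its name to a
 * trace so the call ordering can be inspected; none of them locks anything.
 */
static char trace[256];

static void record(const char *event)
{
	/* Room is needed for the event, the ';' separator and the NUL. */
	assert(strlen(trace) + strlen(event) + 2 <= sizeof(trace));
	strcat(trace, event);
	strcat(trace, ";");
}

static int model_srcu_read_lock(void) { record("srcu_read_lock"); return 0; }
static void model_srcu_read_unlock(int id) { (void)id; record("srcu_read_unlock"); }
static void model_release(void) { record("release"); }
static void model_spin_lock(void) { record("spin_lock"); }
static void model_spin_unlock(void) { record("spin_unlock"); }
static void model_hlist_del_init_rcu(void) { record("hlist_del_init_rcu"); }
static void model_synchronize_srcu(void) { record("synchronize_srcu"); }

/* Ordering in mmu_notifier_unregister() as quoted in the thread. */
const char *unregister_current(void)
{
	int id;

	trace[0] = '\0';
	id = model_srcu_read_lock();
	model_release();
	model_srcu_read_unlock(id);	/* read side ends before the deletion */
	model_spin_lock();
	model_hlist_del_init_rcu();
	model_spin_unlock();
	model_synchronize_srcu();
	return trace;
}

/*
 * Suggested ordering (an assumption, not a merged change): keep the SRCU
 * read side held across the deletion, as the _release() path does.
 */
const char *unregister_proposed(void)
{
	int id;

	trace[0] = '\0';
	id = model_srcu_read_lock();
	model_release();
	model_spin_lock();
	model_hlist_del_init_rcu();
	model_spin_unlock();
	model_srcu_read_unlock(id);	/* read side ends after the deletion */
	model_synchronize_srcu();
	return trace;
}
```

The only difference between the two traces is where srcu_read_unlock appears
relative to the hlist_del_init_rcu step. Whether that reordering actually
closes a race is exactly the open question here; the model shows the ordering
and nothing more.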

I haven't been able to reproduce the mmu_notifier race condition, which leads to KVM
free, reported at [1]. I will leave it running (with tracepoints/ftrace) over the
weekend.

I couldn't reproduce the proposed "mmu_notifier race" reported in [0].
However, I found some other use-after-free cases in the unmap_stage2_range()
code due to the introduction of cond_resched_lock(). It may be just that the
IP reported in [0] was for the wrong line of code, i.e., arch_spin_is_locked
instead of unmap_stage2_range?
Anyway, I will send a new version of the patches in a separate series.

[0] https://marc.info/?l=linux-kernel&m=149201399018791&w=2

Suzuki
