Re: [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path

From: "Michael S. Tsirkin" <mst@redhat.com>
Date: 2020-10-14 06:54:50
Also in: lkml, virtualization

On Tue, Oct 13, 2020 at 04:42:59PM -0700, si-wei liu wrote:

On 10/9/2020 7:27 PM, Jason Wang wrote:

quoted

On 2020/10/3 下午1:02, Si-Wei Liu wrote:

quoted

Pinned pages are not properly accounted particularly when
mapping error occurs on IOTLB update. Clean up dangling
pinned pages for the error path. As the inflight pinned
pages, specifically for memory region that strides across
multiple chunks, would need more than one free page for
book keeping and accounting. For simplicity, pin pages
for all memory in the IOVA range in one go rather than
have multiple pin_user_pages calls to make up the entire
region. This way it's easier to track and account the
pages already mapped, particularly for clean-up in the
error path.

Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend")
Signed-off-by: Si-Wei Liu <redacted>
---
Changes in v3:
- Factor out vhost_vdpa_map() change to a separate patch

Changes in v2:
- Fix incorrect target SHA1 referenced

  drivers/vhost/vdpa.c | 119
++++++++++++++++++++++++++++++---------------------
  1 file changed, 71 insertions(+), 48 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 0f27919..dad41dae 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c

@@ -595,21 +595,19 @@ static int

vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
      struct vhost_dev *dev = &v->vdev;
      struct vhost_iotlb *iotlb = dev->iotlb;
      struct page **page_list;
-    unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
+    struct vm_area_struct **vmas;
      unsigned int gup_flags = FOLL_LONGTERM;
-    unsigned long npages, cur_base, map_pfn, last_pfn = 0;
-    unsigned long locked, lock_limit, pinned, i;
+    unsigned long map_pfn, last_pfn = 0;
+    unsigned long npages, lock_limit;
+    unsigned long i, nmap = 0;
      u64 iova = msg->iova;
+    long pinned;
      int ret = 0;
        if (vhost_iotlb_itree_first(iotlb, msg->iova,
                      msg->iova + msg->size - 1))
          return -EEXIST;
  -    page_list = (struct page **) __get_free_page(GFP_KERNEL);
-    if (!page_list)
-        return -ENOMEM;
-
      if (msg->perm & VHOST_ACCESS_WO)
          gup_flags |= FOLL_WRITE;
  @@ -617,61 +615,86 @@ static int
vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
      if (!npages)
          return -EINVAL;
  +    page_list = kvmalloc_array(npages, sizeof(struct page *),
GFP_KERNEL);
+    vmas = kvmalloc_array(npages, sizeof(struct vm_area_struct *),
+                  GFP_KERNEL);


This will result high order memory allocation which was what the code
tried to avoid originally.

Using an unlimited size will cause a lot of side effects consider VM or
userspace may try to pin several TB of memory.

Hmmm, that's a good point. Indeed, if the guest memory demand is huge or the
host system is running short of free pages, kvmalloc will be problematic and
less efficient than the __get_free_page implementation.

OK so ... Jason, what's the plan?

How about you send a patchset with
1. revert this change
2. fix error handling leak

quoted

+    if (!page_list || !vmas) {
+        ret = -ENOMEM;
+        goto free;
+    }


Any reason that you want to use vmas?

Without providing custom vmas, it's subject to high order allocation
failure. While page_list and vmas can now fallback to virtual memory
allocation if need be.

quoted

+
      mmap_read_lock(dev->mm);
  -    locked = atomic64_add_return(npages, &dev->mm->pinned_vm);
      lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-    if (locked > lock_limit) {
+    if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
          ret = -ENOMEM;
-        goto out;
+        goto unlock;
      }
  -    cur_base = msg->uaddr & PAGE_MASK;
-    iova &= PAGE_MASK;
+    pinned = pin_user_pages(msg->uaddr & PAGE_MASK, npages, gup_flags,
+                page_list, vmas);
+    if (npages != pinned) {
+        if (pinned < 0) {
+            ret = pinned;
+        } else {
+            unpin_user_pages(page_list, pinned);
+            ret = -ENOMEM;
+        }
+        goto unlock;
+    }
  -    while (npages) {
-        pinned = min_t(unsigned long, npages, list_size);
-        ret = pin_user_pages(cur_base, pinned,
-                     gup_flags, page_list, NULL);
-        if (ret != pinned)
-            goto out;
-
-        if (!last_pfn)
-            map_pfn = page_to_pfn(page_list[0]);
-
-        for (i = 0; i < ret; i++) {
-            unsigned long this_pfn = page_to_pfn(page_list[i]);
-            u64 csize;
-
-            if (last_pfn && (this_pfn != last_pfn + 1)) {
-                /* Pin a contiguous chunk of memory */
-                csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
-                if (vhost_vdpa_map(v, iova, csize,
-                           map_pfn << PAGE_SHIFT,
-                           msg->perm))
-                    goto out;
-                map_pfn = this_pfn;
-                iova += csize;
+    iova &= PAGE_MASK;
+    map_pfn = page_to_pfn(page_list[0]);
+
+    /* One more iteration to avoid extra vdpa_map() call out of
loop. */
+    for (i = 0; i <= npages; i++) {
+        unsigned long this_pfn;
+        u64 csize;
+
+        /* The last chunk may have no valid PFN next to it */
+        this_pfn = i < npages ? page_to_pfn(page_list[i]) : -1UL;
+
+        if (last_pfn && (this_pfn == -1UL ||
+                 this_pfn != last_pfn + 1)) {
+            /* Pin a contiguous chunk of memory */
+            csize = last_pfn - map_pfn + 1;
+            ret = vhost_vdpa_map(v, iova, csize << PAGE_SHIFT,
+                         map_pfn << PAGE_SHIFT,
+                         msg->perm);
+            if (ret) {
+                /*
+                 * Unpin the rest chunks of memory on the
+                 * flight with no corresponding vdpa_map()
+                 * calls having been made yet. On the other
+                 * hand, vdpa_unmap() in the failure path
+                 * is in charge of accounting the number of
+                 * pinned pages for its own.
+                 * This asymmetrical pattern of accounting
+                 * is for efficiency to pin all pages at
+                 * once, while there is no other callsite
+                 * of vdpa_map() than here above.
+                 */
+                unpin_user_pages(&page_list[nmap],
+                         npages - nmap);
+                goto out;
              }
-
-            last_pfn = this_pfn;
+            atomic64_add(csize, &dev->mm->pinned_vm);
+            nmap += csize;
+            iova += csize << PAGE_SHIFT;
+            map_pfn = this_pfn;
          }
-
-        cur_base += ret << PAGE_SHIFT;
-        npages -= ret;
+        last_pfn = this_pfn;
      }


So what I suggest is to fix the pinning leakage first and do the
possible optimization on top (which is still questionable to me).

OK. Unfortunately, this was picked and got merged in upstream. So I will
post a follow up patch set to 1) revert the commit to the original
__get_free_page() implementation, and 2) fix the accounting and leakage on
top. Will it be fine?


-Siwei

quoted

Thanks

quoted

  -    /* Pin the rest chunk */
-    ret = vhost_vdpa_map(v, iova, (last_pfn - map_pfn + 1) <<
PAGE_SHIFT,
-                 map_pfn << PAGE_SHIFT, msg->perm);
+    WARN_ON(nmap != npages);
  out:
-    if (ret) {
+    if (ret)
          vhost_vdpa_unmap(v, msg->iova, msg->size);
-        atomic64_sub(npages, &dev->mm->pinned_vm);
-    }
+unlock:
      mmap_read_unlock(dev->mm);
-    free_page((unsigned long)page_list);
+free:
+    kvfree(vmas);
+    kvfree(page_list);
      return ret;
  }

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help