Re: [PATCH] net/virtio-user: fix memory hotplug support

From: Sean Harte <hidden>
Date: 2018-08-23 15:33:30

On Thu, 23 Aug 2018 at 15:01, Burakov, Anatoly
[off-list ref] wrote:

On 23-Aug-18 12:19 PM, Sean Harte wrote:

quoted

On Thu, 23 Aug 2018 at 10:05, Burakov, Anatoly
[off-list ref] wrote:

quoted

On 23-Aug-18 3:57 AM, Tiwei Bie wrote:

quoted

Deadlock can occur when allocating memory if a vhost-kernel
based virtio-user device is in use. Besides, it's possible
to have much more than 64 non-contiguous hugepage backed
memory regions due to the memory hotplug, which may cause
problems when handling VHOST_SET_MEM_TABLE request. A better
solution is to have the virtio-user pass all the VA ranges
reserved by DPDK to vhost-kernel.

Bugzilla ID: 81
Fixes: 12ecb2f63b12 ("net/virtio-user: support memory hotplug")
Cc: stable@dpdk.org

Reported-by: Seán Harte <redacted>
Signed-off-by: Tiwei Bie <redacted>
---
   drivers/net/virtio/virtio_user/vhost_kernel.c | 64 ++++++++-----------
   1 file changed, 27 insertions(+), 37 deletions(-)

diff --git a/drivers/net/virtio/virtio_user/vhost_kernel.c b/drivers/net/virtio/virtio_user/vhost_kernel.c
index b2444096c..49bd1b821 100644
--- a/drivers/net/virtio/virtio_user/vhost_kernel.c
+++ b/drivers/net/virtio/virtio_user/vhost_kernel.c

@@ -70,41 +70,12 @@ static uint64_t vhost_req_user_to_kernel[] = {
       [VHOST_USER_SET_MEM_TABLE] = VHOST_SET_MEM_TABLE,
   };

-struct walk_arg {
-     struct vhost_memory_kernel *vm;
-     uint32_t region_nr;
-};
-static int
-add_memory_region(const struct rte_memseg_list *msl __rte_unused,
-             const struct rte_memseg *ms, size_t len, void *arg)
-{
-     struct walk_arg *wa = arg;
-     struct vhost_memory_region *mr;
-     void *start_addr;
-
-     if (wa->region_nr >= max_regions)
-             return -1;
-
-     mr = &wa->vm->regions[wa->region_nr++];
-     start_addr = ms->addr;
-
-     mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr;
-     mr->userspace_addr = (uint64_t)(uintptr_t)start_addr;
-     mr->memory_size = len;
-     mr->mmap_offset = 0;
-
-     return 0;
-}
-
-/* By default, vhost kernel module allows 64 regions, but DPDK allows
- * 256 segments. As a relief, below function merges those virtually
- * adjacent memsegs into one region.
- */
   static struct vhost_memory_kernel *
   prepare_vhost_memory_kernel(void)
   {
+     struct rte_mem_config *mcfg = rte_eal_get_configuration()->mem_config;
       struct vhost_memory_kernel *vm;
-     struct walk_arg wa;
+     uint32_t region_nr = 0, i;

       vm = malloc(sizeof(struct vhost_memory_kernel) +
                       max_regions *

@@ -112,15 +83,34 @@ prepare_vhost_memory_kernel(void)
       if (!vm)
               return NULL;

-     wa.region_nr = 0;
-     wa.vm = vm;
+     for (i = 0; i < RTE_MAX_MEMSEG_LISTS; i++) {
+             struct rte_memseg_list *msl = &mcfg->memsegs[i];
+             struct vhost_memory_region *mr;
+             void *start_addr;
+             uint64_t len;

There is a rte_memseg_list_walk() - please do not iterate over memseg
lists manually.

rte_memseg_list_walk() can't be used here because
prepare_vhost_memory_kernel() is sometimes called from a memory
callback. It will then hang trying to get a read lock on
memory_hotplug_lock.

OK, so use rte_memseg_list_walk_thread_unsafe().

quoted

I don't think the rte_memseg_list_walk_thread_unsafe() function is
appropriate because prepare_vhost_memory_kernel() may not always be
called from a memory callback.

And how is this different? What you're doing here is identical to
calling rte_memseg_list_walk_thread_unsafe() (that's precisely what it
does internally - check the code!), except that you're doing it manually
and not using DPDK API, which makes your code dependent on internals of
DPDK's memory implementation.

So, this function may or may not be called from a callback, but you're
using thread-unsafe walk anyway. I think you should call either
thread-safe or thread-unsafe version depending on whether you're being
called from a callback or not.

I'm not the patch author :-) Switching to the _thread_unsafe()
function might convert the easy-to-debug deadlock issue into a
tricky-to-debug race condition. Picking between the two
implementations depending on the calling context would work. Another
approach might be a different type of locking system that allows
reading when the same thread already holds the write lock.

quoted

-     if (rte_memseg_contig_walk(add_memory_region, &wa) < 0) {
-             free(vm);
-             return NULL;
+             start_addr = msl->base_va;
+             len = msl->page_sz * msl->memseg_arr.len;
+
+             if (start_addr == NULL || len == 0)
+                     continue;
+
+             if (region_nr >= max_regions) {
+                     free(vm);
+                     return NULL;
+             }
+
+             mr = &vm->regions[region_nr++];
+             mr->guest_phys_addr = (uint64_t)(uintptr_t)start_addr;
+             mr->userspace_addr = (uint64_t)(uintptr_t)start_addr;
+             mr->memory_size = len;
+             mr->mmap_offset = 0; /* flags_padding */
+
+             PMD_DRV_LOG(DEBUG, "index=%u, addr=%p len=%" PRIu64,
+                             i, start_addr, len);
       }

-     vm->nregions = wa.region_nr;
+     vm->nregions = region_nr;
       vm->padding = 0;
       return vm;
   }


--
Thanks,
Anatoly


--
Thanks,
Anatoly

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help