Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost

From: Bruce Richardson <hidden>
Date: 2021-07-07 15:00:19

On Wed, Jul 07, 2021 at 03:33:44PM +0100, Burakov, Anatoly wrote:

On 07-Jul-21 1:54 PM, Ding, Xuan wrote:

quoted

Hi Anatoly,

quoted

-----Original Message-----
From: Burakov, Anatoly <redacted>
Sent: Wednesday, July 7, 2021 8:18 PM
To: Ding, Xuan <redacted>; Maxime Coquelin
[off-list ref]; Xia, Chenbo [off-list ref];
Thomas Monjalon [off-list ref]; David Marchand
[off-list ref]
Cc: dev@dpdk.org; Hu, Jiayu <redacted>; Pai G, Sunil
[off-list ref]; Richardson, Bruce [off-list ref]; Van
Haaren, Harry [off-list ref]; Liu, Yong [off-list ref];
Ma, WenwuX [off-list ref]
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost

On 07-Jul-21 7:25 AM, Ding, Xuan wrote:

quoted

Hi,

quoted

-----Original Message-----
From: Maxime Coquelin <redacted>
Sent: Tuesday, July 6, 2021 5:32 PM
To: Burakov, Anatoly <redacted>; Ding, Xuan
[off-list ref]; Xia, Chenbo [off-list ref]; Thomas
Monjalon [off-list ref]; David Marchand
[off-list ref]
Cc: dev@dpdk.org; Hu, Jiayu <redacted>; Pai G, Sunil
[off-list ref]; Richardson, Bruce [off-list ref];

Van

quoted

Haaren, Harry [off-list ref]; Liu, Yong

[off-list ref];

quoted

Ma, WenwuX [off-list ref]
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost



On 7/6/21 11:16 AM, Burakov, Anatoly wrote:

quoted

On 06-Jul-21 9:31 AM, Ding, Xuan wrote:

quoted

Hi Maxime,

quoted

-----Original Message-----
From: Maxime Coquelin <redacted>
Sent: Monday, July 5, 2021 8:46 PM
To: Burakov, Anatoly <redacted>; Ding, Xuan
[off-list ref]; Xia, Chenbo [off-list ref]; Thomas
Monjalon [off-list ref]; David Marchand
[off-list ref]
Cc: dev@dpdk.org; Hu, Jiayu <redacted>; Pai G, Sunil
[off-list ref]; Richardson, Bruce

[off-list ref];

quoted

Van Haaren, Harry [off-list ref]; Liu, Yong
[off-list ref]; Ma, WenwuX [off-list ref]
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async
vhost



On 7/5/21 2:16 PM, Burakov, Anatoly wrote:

quoted

On 05-Jul-21 9:40 AM, Xuan Ding wrote:

quoted

The use of IOMMU has many advantages, such as isolation and address
translation. This patch extends the capbility of DMA engine to use
IOMMU if the DMA device is bound to vfio.

When set memory table, the guest memory will be mapped
into the default container of DPDK.

Signed-off-by: Xuan Ding <redacted>
---
     doc/guides/prog_guide/vhost_lib.rst |  9 ++++++
     lib/vhost/rte_vhost.h               |  1 +
     lib/vhost/socket.c                  |  9 ++++++
     lib/vhost/vhost.h                   |  1 +
     lib/vhost/vhost_user.c              | 46
++++++++++++++++++++++++++++-
     5 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/vhost_lib.rst

b/doc/guides/prog_guide/vhost_lib.rst
index 05c42c9b11..c3beda23d9 100644

--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst

@@ -118,6 +118,15 @@ The following is an overview of some key

Vhost

quoted

API functions:
           It is disabled by default.
     +  - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
+
+    In asynchronous data path, vhost liarary is not aware of which
driver
+    (igb_uio/vfio) the DMA device is bound to. Application should
pass
+    this flag to tell vhost library whether IOMMU should be
programmed
+    for guest memory.
+
+    It is disabled by default.
+
       - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``
           Since v16.04, the vhost library forwards checksum and gso
requests for

diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
index 8d875e9322..a766ea7b6b 100644
--- a/lib/vhost/rte_vhost.h
+++ b/lib/vhost/rte_vhost.h

@@ -37,6 +37,7 @@ extern "C" {
     #define RTE_VHOST_USER_LINEARBUF_SUPPORT    (1ULL << 6)
     #define RTE_VHOST_USER_ASYNC_COPY    (1ULL << 7)
     #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS    (1ULL <<

8)

quoted

+#define RTE_VHOST_USER_ASYNC_USE_VFIO    (1ULL << 9)
       /* Features. */
     #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE

diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
index 5d0d728d52..77c722c86b 100644
--- a/lib/vhost/socket.c
+++ b/lib/vhost/socket.c

@@ -42,6 +42,7 @@ struct vhost_user_socket {
         bool extbuf;
         bool linearbuf;
         bool async_copy;
+    bool async_use_vfio;
         bool net_compliant_ol_flags;
           /*

@@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct

vhost_user_socket *vsocket)
                 dev->async_copy = 1;
         }
     +    if (vsocket->async_use_vfio) {
+        dev = get_device(vid);
+
+        if (dev)
+            dev->async_use_vfio = 1;
+    }
+
         VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);
           if (vsocket->notify_ops->new_connection) {

@@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path,

uint64_t flags)
         vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
         vsocket->linearbuf = flags &

RTE_VHOST_USER_LINEARBUF_SUPPORT;

quoted

         vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
+    vsocket->async_use_vfio = flags &

RTE_VHOST_USER_ASYNC_USE_VFIO;

quoted

         vsocket->net_compliant_ol_flags = flags &
RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;
           if (vsocket->async_copy &&

diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
index 8078ddff79..fb775ce4ed 100644
--- a/lib/vhost/vhost.h
+++ b/lib/vhost/vhost.h

@@ -370,6 +370,7 @@ struct virtio_net {
         int16_t            broadcast_rarp;
         uint32_t        nr_vring;
         int            async_copy;
+    int            async_use_vfio;
         int            extbuf;
         int            linearbuf;
         struct vhost_virtqueue    *virtqueue[VHOST_MAX_QUEUE_PAIRS *

2];

diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
index 8f0eba6412..f3703f2e72 100644
--- a/lib/vhost/vhost_user.c
+++ b/lib/vhost/vhost_user.c

@@ -45,6 +45,7 @@
     #include <rte_common.h>
     #include <rte_malloc.h>
     #include <rte_log.h>
+#include <rte_vfio.h>
       #include "iotlb.h"
     #include "vhost.h"

@@ -141,6 +142,36 @@ get_blk_size(int fd)
         return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
     }
     +static int
+async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
+{
+    int ret = 0;
+    uint64_t host_iova;
+    host_iova = rte_mem_virt2iova((void

*)(uintptr_t)region->host_user_addr);
+    if (do_map) {
+        /* Add mapped region into the default container of DPDK. */
+        ret =

rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,

quoted

+                         region->host_user_addr,
+                         host_iova,
+                         region->size);
+        if (ret) {
+            VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
+            return ret;
+        }
+    } else {
+        /* Remove mapped region from the default container of
DPDK. */
+        ret =
rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+                           region->host_user_addr,
+                           host_iova,
+                           region->size);
+        if (ret) {
+            VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
+            return ret;
+        }
+    }
+    return ret;
+}

We've been discussing this off list with Xuan, and unfortunately
this is
a blocker for now.

Currently, the x86 IOMMU does not support partial unmap - the

segments

quoted

quoted
quoted
quoted
quoted
quoted
have to be unmapped exactly the same addr/len as they were mapped.

We

quoted

also concatenate adjacent mappings to prevent filling up the DMA
mapping
entry table with superfluous entries.

This means that, when two unrelated mappings are contiguous in

memory

quoted

(e.g. if you map regions 1 and 2 independently, but they happen to be
sitting right next to each other in virtual memory), we cannot later
unmap one of them because, even though these are two separate

mappings

quoted

as far as kernel VFIO infrastructure is concerned, the mapping gets
compacted and looks like one single mapping to VFIO, so DPDK API will
not let us unmap region 1 without also unmapping region 2.

The proper fix for this problem would be to always map memory
page-by-page regardless of where it comes from (we already do that for
internal memory, but not for external). However, the reason this works
for internal memory is because when mapping internal memory

segments,

quoted

quoted
quoted
quoted
quoted
quoted
*we know the page size*. For external memory segments, there is no

such

quoted

guarantee, so we cannot deduce page size for a given memory segment,

and

quoted

thus can't map things page-by-page.

So, the proper fix for it would be to add page size to the VFIO DMA
API.
Unfortunately, it probably has to wait until 21.11 because it is an API
change.

The slightly hacky fix for this would be to forego user mem map
concatenation and trust that user is not going to do anything stupid,
and will not spam the VFIO DMA API without reason. I would rather
not go
down this road, but this could be an option in this case.

Thoughts?

Thanks Anatoly for the detailed description of the issue.
It may be possible to either create a versioned symbol for this API
change, or maybe even to have a temporary internal API.

But I think this series in its current form is not acceptable, so
waiting for v21.11 would be the best option (we may want to send the
deprecation notice in this release though).

In this series, I don't like the user application has to pass a flag to
state whether the DMA engine uses VFIO or not. AFAICT, this new revision
does not implement what was discussed in the previous one, i.e.
supporting both IOVA_AS_VA and IOVA_AS_PA.

Thanks for your comments. Here I hope to explain some questions:
1. Whether both IOVA_AS_VA and IOVA_AS_PA are supported now?
A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this version,

the

quoted

virtual address is replaced with iova address of mapped region, and
the iova
address is selected to program the IOMMU instead of virtual address only.

Good!

quoted

2. Why a flag is chosen to be passed by application?
A: Yes, as we discussed before, the rte_eal_iova_mode() API can be
used to
get the IOVA mode, so as to determine whether IOMMU should be

programmed.

quoted

However, in the implementation process, I found a problem. That is how to
distinguish the VFIO PA and IGB_UIO PA. Because for VFIO cases, we

should

quoted

always program the IOMMU. While in IGB_UIO cases, it depends on

IOMMU

quoted

capability of platform.

How does one program IOMMU with igb_uio? I was under impression that
igb_uio (and uio_pci_generic for that matter) does not provide such
facilities.

+1

Maybe some misunderstanding in this sentence here.
In our design, if rte_eal_vfio_is_enabled("vfio") is true, iommu will be

programmed.

quoted

True means vfio module is modprobed.

But there is an exception here, that is, even if vfio module is modprobed,
DPDK user still bind all the devices to igb_uio.

This situation can be distinguished in DPDK eal initialization, because the

resource mapping

quoted

is according to the driver loaded by each device(rte_pci_map_device).

But in our scenario, this judgment is somewhat weak. Because we cannot get
the device driver info in vhost library. I also think it is unreasonable for vhost to
do this. Only trust that users will not use it like this. Thoughts for this scenario?

I don't see how igb_uio would make any difference at all. If you are
using igb_uio, you *don't have DMA mapping at all* and will use raw
physical addresses. Assuming your code supports this, that's all you're
ever going to get. The point of VFIO is to have memory regions that are
mapped for DMA *because real physical addresses are assumed to be not
available*. When you're using igb_uio, you effectively do have DMA
access to the entire memory, and thus can bypass IOMMU altogether
(assuming you're using passthrough mode).

My concern is exactly here.
In igb_uio cases, although devices are not added to the default container in eal init,
but the "IOMMU programming" actually happens when the rte_vfio_container_dma_map() is called.
It is no harm but it is also unnecessary.

Yes, it is unnecessary, but it's also not actively harmful, which means you
can still do it without any regard as to whether you do or don't have IOMMU
:)

Think of a hybrid VFIO/igb_uio setup - some NICs will have VFIO, some will
have igb_uio. The igb_uio-bound NICs will not care if you have mapped
anything for DMA because they don't go through IOMMU, things will "just
work". The VFIO-bound NICs will get the memory mapped, because they are the
ones who actually need the DMA mapping.

Do we even support a hybrid setup. I would have thought that was just
asking for problems and should be considered an unsupported configuration?

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help