Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost
From: Ding, Xuan <hidden>
Date: 2021-07-07 15:09:15
-----Original Message-----
From: Burakov, Anatoly <redacted>
Sent: Wednesday, July 7, 2021 10:34 PM
To: Ding, Xuan <redacted>; Maxime Coquelin [off-list ref]; Xia, Chenbo [off-list ref]; Thomas Monjalon [off-list ref]; David Marchand [off-list ref]
Cc: dev@dpdk.org; Hu, Jiayu <redacted>; Pai G, Sunil [off-list ref]; Richardson, Bruce [off-list ref]; Van Haaren, Harry [off-list ref]; Liu, Yong [off-list ref]; Ma, WenwuX [off-list ref]
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost

On 07-Jul-21 1:54 PM, Ding, Xuan wrote:

Hi Anatoly,
-----Original Message-----
From: Burakov, Anatoly <redacted>
Sent: Wednesday, July 7, 2021 8:18 PM
To: Ding, Xuan <redacted>; Maxime Coquelin [off-list ref]; Xia, Chenbo [off-list ref]; Thomas Monjalon [off-list ref]; David Marchand [off-list ref]
Cc: dev@dpdk.org; Hu, Jiayu <redacted>; Pai G, Sunil [off-list ref]; Richardson, Bruce [off-list ref]; Van Haaren, Harry [off-list ref]; Liu, Yong [off-list ref]; Ma, WenwuX [off-list ref]
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost

On 07-Jul-21 7:25 AM, Ding, Xuan wrote:

Hi,
-----Original Message-----
From: Maxime Coquelin <redacted>
Sent: Tuesday, July 6, 2021 5:32 PM
To: Burakov, Anatoly <redacted>; Ding, Xuan [off-list ref]; Xia, Chenbo [off-list ref]; Thomas Monjalon [off-list ref]; David Marchand [off-list ref]
Cc: dev@dpdk.org; Hu, Jiayu <redacted>; Pai G, Sunil [off-list ref]; Richardson, Bruce [off-list ref]; Van Haaren, Harry [off-list ref]; Liu, Yong [off-list ref]; Ma, WenwuX [off-list ref]
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost

On 7/6/21 11:16 AM, Burakov, Anatoly wrote:
On 06-Jul-21 9:31 AM, Ding, Xuan wrote:

Hi Maxime,
-----Original Message-----
From: Maxime Coquelin <redacted>
Sent: Monday, July 5, 2021 8:46 PM
To: Burakov, Anatoly <redacted>; Ding, Xuan [off-list ref]; Xia, Chenbo [off-list ref]; Thomas Monjalon [off-list ref]; David Marchand [off-list ref]
Cc: dev@dpdk.org; Hu, Jiayu <redacted>; Pai G, Sunil [off-list ref]; Richardson, Bruce [off-list ref]; Van Haaren, Harry [off-list ref]; Liu, Yong [off-list ref]; Ma, WenwuX [off-list ref]
Subject: Re: [dpdk-dev] [PATCH v5 1/2] vhost: enable IOMMU for async vhost

On 7/5/21 2:16 PM, Burakov, Anatoly wrote:
On 05-Jul-21 9:40 AM, Xuan Ding wrote:
The use of IOMMU has many advantages, such as isolation and address
translation. This patch extends the capability of DMA engine to use
IOMMU if the DMA device is bound to vfio.

When set memory table, the guest memory will be mapped
into the default container of DPDK.

Signed-off-by: Xuan Ding <redacted>
---
 doc/guides/prog_guide/vhost_lib.rst |  9 ++++++
 lib/vhost/rte_vhost.h               |  1 +
 lib/vhost/socket.c                  |  9 ++++++
 lib/vhost/vhost.h                   |  1 +
 lib/vhost/vhost_user.c              | 46 ++++++++++++++++++++++++++++-
 5 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/vhost_lib.rst b/doc/guides/prog_guide/vhost_lib.rst
index 05c42c9b11..c3beda23d9 100644
--- a/doc/guides/prog_guide/vhost_lib.rst
+++ b/doc/guides/prog_guide/vhost_lib.rst
@@ -118,6 +118,15 @@ The following is an overview of some key Vhost API functions:

     It is disabled by default.

+  - ``RTE_VHOST_USER_ASYNC_USE_VFIO``
+
+    In asynchronous data path, vhost liarary is not aware of which driver
+    (igb_uio/vfio) the DMA device is bound to. Application should pass
+    this flag to tell vhost library whether IOMMU should be programmed
+    for guest memory.
+
+    It is disabled by default.
+
   - ``RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS``

     Since v16.04, the vhost library forwards checksum and gso requests for
diff --git a/lib/vhost/rte_vhost.h b/lib/vhost/rte_vhost.h
index 8d875e9322..a766ea7b6b 100644
--- a/lib/vhost/rte_vhost.h
+++ b/lib/vhost/rte_vhost.h
@@ -37,6 +37,7 @@ extern "C" {
 #define RTE_VHOST_USER_LINEARBUF_SUPPORT      (1ULL << 6)
 #define RTE_VHOST_USER_ASYNC_COPY             (1ULL << 7)
 #define RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS (1ULL << 8)
+#define RTE_VHOST_USER_ASYNC_USE_VFIO         (1ULL << 9)

 /* Features. */
 #ifndef VIRTIO_NET_F_GUEST_ANNOUNCE
diff --git a/lib/vhost/socket.c b/lib/vhost/socket.c
index 5d0d728d52..77c722c86b 100644
--- a/lib/vhost/socket.c
+++ b/lib/vhost/socket.c
@@ -42,6 +42,7 @@ struct vhost_user_socket {
     bool extbuf;
     bool linearbuf;
     bool async_copy;
+    bool async_use_vfio;
     bool net_compliant_ol_flags;

     /*
@@ -243,6 +244,13 @@ vhost_user_add_connection(int fd, struct vhost_user_socket *vsocket)
             dev->async_copy = 1;
     }

+    if (vsocket->async_use_vfio) {
+        dev = get_device(vid);
+
+        if (dev)
+            dev->async_use_vfio = 1;
+    }
+
     VHOST_LOG_CONFIG(INFO, "new device, handle is %d\n", vid);

     if (vsocket->notify_ops->new_connection) {
@@ -879,6 +887,7 @@ rte_vhost_driver_register(const char *path, uint64_t flags)
     vsocket->extbuf = flags & RTE_VHOST_USER_EXTBUF_SUPPORT;
     vsocket->linearbuf = flags & RTE_VHOST_USER_LINEARBUF_SUPPORT;
     vsocket->async_copy = flags & RTE_VHOST_USER_ASYNC_COPY;
+    vsocket->async_use_vfio = flags & RTE_VHOST_USER_ASYNC_USE_VFIO;
     vsocket->net_compliant_ol_flags = flags & RTE_VHOST_USER_NET_COMPLIANT_OL_FLAGS;

     if (vsocket->async_copy &&
diff --git a/lib/vhost/vhost.h b/lib/vhost/vhost.h
index 8078ddff79..fb775ce4ed 100644
--- a/lib/vhost/vhost.h
+++ b/lib/vhost/vhost.h
@@ -370,6 +370,7 @@ struct virtio_net {
     int16_t broadcast_rarp;
     uint32_t nr_vring;
     int async_copy;
+    int async_use_vfio;
     int extbuf;
     int linearbuf;
     struct vhost_virtqueue *virtqueue[VHOST_MAX_QUEUE_PAIRS * 2];
diff --git a/lib/vhost/vhost_user.c b/lib/vhost/vhost_user.c
index 8f0eba6412..f3703f2e72 100644
--- a/lib/vhost/vhost_user.c
+++ b/lib/vhost/vhost_user.c
@@ -45,6 +45,7 @@
 #include <rte_common.h>
 #include <rte_malloc.h>
 #include <rte_log.h>
+#include <rte_vfio.h>

 #include "iotlb.h"
 #include "vhost.h"
@@ -141,6 +142,36 @@ get_blk_size(int fd)
     return ret == -1 ? (uint64_t)-1 : (uint64_t)stat.st_blksize;
 }

+static int
+async_dma_map(struct rte_vhost_mem_region *region, bool do_map)
+{
+    int ret = 0;
+    uint64_t host_iova;
+    host_iova = rte_mem_virt2iova((void *)(uintptr_t)region->host_user_addr);
+    if (do_map) {
+        /* Add mapped region into the default container of DPDK. */
+        ret = rte_vfio_container_dma_map(RTE_VFIO_DEFAULT_CONTAINER_FD,
+                                         region->host_user_addr,
+                                         host_iova,
+                                         region->size);
+        if (ret) {
+            VHOST_LOG_CONFIG(ERR, "DMA engine map failed\n");
+            return ret;
+        }
+    } else {
+        /* Remove mapped region from the default container of DPDK. */
+        ret = rte_vfio_container_dma_unmap(RTE_VFIO_DEFAULT_CONTAINER_FD,
+                                           region->host_user_addr,
+                                           host_iova,
+                                           region->size);
+        if (ret) {
+            VHOST_LOG_CONFIG(ERR, "DMA engine unmap failed\n");
+            return ret;
+        }
+    }
+    return ret;
+}

We've been discussing this off list with Xuan, and unfortunately this is
a blocker for now.

Currently, the x86 IOMMU does not support partial unmap - the segments
have to be unmapped exactly the same addr/len as they were mapped. We also concatenate adjacent mappings to prevent filling up the DMA mapping entry table with superfluous entries.

This means that, when two unrelated mappings are contiguous in memory (e.g. if you map regions 1 and 2 independently, but they happen to be sitting right next to each other in virtual memory), we cannot later unmap one of them because, even though these are two separate mappings as far as kernel VFIO infrastructure is concerned, the mapping gets compacted and looks like one single mapping to VFIO, so DPDK API will not let us unmap region 1 without also unmapping region 2.

The proper fix for this problem would be to always map memory page-by-page regardless of where it comes from (we already do that for internal memory, but not for external). However, the reason this works for internal memory is because when mapping internal memory segments, *we know the page size*. For external memory segments, there is no such guarantee, so we cannot deduce page size for a given memory segment, and thus can't map things page-by-page.

So, the proper fix for it would be to add page size to the VFIO DMA API. Unfortunately, it probably has to wait until 21.11 because it is an API change.

The slightly hacky fix for this would be to forego user mem map concatenation and trust that user is not going to do anything stupid, and will not spam the VFIO DMA API without reason. I would rather not go down this road, but this could be an option in this case.

Thoughts?

Thanks Anatoly for the detailed description of the issue. It may be possible to either create a versioned symbol for this API change, or maybe even to have a temporary internal API.

But I think this series in its current form is not acceptable, so waiting for v21.11 would be the best option (we may want to send the
deprecation notice in this release though).

In this series, I don't like the user application has to pass a flag to state whether the DMA engine uses VFIO or not. AFAICT, this new revision does not implement what was discussed in the previous one, i.e. supporting both IOVA_AS_VA and IOVA_AS_PA.

Thanks for your comments. Here I hope to explain some questions:

1. Whether both IOVA_AS_VA and IOVA_AS_PA are supported now?
A: Both IOVA_AS_PA and IOVA_AS_VA are supported now. In this version, the virtual address is replaced with iova address of mapped region, and the iova address is selected to program the IOMMU instead of virtual address only.

Good!

2. Why a flag is chosen to be passed by application?
A: Yes, as we discussed before, the rte_eal_iova_mode() API can be used to get the IOVA mode, so as to determine whether IOMMU should be programmed.
However, in the implementation process, I found a problem. That is how to distinguish the VFIO PA and IGB_UIO PA. Because for VFIO cases, we should always program the IOMMU. While in IGB_UIO cases, it depends on IOMMU capability of platform.

How does one program IOMMU with igb_uio? I was under impression that igb_uio (and uio_pci_generic for that matter) does not provide such facilities.

+1

Maybe some misunderstanding in this sentence here. In our design, if rte_eal_vfio_is_enabled("vfio") is true, iommu will be programmed. True means vfio module is modprobed.

But there is an exception here, that is, even if vfio module is modprobed, DPDK user still bind all the devices to igb_uio.

This situation can be distinguished in DPDK eal initialization, because the resource mapping is according to the driver loaded by each device (rte_pci_map_device).

But in our scenario, this judgment is somewhat weak. Because we cannot get the device driver info in vhost library. I also think it is unreasonable for vhost to do this. Only trust that users will not use it like this. Thoughts for this scenario?
I don't see how igb_uio would make any difference at all. If you are using igb_uio, you *don't have DMA mapping at all* and will use raw physical addresses. Assuming your code supports this, that's all you're ever going to get. The point of VFIO is to have memory regions that are mapped for DMA *because real physical addresses are assumed to be not available*. When you're using igb_uio, you effectively do have DMA access to the entire memory, and thus can bypass IOMMU altogether (assuming you're using passthrough mode).

My concern is exactly here. In igb_uio cases, although devices are not added to the default container in eal init, but the "IOMMU programming" actually happens when the rte_vfio_container_dma_map() is called. It is no harm but it is also unnecessary.

Yes, it is unnecessary, but it's also not actively harmful, which means you can still do it without any regard as to whether you do or don't have IOMMU :) Think of a hybrid VFIO/igb_uio setup - some NICs will have VFIO, some will have igb_uio. The igb_uio-bound NICs will not care if you have mapped anything for DMA because they don't go through IOMMU, things will "just work". The VFIO-bound NICs will get the memory mapped, because they are the ones who actually need the DMA mapping.

So, what you get is, if you do VFIO DMA mapping unconditionally, 1) NICs with igb_uio won't care about this, and 2) NICs with VFIO will benefit. You're not "mapping" the NICs, you're mapping the memory you're accessing with those NICs. You need it to be accessible to both, but since you have no way of knowing whether 1) any of the current HW needs VFIO, and 2) any of *future hotplugged* HW needs VFIO, the easiest way to solve this problem is just to map things regardless, and live with the "unnecessary" but harmless mapping in the worst case.
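
A minimal sketch of the "map unconditionally" design argued for above. It does not use the real DPDK API: `toy_dma_map()`, `struct toy_platform` and `struct guest_region` are hypothetical stand-ins, and the stub simply assumes, as stated in this thread, that a DMA map request is harmless when no device goes through the IOMMU. The only point it illustrates is that the mapping loop needs neither a driver check nor an application-supplied flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guest memory region (stand-in for a vhost memory region). */
struct guest_region {
    uint64_t host_user_addr;
    uint64_t iova;
    uint64_t size;
};

/* Hypothetical platform model: counts mappings that actually reach an IOMMU. */
struct toy_platform {
    int vfio_active;            /* 1 if some device is behind VFIO */
    unsigned int iommu_entries; /* mappings that were really programmed */
};

/* Stand-in for the DMA map call: assumed to be a harmless no-op without VFIO. */
static int toy_dma_map(struct toy_platform *p, const struct guest_region *r)
{
    assert(r->size != 0);
    if (p->vfio_active)
        p->iommu_entries++;
    return 0;
}

/* Map every region unconditionally: no driver check, no application flag. */
static int map_guest_memory(struct toy_platform *p,
                            const struct guest_region *regs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int ret = toy_dma_map(p, &regs[i]);
        if (ret != 0)
            return ret;
    }
    return 0;
}

/* Usage helper: map two regions and report how many IOMMU entries resulted. */
static unsigned int entries_after_mapping(int vfio_active)
{
    struct toy_platform p = { .vfio_active = vfio_active, .iommu_entries = 0 };
    const struct guest_region regs[] = {
        { 0x7f0000000000ULL, 0x100000000ULL, 0x40000000ULL },
        { 0x7f0040000000ULL, 0x140000000ULL, 0x40000000ULL },
    };

    if (map_guest_memory(&p, regs, 2) != 0)
        return 0;
    return p.iommu_entries;
}
```

Under these assumptions, `entries_after_mapping(1)` reports two IOMMU entries and `entries_after_mapping(0)` reports none, while the caller's code path is identical in both cases.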
I get your point! It is just this worst case that bothers me. I have been thinking about how to avoid programming the IOMMU in the igb_uio case, but I cannot achieve this with a simple check. Since the mapping is harmless in this case, and on a platform without IOMMU the call won’t do anything useful anyway, I think it works to program the IOMMU unconditionally.
Bottom line: do VFIO DMA mapping unconditionally. If VFIO is active - great, the memory will be DMA mapped. If it's not active - no harm will ever be done by mapping the memory for DMA anyway.

Do VFIO DMA mapping unconditionally, do you mean the rte_eal_vfio_is_enabled() is unnecessary? What if the platform does not have IOMMU? Thanks very much.

If the platform has no IOMMU, the API call will just not do anything useful, so no harm done.
So the only thing remaining is the API change for page-by-page mapping in the next release.

Thanks,
Xuan
Regards,
Xuan

--
Thanks,
Anatoly
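
To make the partial-unmap problem discussed in this thread concrete, here is a toy model (not DPDK code) of a user memory map table that concatenates adjacent mappings and only allows unmapping with exactly the addr/len that is recorded. All names (`toy_table`, `toy_map_add`, `toy_map_del`, `unmap_first_region`) are hypothetical, and the merge logic is deliberately simplified to the single case of a new mapping starting where an existing one ends.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_MAPS 16

/* One recorded DMA mapping in the toy table. */
struct toy_map {
    uint64_t addr;
    uint64_t len;
};

struct toy_table {
    struct toy_map maps[MAX_MAPS];
    int n;
    int merge_adjacent; /* 1 = concatenate adjacent mappings */
};

/* Record a mapping; optionally merge it into an entry that ends where it starts. */
static int toy_map_add(struct toy_table *t, uint64_t addr, uint64_t len)
{
    assert(t->n <= MAX_MAPS);
    if (t->merge_adjacent) {
        for (int i = 0; i < t->n; i++) {
            if (t->maps[i].addr + t->maps[i].len == addr) {
                t->maps[i].len += len;
                return 0;
            }
        }
    }
    if (t->n == MAX_MAPS)
        return -1;
    t->maps[t->n].addr = addr;
    t->maps[t->n].len = len;
    t->n++;
    return 0;
}

/* Unmap succeeds only on an exact addr/len match (no partial unmap). */
static int toy_map_del(struct toy_table *t, uint64_t addr, uint64_t len)
{
    for (int i = 0; i < t->n; i++) {
        if (t->maps[i].addr == addr && t->maps[i].len == len) {
            t->maps[i] = t->maps[t->n - 1];
            t->n--;
            return 0;
        }
    }
    return -1;
}

/* Map regions 1 and 2 back to back, then try to unmap region 1 only. */
static int unmap_first_region(int merge_adjacent)
{
    struct toy_table t = { .n = 0, .merge_adjacent = merge_adjacent };

    toy_map_add(&t, 0x100000, 0x200000); /* region 1 */
    toy_map_add(&t, 0x300000, 0x200000); /* region 2, directly adjacent */
    return toy_map_del(&t, 0x100000, 0x200000);
}
```

In this model `unmap_first_region(1)` returns -1, because regions 1 and 2 were merged into a single entry that no longer matches region 1, while `unmap_first_region(0)` returns 0. The second case corresponds to the "slightly hacky fix" of forgoing concatenation; mapping page-by-page would give the same per-entry granularity, but needs the page size that the thread proposes adding to the VFIO DMA API.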