Re: [PATCH kernel 9/9] KVM: PPC: Add in-kernel acceleration for VFIO

From: Alexey Kardashevskiy <hidden>
Date: 2016-12-14 03:53:22
Also in: linuxppc-dev

On 10/12/16 02:35, Alex Williamson wrote:

On Fri, 9 Dec 2016 18:53:43 +1100
Alexey Kardashevskiy [off-list ref] wrote:

quoted

On 09/12/16 04:55, Alex Williamson wrote:

quoted

On Thu,  8 Dec 2016 19:19:56 +1100
Alexey Kardashevskiy [off-list ref] wrote:

quoted

This allows the host kernel to handle H_PUT_TCE, H_PUT_TCE_INDIRECT
and H_STUFF_TCE requests targeted an IOMMU TCE table used for VFIO
without passing them to user space which saves time on switching
to user space and back.

This adds H_PUT_TCE/H_PUT_TCE_INDIRECT/H_STUFF_TCE handlers to KVM.
KVM tries to handle a TCE request in the real mode, if failed
it passes the request to the virtual mode to complete the operation.
If it a virtual mode handler fails, the request is passed to
the user space; this is not expected to happen though.

To avoid dealing with page use counters (which is tricky in real mode),
this only accelerates SPAPR TCE IOMMU v2 clients which are required
to pre-register the userspace memory. The very first TCE request will
be handled in the VFIO SPAPR TCE driver anyway as the userspace view
of the TCE table (iommu_table::it_userspace) is not allocated till
the very first mapping happens and we cannot call vmalloc in real mode.

This adds new attribute - KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE - to
the VFIO KVM device. It takes a VFIO group fd and SPAPR TCE table fd
and associates a physical IOMMU table with the SPAPR TCE table (which
is a guest view of the hardware IOMMU table). The iommu_table object
is referenced so we do not have to retrieve in real mode when hypercall
happens.

This does not implement the UNSET counterpart as there is no use for it -
once the acceleration is enabled, the existing userspace won't
disable it unless a VFIO container is detroyed so this adds necessary
cleanup to the KVM_DEV_VFIO_GROUP_DEL handler.

This uses the kvm->lock mutex to protect against a race between
the VFIO KVM device's kvm_vfio_destroy() and SPAPR TCE table fd's
release() callback.

This advertises the new KVM_CAP_SPAPR_TCE_VFIO capability to the user
space.

This finally makes use of vfio_external_user_iommu_id() which was
introduced quite some time ago and was considered for removal.

Tests show that this patch increases transmission speed from 220MB/s
to 750..1020MB/s on 10Gb network (Chelsea CXGB3 10Gb ethernet card).

Signed-off-by: Alexey Kardashevskiy <redacted>
---
 Documentation/virtual/kvm/devices/vfio.txt |  21 +-
 arch/powerpc/include/asm/kvm_host.h        |   8 +
 arch/powerpc/include/asm/kvm_ppc.h         |   5 +
 include/uapi/linux/kvm.h                   |   8 +
 arch/powerpc/kvm/book3s_64_vio.c           | 302 +++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_64_vio_hv.c        | 178 +++++++++++++++++
 arch/powerpc/kvm/powerpc.c                 |   2 +
 virt/kvm/vfio.c                            | 108 +++++++++++
 8 files changed, 630 insertions(+), 2 deletions(-)

diff --git a/Documentation/virtual/kvm/devices/vfio.txt b/Documentation/virtual/kvm/devices/vfio.txt
index ef51740c67ca..ddb5a6512ab3 100644
--- a/Documentation/virtual/kvm/devices/vfio.txt
+++ b/Documentation/virtual/kvm/devices/vfio.txt

@@ -16,7 +16,24 @@ Groups:
 
 KVM_DEV_VFIO_GROUP attributes:
   KVM_DEV_VFIO_GROUP_ADD: Add a VFIO group to VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
   KVM_DEV_VFIO_GROUP_DEL: Remove a VFIO group from VFIO-KVM device tracking
+	kvm_device_attr.addr points to an int32_t file descriptor
+	for the VFIO group.
+  KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE: attaches a guest visible TCE table
+	allocated by sPAPR KVM.
+	kvm_device_attr.addr points to a struct:
 
-For each, kvm_device_attr.addr points to an int32_t file descriptor
-for the VFIO group.
+	struct kvm_vfio_spapr_tce {
+		__u32	argsz;
+		__s32	groupfd;
+		__s32	tablefd;
+		__u8	pad[4];
+	};
+
+	where
+	@argsz is the size of kvm_vfio_spapr_tce_liobn;
+	@groupfd is a file descriptor for a VFIO group;
+	@tablefd is a file descriptor for a TCE table allocated via
+		KVM_CREATE_SPAPR_TCE.

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 28350a294b1e..94774503c70d 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h

@@ -191,6 +191,13 @@ struct kvmppc_pginfo {
 	atomic_t refcnt;
 };
 
+struct kvmppc_spapr_tce_iommu_table {
+	struct rcu_head rcu;
+	struct list_head next;
+	struct iommu_table *tbl;
+	atomic_t refs;
+};
+
 struct kvmppc_spapr_tce_table {
 	struct list_head list;
 	struct kvm *kvm;

@@ -199,6 +206,7 @@ struct kvmppc_spapr_tce_table {
 	u32 page_shift;
 	u64 offset;		/* in pages */
 	u64 size;		/* window size in pages */
+	struct list_head iommu_tables;
 	struct page *pages[0];
 };

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 0a21c8503974..17b947a0060d 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h

@@ -163,6 +163,11 @@ extern long kvmppc_prepare_vrma(struct kvm *kvm,
 extern void kvmppc_map_vrma(struct kvm_vcpu *vcpu,
 			struct kvm_memory_slot *memslot, unsigned long porder);
 extern int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu);
+extern long kvm_spapr_tce_attach_iommu_group(struct kvm *kvm,
+				int tablefd,
+				struct iommu_group *grp);
+extern void kvm_spapr_tce_detach_iommu_group(struct kvm *kvm,
+				struct iommu_group *grp);
 
 extern long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
 				struct kvm_create_spapr_tce_64 *args);

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 810f74317987..9e4025724e28 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h

@@ -1068,6 +1068,7 @@ struct kvm_device_attr {
 #define  KVM_DEV_VFIO_GROUP			1
 #define   KVM_DEV_VFIO_GROUP_ADD			1
 #define   KVM_DEV_VFIO_GROUP_DEL			2
+#define   KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE		3
 
 enum kvm_device_type {
 	KVM_DEV_TYPE_FSL_MPIC_20	= 1,

@@ -1089,6 +1090,13 @@ enum kvm_device_type {
 	KVM_DEV_TYPE_MAX,
 };
 
+struct kvm_vfio_spapr_tce {
+	__u32	argsz;
+	__s32	groupfd;
+	__s32	tablefd;
+	__u8	pad[4];
+};

I assume you're implementing argsz and padding for future expansion,
but it doesn't really work.  Presumably argsz would be set to 16, so
the only way that the kernel will ever know something has changed would
be to make it bigger, so the padding bytes are really reserved, and
then it's not clear why we have padding at all.  If you replaced the
padding with a __u32 flags, then we could actually have some room to
architect expansion, but as it is we might as well drop both argsz and
padding.

Ah, right, sorry, did not pay attention to this bit this time. I'll replace
pad with flags and move to argsz.

quoted

+
 /*
  * ioctls for VM fds
  */

...

quoted

--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c

@@ -20,6 +20,10 @@
 #include <linux/vfio.h>
 #include "vfio.h"
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+#include <asm/kvm_ppc.h>
+#endif
+
 struct kvm_vfio_group {
 	struct list_head node;
 	struct vfio_group *vfio_group;

@@ -76,6 +80,22 @@ static bool kvm_vfio_group_is_coherent(struct vfio_group *vfio_group)
 	return ret > 0;
 }
 
+static int kvm_vfio_external_user_iommu_id(struct vfio_group *vfio_group)
+{
+	int (*fn)(struct vfio_group *);
+	int ret = -1;
+
+	fn = symbol_get(vfio_external_user_iommu_id);
+	if (!fn)
+		return ret;
+
+	ret = fn(vfio_group);
+
+	symbol_put(vfio_external_user_iommu_id);
+
+	return ret;
+}
+
 /*
  * Groups can use the same or different IOMMU domains.  If the same then
  * adding a new group may change the coherency of groups we've previously

@@ -110,6 +130,22 @@ static void kvm_vfio_update_coherency(struct kvm_device *dev)
 	mutex_unlock(&kv->lock);
 }
 
+#ifdef CONFIG_SPAPR_TCE_IOMMU
+static void kvm_vfio_spapr_detach_iommu_group(struct kvm *kvm,
+		struct vfio_group *vfio_group)
+{
+	int group_id;
+	struct iommu_group *grp;
+
+	group_id = kvm_vfio_external_user_iommu_id(vfio_group);
+	grp = iommu_group_get_by_id(group_id);
+	if (grp) {
+		kvm_spapr_tce_detach_iommu_group(kvm, grp);
+		iommu_group_put(grp);
+	}
+}
+#endif


I wonder if you could use the new vfio group notifier to avoid tainting
this code with SPAPR_TCE #ifdefs.  Thanks,

I cannot see how... The new notifier sets kvm to a group, I need the
opposite - attach a group to kvm and not just to KVM but to a specific KVM
SPAPR TCE fd (which is a child object of KVM and which owns a LIOBN bus id).

The only way I see how I can avoid tainting this code is adding another
ioctl() to PPC KVM (or its child object - SPAPR TCE object), and I tried
that few years ago and I was told to add a KVM device or even reuse VFIO
KVM device.

What am I missing here?

You would still need a KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE, but the ugly
part of encompassing this all in #ifdefs is that we call
kvm_spapr_tce_{at,de}tach_iommu_group() directly. The notifier would
sort of make it like an arch callback, vfio core would set these
attributes and broadcast them to notifier callbacks, on non-spapr-tce
platforms nobody would be listening for those notifications.
Ultimately I don't know how much cleaner it makes things, but it maybe
avoids spapr-tce #ifdefs leaking into every layer of the stack.  Thanks,

I am failing here.

The normal workflow:
- create SPAPR TCE object in KVM, represents 1 LIOBN aka DMA window;
- attach IOMMU group to SPAPR TCE object, in this step the hardware tables
(1 or 2 iommu_table objects) are put to the SPAPR TCE's list of attached
tables; the tables are referenced.

When reboot happens, the SPAPR TCE object is destroyed and new guest starts
from the very beginning.


The task I am solving: dereference iommu_table (hardware table) in 2 cases:
1) guest reboot - SPAPR TCE table is destroyed but VFIO KVM device is still
there with all attached groups;
2) VFIO PCI hot unplug - SPAPR TCE table is there but groups need to be
detached from the VFIO KVM device.


Tried fixing 2) with the new callbacks:

- they do not take iommu_group, they take devices - fixed by duplicating
the vfio_(un)register_notifier API:
  * vfio_iommu_group_register_notifier()
  * vfio_iommu_group_unregister_notifier()
plus kvm wrappers with symbol_get/symbol_put in
arch/powerpc/kvm/book3s_64_vio.c.

- the callbacks are registered per a IOMMU group, the only place in this
code which knows about groups is SPAPR TCE driver but attach_group()
callback is called when IOMMU driver is not yet set ->
vfio_register_notifier() fails. KVM does not know about groups until
KVM_DEV_VFIO_GROUP_ADD/KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE so I register
callback in KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE;

- the callback does not pass vfio_group pointer, only kvm; so my notifier
needs to be wrapped into a struct with a group pointer, ok, done;

- the notifier needs to be unregistered and the wrapper struct from the
previous step needs to be freed. No nice mechanisms for that - I cannot
unregister a notifier from a notifier itself. I fixed it by calling
rcu_sched() from the notifier when KVM==NULL and RCU-scheduled callback
calls vfio_iommu_group_unregister_notifier().

Looks quite ugly already but ok.


Now I am trying to solve 1). I can dereference iommu_table objects but
registered notifiers remain in memory and they are not freed so each guest
reboot increases the list length. I do not have a way to access vfio_group
structs from KVM, there I only have a list of iommu_table structs, each of
which has a list of iommu_group structs (this is done this way to make real
mode handlers possible) but there is no way to get to vfio_group struct
from iommu_group.

I can duplicate group list once again, this time is will be vfio_group list
attached to SPAPR TCE object but this all seems to be way to much, does not
it?...




-- 
Alexey

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help