Re: [PATCH v3 29/29] iommu/arm-smmu-v3-kvm: Add IOMMU ops
From: Mostafa Saleh <smostafa@google.com>
Date: 2025-08-06 14:10:41
Also in:
kvmarm, linux-iommu, lkml
On Tue, Aug 05, 2025 at 02:57:53PM -0300, Jason Gunthorpe wrote:
On Mon, Aug 04, 2025 at 02:41:31PM +0000, Mostafa Saleh wrote:quoted
quoted
Are you saying the event queue is left behind for the kernel? How does that work if it doesn't have access to the registers?The evtq itself will be owned by the kernel, However, MMIO access would be trapped and emulated, here the PoC for part-2 of this series (as mentioned in the cover letter) This is very close to how nesting will work. https://android-kvm.googlesource.com/linux/+/refs/heads/for-upstream/pkvm-smmu-v3-part-2/drivers/iommu/arm/arm-smmu-v3/pkvm/arm-smmu-v3.c#744Oh weird, but Ok.quoted
quoted
In other words you have two cleanly seperate concerns here, an "pkvm iommu subsystem" that lets pkvm control iommu HW - and the current "iommu subsystem" that lets the kernel control iommu HW. The same driver should not register to both.I am not sure how that would work exactly, for example how would probe_device work, xlate... in a generic way?Well, I think it is not so bad actually. You just need to call iommu_device_register ret = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev); Where 'dev' is always the smmu struct device, even if your current probing driver is not the smmu device. That will capture all the iommu activity the ACPI/DT says is linked to that dev. From there you just make a normal small iommu driver with no connection back to the SMMUv3. Eg you could spawn an aux device from smmuv3 to do this with a far cleaner separation. xlate is just calling iommu_fwspec_add_ids() it is one line.
I am not sure I understand, the SMMU driver will register its IOMMU ops to probe the devices, then faux devices would also register IOMMU ops for the KVM HVCs? But that means that all DMA operations would still go through the SMMU one?
quoted
same for other ops. We can make some of these functions (hypercalls wrappers) in a separate file.What other ops are you worried about? static struct iommu_ops arm_smmu_ops = { .identity_domain = &arm_smmu_identity_domain, .blocked_domain = &arm_smmu_blocked_domain, .release_device = arm_smmu_release_device, .domain_alloc_paging_flags = arm_smmu_domain_alloc_paging_flags, ^^ Those are all your new domain type that does the hypercalls .probe_device = arm_smmu_probe_device, .get_resv_regions = arm_smmu_get_resv_regions, ^^ These are pretty empty .device_group = arm_smmu_device_group, .of_xlate = arm_smmu_of_xlate, ^^ Common code Don't need these: .capable = arm_smmu_capable, .hw_info = arm_smmu_hw_info, .domain_alloc_sva = arm_smmu_sva_domain_alloc, .page_response = arm_smmu_page_response, .def_domain_type = arm_smmu_def_domain_type, .get_viommu_size = arm_smmu_get_viommu_size, .viommu_init = arm_vsmmu_init, .user_pasid_table = 1, .pgsize_bitmap = -1UL, /* Restricted during device attach */ .owner = THIS_MODULE,
Makes sense, thanks for the detailed explanation.
quoted
Also I am not sure how that looks from the kernel perspective (do we have 2 struct devices per SMMU?)I think you'd want to have pkvm bound to the physical struct device and then spawn new faux, aux or something devices for the virtualized IOMMUs that probes the new paravirt driver. This driver would be fully self contained.
I think it’s hard to reason about this as 2 devices, from my pov it seems that the pKVM HVCs are a library that can be part of separate common file, then called from drivers. (with common ops) Instead of having extra complexity of 2 drivers (KVM and IOMMU PV). However, I can see the value of that as it totally abstracts the iommu ops outside the device specific code, I will give it more thought. But it feels that might be more suitable for a full fledged PV implementation (as in RFC v1 and v2).
quoted
I think we are on the same page about how that will look at the end. For nesting there will be a pKVM driver (as mentioned in the cover letter) to probe register the SMMUs, then it will unbind itself to let the currentThat sounds a little freaky to me. I guess it can be made to work, and is maybe the best option, but if this is the longterm plan then starting out with a non-iommu subsystem pkvm driver seems like a good idea. Today the pkvm driver would do enough to boot up pkvm in this limited mode, and maybe you have some small code duplications with the iommu driver. It forks off a aux device to create the para-virt pkvm iommu subsystem driver. Tomorrow you further shrink down the pkvm driver and remove the para-virt driver, then just somehow bind/unbind the pkvm one at early boot. Minimal changes to the existing smmu driver in either case.quoted
Then the hypervisor driver will use trap and emulate to handle SMMU access to the MMIO, providing an architecturally accurate SMMUv3 emualation, and it will not register iommu_ops.Well, it registers normal smmuv3 ops only, I think you mean.
I mean when nesting is added, the arm-smmu-v3-kvm, will not call “iommu_device_register”
quoted
So, I will happily drop the hypercalls and the iommu_ops from this series, if there is a better way to enlighten the hypervisor about the SIDs to be in identity.The iommu ops are a reasonable and appropriate way to do this. But since pkvm is all so special anyhow maybe you could hook pci_enable_device() and do your identity hypercall there? Do you already trap the config write to catch Bus Master Enable? Can pkvm just set identity when BME=1 and ABORT when BME=0?
pKVM doesn’t trap actual device access, also we have to support platform devices, so it would be better to rely on existing abstractions as “probe_device” I had an offline discussion with Will and Robin and they believe it might be better if we get rid of the kernel KVM SMMUv3 driver at all, and just rely on ARM_SMMU_V3 + extra hooks, so there is a single driver managing the SMMUs in the system. This way we don’t need to split current SMMUv3 or have different IOMMU ops, and reduces some of the duplication, also that avoids the need for a fake device. Then we have an extra file for KVM with some of the hooks (similar to the hooks in arm_smmu_impl_ops we have for tegra) And that might be more suitable for nesting also, to avoid the bind/unbind flow. I will investigate that and if feasible I will send v4 (hopefully shortly) based on this idea, otherwise I will see if we can separate KVM code and SMMU bootstrap code. Thanks, Mostafa
Jason