Re: [PATCH kernel] powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains
From: Jason Gunthorpe <jgg@nvidia.com>
Date: 2022-07-07 15:11:15
Also in:
kvm
On Thu, Jul 07, 2022 at 11:55:52PM +1000, Alexey Kardashevskiy wrote:
Historically PPC64 managed to avoid using iommu_ops. The VFIO driver uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in the Type1 VFIO driver. Recent development though has added a coherency capability check to the generic part of VFIO and essentially disabled VFIO on PPC64; the similar story about iommu_group_dma_owner_claimed(). This adds an iommu_ops stub which reports support for cache coherency. Because bus_set_iommu() triggers IOMMU probing of PCI devices, this provides minimum code for the probing to not crash. Because now we have to set iommu_ops to the system (bus_set_iommu() or iommu_device_register()), this requires the POWERNV PCI setup to happen after bus_register(&pci_bus_type) which is postcore_initcall TODO: check if it still works, read sha1, for more details: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5537fcb319d016ce387 Because setting the ops triggers probing, this does not work well with iommu_group_add_device(), hence the move to iommu_probe_device(). Because iommu_probe_device() does not take the group (which is why we had the helper in the first place), this adds pci_controller_ops::device_group. So, basically there is one iommu_device per PHB and devices are added to groups indirectly via series of calls inside the IOMMU code. pSeries is out of scope here (a minor fix needed for barely supported platform in regard to VFIO). The previous discussion is here: https://patchwork.ozlabs.org/project/kvm-ppc/patch/20220701061751.1955857-1-aik@ozlabs.ru/
I think this is basically OK, for what it is. It looks like there is more some-day opportunity to make use of the core infrastructure though.
does it make sense to have this many callbacks, or the generic IOMMU code can safely operate without some (given I add some more checks for !NULL)? thanks,
I wouldn't worry about it..
quoted hunk ↗ jump to hunk
@@ -1156,7 +1158,10 @@ int iommu_add_device(struct iommu_table_group *table_group, struct device *dev) pr_debug("%s: Adding %s to iommu group %d\n", __func__, dev_name(dev), iommu_group_id(table_group->group)); - return iommu_group_add_device(table_group->group, dev); + ret = iommu_probe_device(dev); + dev_info(dev, "probed with %d\n", ret);
For another day, but it seems a bit strange to call iommu_probe_device() like this? Shouldn't one of the existing call sites cover this? The one in of_iommu.c perhaps?
+static bool spapr_tce_iommu_is_attach_deferred(struct device *dev)
+{
+ return false;
+}
I think you can NULL this op:
static bool iommu_is_attach_deferred(struct device *dev)
{
const struct iommu_ops *ops = dev_iommu_ops(dev);
if (ops->is_attach_deferred)
return ops->is_attach_deferred(dev);
return false;
}
+static struct iommu_group *spapr_tce_iommu_device_group(struct device *dev)
+{
+ struct pci_controller *hose;
+ struct pci_dev *pdev;
+
+ /* Weirdly iommu_device_register() assigns the same ops to all buses */
+ if (!dev_is_pci(dev))
+ return ERR_PTR(-EPERM);
+
+ pdev = to_pci_dev(dev);
+ hose = pdev->bus->sysdata;
+
+ if (!hose->controller_ops.device_group)
+ return ERR_PTR(-ENOENT);
+
+ return hose->controller_ops.device_group(hose, pdev);
+}Is this missing a refcount get on the group?
+
+static int spapr_tce_iommu_attach_dev(struct iommu_domain *dom,
+ struct device *dev)
+{
+ return 0;
+}It is important when this returns that the iommu translation is all emptied. There should be no left over translations from the DMA API at this point. I have no idea how power works in this regard, but it should be explained why this is safe in a comment at a minimum. It will turn into a security problem to allow kernel mappings to leak past this point. Jason