Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING
From: Jason Gunthorpe <jgg@nvidia.com>
Date: 2020-08-21 20:17:26
Also in:
linux-iommu, linux-pci, lkml, xen-devel
On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> > > +static void ims_mask_irq(struct irq_data *data)
> > > +{
> > > +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> > > +	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> > > +	u32 __iomem *ctrl = &slot->ctrl;
> > > +
> > > +	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
> >
> > Just to be clear, this is exactly the sort of operation we can't do
> > with non-MSI interrupts. For a real PCI device to execute this it
> > would have to keep the data on die.
>
> We means NVIDIA and your new device, right?
We'd like to use this in the current Mellanox NIC HW, e.g. the mlx5 driver. (NVIDIA acquired Mellanox recently)
> So if I understand correctly then the queue memory where the MSI
> descriptor sits is in RAM.
Yes, IMHO that is the whole point of this 'IMS' stuff. If devices could have enough on-die memory then they could just use really big MSI-X tables. Currently due to on-die memory constraints mlx5 is limited to a few hundred MSI-X vectors.

Since MSI-X tables are exposed via MMIO they can't be 'swapped' to RAM.

Moving away from MSI-X's MMIO access model allows them to be swapped to RAM. The cost is that accessing them for update is a command/response operation not a MMIO operation.

The HW is already swapping the queues causing the interrupts to RAM, so adding a bit of additional data to store the MSI addr/data is reasonable.

To give some sense, a 'working set' for the NIC device in some cases can be hundreds of megabytes of data. System RAM is used to store this, and precious on-die memory holds some dynamic active set, much like a processor cache.
> How is that supposed to work if interrupt remapping is disabled?
The best we can do is issue a command to the device and spin/sleep until completion. The device will serialize everything internally.

If the device has died the driver has code to detect and trigger a PCI function reset which will definitely stop the interrupt.

So, the implementation of these functions would be to push any change onto a command queue, trigger the device to DMA the command, spin/sleep until the device returns a response and then continue on. If the device doesn't return a response in a time window then trigger a WQ to do a full device reset.

The spin/sleep is only needed if the update has to be synchronous, so things like rebalancing could just push the rebalancing work and immediately return.
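
To make the flow above concrete, here is a rough user-space sketch of the submit / wait / timeout-then-reset sequence. It is only a model under stated assumptions: every name in it (`struct ims_cmd`, `struct fake_dev`, `ims_update_sync()`, `ims_update_async()`) is hypothetical and invented for illustration, none of it is mlx5 or kernel API, and a poll counter stands in for the real time window and the reset workqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 4

enum { IMS_OK = 0, IMS_TIMEOUT = -1 };

/* Hypothetical command: new MSI addr/data for one interrupt slot. */
struct ims_cmd {
	uint32_t slot;
	uint64_t msi_addr;
	uint32_t msi_data;
};

/* Hypothetical simulated device; not a model of any real hardware. */
struct fake_dev {
	bool dead;                     /* a dead device never responds */
	unsigned int latency;          /* polls needed before a response */
	unsigned int polls;            /* polls since the last submit */
	bool cmd_pending;
	bool reset_scheduled;          /* stands in for queueing reset work */
	struct ims_cmd pending;
	struct ims_cmd stored[NSLOTS]; /* stands in for state swapped to RAM */
};

/* Push the change onto the (one-entry) command queue and ring the doorbell. */
static void dev_submit(struct fake_dev *dev, const struct ims_cmd *cmd)
{
	assert(cmd->slot < NSLOTS);
	dev->pending = *cmd;
	dev->cmd_pending = true;
	dev->polls = 0;
}

/* Poll once for a response; true once the device has applied the command. */
static bool dev_poll_response(struct fake_dev *dev)
{
	if (dev->dead || !dev->cmd_pending)
		return false;
	if (++dev->polls < dev->latency)
		return false;
	dev->stored[dev->pending.slot] = dev->pending;
	dev->cmd_pending = false;
	return true;
}

/* Synchronous update: submit, spin until response, reset on timeout. */
static int ims_update_sync(struct fake_dev *dev, const struct ims_cmd *cmd,
			   unsigned int max_polls)
{
	dev_submit(dev, cmd);
	for (unsigned int i = 0; i < max_polls; i++) {
		if (dev_poll_response(dev))
			return IMS_OK;
	}
	dev->reset_scheduled = true; /* no response in the window */
	return IMS_TIMEOUT;
}

/* Asynchronous update (e.g. rebalancing): submit and return immediately. */
static void ims_update_async(struct fake_dev *dev, const struct ims_cmd *cmd)
{
	dev_submit(dev, cmd);
}
```

In this sketch a caller that needs the change to be visible before continuing would use `ims_update_sync()`, while something like rebalancing would use `ims_update_async()` and never wait.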
> If interrupt remapping is enabled then both are trivial because then
> the irq chip can delegate everything to the parent chip, i.e. the
> remapping unit.
I did like this notion that IRQ remapping could avoid the overhead of spin/sleep. Most of the use cases we have for this will require the IOMMU anyhow.
> > I saw the idxd driver was doing something like this, I assume it
> > avoids trouble because it is a fake PCI device integrated with the
> > CPU, not on a real PCI bus?
>
> That's how it is implemented as far as I understood the patches. It's
> device memory therefore iowrite32().
I don't know anything about idxd.. Given the scale of interrupt need I assumed the idxd HW had some hidden swapping to RAM. Since it is on-die with the CPU there are a bunch of ways I could imagine Intel could make MMIO triggered swapping work that are not available to a true PCI-E device.

Jason