Re: [patch RFC 38/38] irqchip: Add IMS array driver - NOT FOR MERGING | linux-hyperv

From: Jason Gunthorpe <jgg@nvidia.com>
Date: 2020-08-22 00:51:52
Also in: linux-iommu, linux-pci, lkml, xen-devel

On Sat, Aug 22, 2020 at 01:47:12AM +0200, Thomas Gleixner wrote:

On Fri, Aug 21 2020 at 17:17, Jason Gunthorpe wrote:

On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:

So if I understand correctly then the queue memory where the MSI
descriptor sits is in RAM.

Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
could have enough on-die memory then they could just use really big
MSI-X tables. Currently due to on-die memory constraints mlx5 is
limited to a few hundred MSI-X vectors.

Right, that's the limit of a particular device, but nothing prevents you
from having a larger table on a new device.

Well, physics are a problem.. The SRAM to store the MSI vectors costs
die space and making the chip die larger is not an option. So the
question is what do you throw out of the chip to get a 10-20x increase
in MSI SRAM?

This is why using host memory is so appealing. It is
economically/functionally better.

I'm going to guess other HW is in the same situation, virtualization
is really pushing up the number of required IRQs.

How is that supposed to work if interrupt remapping is disabled?

The best we can do is issue a command to the device and spin/sleep
until completion. The device will serialize everything internally.

If the device has died the driver has code to detect and trigger a
PCI function reset which will definitely stop the interrupt.

If that interrupt has gone into storm mode for some reason then this will
render your machine unusable before you can do that.

Yes, but in general the HW design is to have one-shot interrupts, it
would have to be well off the rails to storm. The kind of off the
rails where it could also be doing crazy stuff on PCI-E that would be
very harmful.

So, the implementation of these functions would be to push any change
onto a command queue, trigger the device to DMA the command, spin/sleep
until the device returns a response and then continue on. If the
device doesn't return a response in a time window then trigger a WQ to
do a full device reset.
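
A minimal sketch of the flow described above, written as a user-space Python model rather than driver code. The class `CommandChannel`, the `device` callable and the `reset_requested` flag are hypothetical names chosen for illustration; a real driver would post to its own command queue and schedule a work item for the reset.

```python
import queue
import threading


class CommandChannel:
    """Toy model of a device command queue (all names are hypothetical)."""

    def __init__(self, device, timeout_s=1.0):
        # device: callable taking a command and returning a response,
        # or None to model a device that never answers.
        self.device = device
        self.timeout_s = timeout_s
        # Stands in for "trigger a WQ to do a full device reset".
        self.reset_requested = threading.Event()

    def issue(self, command):
        """Post a command, then sleep until a response or the time window ends."""
        reply = queue.Queue(maxsize=1)

        def run():
            response = self.device(command)
            if response is not None:
                reply.put(response)

        threading.Thread(target=run, daemon=True).start()
        try:
            return reply.get(timeout=self.timeout_s)
        except queue.Empty:
            # No response inside the window: request a reset and give up.
            self.reset_requested.set()
            return None
```

Because `issue()` sleeps, it can only be called from a context that is allowed to block, which is exactly the constraint discussed in the reply that follows.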

I really don't want to do that with the irq descriptor lock held or in
case of affinity from the interrupt handler as we have to do with PCI
MSI/MSI-X due to the horrors of the X86 interrupt delivery trainwreck.
Also you cannot call into command queue code from interrupt disabled and
interrupt descriptor lock held sections. You can try, but lockdep will
yell at you immediately.

Yes, I wouldn't want to do this from an IRQ.

One question is whether the device can see partial updates to that
memory due to the async 'swap' of context from the device CPU.

It is worse than just partial updates.. The device operation is much
more like you'd imagine a CPU cache. There could be copies of the RAM
in the device for long periods of time, dirty data in the device that
will flush back to CPU RAM overwriting CPU changes, etc.

Without involving the device there is just no way to create data
consistency, and no way to change the data from the CPU.

This is the down side of having device data in the RAM. It cannot be
so simple as 'just fetch it every time before you use it' as
performance would be horrible.

irq chips already have a mechanism in place to deal with stuff which
cannot be handled from within the irq descriptor spinlock held and
interrupt disabled section.

The mechanism was invented to deal with interrupt chips which are
connected to i2c, spi, etc.. The access to an interrupt chip control
register has to queue stuff on the bus and wait for completion.
Obviously not what you can do from interrupt disabled, raw spinlock held
context either.
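
The mechanism described here is presumably the pair of `irq_bus_lock()` / `irq_bus_sync_unlock()` callbacks in `struct irq_chip`; the mail itself does not name them. A rough user-space model of the idea, with hypothetical names (`SlowBusIrqChip`, `bus_write`): mask and unmask only touch a shadow copy, and the slow bus transfer happens once, at unlock time, in a context that may sleep.

```python
import threading


class SlowBusIrqChip:
    """Toy model of an irq chip behind a slow bus such as i2c or spi."""

    def __init__(self, bus_write):
        self.bus_write = bus_write    # slow transfer; may sleep
        self.shadow_mask = 0          # cached copy, cheap to modify
        self.hw_mask = 0              # what the chip was last told
        self.lock = threading.Lock()  # stands in for the bus lock

    def bus_lock(self):
        self.lock.acquire()

    def mask(self, hwirq):
        # Called in the "atomic" section: no bus access, shadow only.
        self.shadow_mask |= 1 << hwirq

    def unmask(self, hwirq):
        self.shadow_mask &= ~(1 << hwirq)

    def bus_sync_unlock(self):
        # Called after the atomic section: flush the shadow if it changed.
        try:
            if self.shadow_mask != self.hw_mask:
                self.bus_write(self.shadow_mask)
                self.hw_mask = self.shadow_mask
        finally:
            self.lock.release()
```

Several mask and unmask calls between lock and unlock collapse into a single bus write, and nothing is written at all when the shadow ends up unchanged.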

Ah interesting, sounds like the right parts! I didn't know about this..

Now coming back to affinity setting. I'd love to avoid adding the bus
lock magic to those interfaces because until now they can be called and
are called from atomic contexts. And obviously none of the devices which
use the buslock magic support affinity setting because they all deliver
a single interrupt to a demultiplex interrupt and that one is usually
sitting at the CPU level where interrupt steering works.

If we really can get away with atomically updating the message as
outlined above and just let it happen at some point in the future then
most problems are solved, except for the nastiness of CPU hotplug.

Since we can't avoid a device command, I'm thinking more along the lines
of having the affinity update trigger an async WQ to issue the command
from a thread context. Since it doesn't need to be synchronous it can
make it out 'eventually'.

I suppose the core code could provide this as a service? Sort of a
variant of the other lazy things above?
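
To make the async idea concrete, here is a small user-space model, not kernel code. `AsyncAffinity`, `send_command` and `flush` are hypothetical names: the setter only records the newest target and kicks a worker thread, and the worker issues the (possibly sleeping) device command later. Intermediate targets may be skipped, which is acceptable if the update only has to arrive eventually.

```python
import threading


class AsyncAffinity:
    """Toy model: affinity changes are applied later from thread context."""

    def __init__(self, send_command):
        self.send_command = send_command  # may sleep; worker thread only
        self.lock = threading.Lock()
        self.pending = None               # newest requested CPU, if any
        self.wake = threading.Event()
        self.idle = threading.Event()
        self.idle.set()
        threading.Thread(target=self._worker, daemon=True).start()

    def set_affinity(self, cpu):
        # Cheap and non-blocking apart from a short lock: safe for the
        # caller that stands in for atomic context in this model.
        with self.lock:
            self.pending = cpu
            self.idle.clear()
        self.wake.set()

    def _worker(self):
        while True:
            self.wake.wait()
            self.wake.clear()
            with self.lock:
                cpu, self.pending = self.pending, None
            if cpu is not None:
                self.send_command(cpu)
            with self.lock:
                if self.pending is None:
                    self.idle.set()

    def flush(self, timeout=1.0):
        """Wait until no update is outstanding; returns False on timeout."""
        return self.idle.wait(timeout)
```

A synchronous caller, for example a teardown path, could use something like `flush()` to wait until the last requested target has actually been sent.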

But that's actually a non issue. Nothing prevents us from having an
early 'migrate interrupts away from the outgoing CPU hotplug state'
which runs in thread context and can therefore utilize the buslock
mechanism. Actually I was thinking about that for other reasons already.

That would certainly work well, seems like it fits with the other
lazy/sleeping stuff above as well.

If interrupt remapping is enabled then both are trivial because then the
irq chip can delegate everything to the parent chip, i.e. the remapping
unit.

I did like this notion that IRQ remapping could avoid the overhead of
spin/sleep. Most of the use cases we have for this will require the
IOMMU anyhow.

You still need to support !remap scenarios I fear.

For x86 I think we could accept linking this to IOMMU, if really
necessary.

But it would have to work with ARM - is remapping an x86-only thing?
Does ARM put the affinity in the GIC tables not in the MSI data?

Let me summarize what I think would be the sane solution for this:

  1) Utilize atomic writes for either all 16 bytes or reorder the bytes
     and update 8 bytes atomically which is sufficient as the wide
     address is only used with irq remapping and the MSI message in the
     device is never changed after startup.

Sadly not something the device can manage due to data coherence.

  2) No requirement for issuing a command for regular migration
     operations as they have no requirements to be synchronous.

     Eventually store some state to force a reload on the next regular
     queue operation.

Would the async version above be OK?

  3) No requirement for issuing a command for mask and unmask operations.
     The core code uses and handles lazy masking already. So if the
     hardware causes the laziness, so be it.

This lazy masking thing sounds good, I'm totally unfamiliar with it
though.
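
For readers equally unfamiliar with it, a simplified model of lazy masking as I understand the core code's behaviour; the mail does not spell it out, and `LazyIrq` and its fields are hypothetical names. Disabling only sets a software flag. The hardware is masked only if an interrupt actually arrives while disabled, and that interrupt is remembered and replayed on enable.

```python
class LazyIrq:
    """Toy model of lazy interrupt disabling."""

    def __init__(self, handler):
        self.handler = handler
        self.disabled = False    # software state only
        self.hw_masked = False   # state of the (modelled) hardware mask
        self.pending = False     # an interrupt arrived while disabled
        self.hw_mask_writes = 0  # how often hardware had to be touched

    def disable(self):
        # Lazy: no hardware access here.
        self.disabled = True

    def hw_interrupt(self):
        if self.hw_masked:
            return  # masked hardware would not deliver anything
        if self.disabled:
            # First interrupt after a lazy disable: mask for real now.
            self.hw_masked = True
            self.hw_mask_writes += 1
            self.pending = True
            return
        self.handler()

    def enable(self):
        self.disabled = False
        self.hw_masked = False
        if self.pending:
            # Replay the interrupt that was held back.
            self.pending = False
            self.handler()
```

In the common case where nothing fires between disable and enable, the hardware mask is never written, which is why a slow or delayed mask operation in the device costs little.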

  4) Issue commands for startup and teardown as they need to be
     synchronous

Yep

  5) Have an early migration state for CPU hotunplug which issues a
     command from appropriate context. That would even allow to handle
     queue shutdown for managed interrupts when the last CPU in the
     managed affinity set goes down. Restart of such a managed interrupt
     when the first CPU in an affinity set comes online again would only
     need minor modifications of the existing code to make it work.

Yep

Thoughts?

This email is super helpful, I definitely don't know all these corners
of the IRQ subsystem as my past with it has mostly been SOC stuff that
isn't as complicated!

Thanks,
Jason
