Thread (112 messages) 112 messages, 18 authors, 2023-01-15

Re: [patch V2 00/46] x86, PCI, XEN, genirq ...: Prepare for device MSI

From: Qian Cai <hidden>
Date: 2020-09-25 15:29:34
Also in: linux-iommu, linux-next, linux-pci, lkml, xen-devel

On Wed, 2020-08-26 at 13:16 +0200, Thomas Gleixner wrote:
This is the second version of providing a base to support device MSI (non
PCI based) and on top of that support for IMS (Interrupt Message Storm)
based devices in a halfways architecture independent way.

The first version can be found here:

    https://lore.kernel.org/r/20200821002424.119492231@linutronix.de (local)

It's still a mixed bag of bug fixes, cleanups and general improvements
which are worthwhile independent of device MSI.
Reverting the part of this patchset on the top of today's linux-next fixed an
boot issue on HPE ProLiant DL560 Gen10, i.e.,

$ git revert --no-edit 13b90cadfc29..bc95fd0d7c42

.config: https://gitlab.com/cailca/linux-mm/-/blob/master/x86.config

It looks like the crashes happen in the interrupt remapping code where they are
only able to to generate partial call traces.

[    1.912386][    T0] ACPI: X2APIC_NMI (uid[0xf5] high level 9983][    T0] ... MAX_LOCK_DEPTH:          48
[    7.914876][    T0] ... MAX_LOCKDEP_KEYS:        8192
[    7.919942][    T0] ... CLASSHASH_SIZE:          4096
[    7.925009][    T0] ... MAX_LOCKDEP_ENTRIES:     32768
[    7.930163][    T0] ... MAX_LOCKDEP_CHAINS:      65536
[    7.935318][    T0] ... CHAINHASH_SIZE:          32768
[    7.940473][    T0]  memory used by lock dependency info: 6301 kB
[    7.946586][    T0]  memory used for stack traces: 4224 kB
[    7.952088][    T0]  per task-struct memory footprint: 1920 bytes
[    7.968312][    T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    7.980281][    T0] ACPI: Core revision 20200717
[    7.993343][    T0] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
[    8.003270][    T0] APIC: Switch to symmetric I/O mode setup
[    8.008951][    T0] DMAR: Host address width 46
[    8.013512][    T0] DMAR: DRHD base: 0x000000e5ffc000 flags: 0x0
[    8.019680][    T0] DMAR: dmar0: reg_base_addr e5ffc000 ver 1:0 cap 8d2078c106f0466 [    T0] DMAR-IR: IOAPIC id 15 under DRHD base  0xe5ffc000 IOMMU 0
[    8.420990][    T0] DMAR-IR: IOAPIC id 8 under DRHD base  0xddffc000 IOMMU 15
[    8.428166][    T0] DMAR-IR: IOAPIC id 9 under DRHD base  0xddffc000 IOMMU 15
[    8.435341][    T0] DMAR-IR: HPET id 0 under DRHD base 0xddffc000
[    8.441456][    T0] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    8.457911][    T0] DMAR-IR: Enabled IRQ remapping in x2apic mode
[    8.466614][    T0] BUG: kernel NULL pointer dereference, address: 0000000000000000
[    8.474295][    T0] #PF: supervisor instruction fetch in kernel mode
[    8.480669][    T0] #PF: error_code(0x0010) - not-present page
[    8.486518][    T0] PGD 0 P4D 0 
[    8.489757][    T0] Oops: 0010 [#1] SMP KASAN PTI
[    8.494476][    T0] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G          I       5.9.0-rc6-next-20200925 #2
[    8.503987][    T0] Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10, BIOS U34 11/13/2019
[    8.513238][    T0] RIP: 0010:0x0
[    8.516562][    T0] Code: Bad RIP v

or

[    2.906744][    T0] ACPI: X2API32, address 0xfec68000, GSI 128-135
[    2.907063][    T0] IOAPIC[15]: apic_id 29, version 32, address 0xfec70000, GSI 136-143
[    2.907071][    T0] IOAPIC[16]: apic_id 30, version 32, address 0xfec78000, GSI 144-151
[    2.907079][    T0] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[    2.907084][    T0] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[    2.907100][    T0] Using ACPI (MADT) for SMP configuration information
[    2.907105][    T0] ACPI: HPET id: 0x8086a701 base: 0xfed00000
[    2.907116][    T0] ACPI: SPCR: console: uart,mmio,0x0,115200
[    2.907121][    T0] TSC deadline timer available
[    2.907126][    T0] smpboot: Allowing 144 CPUs, 0 hotplug CPUs
[    2.907163][    T0] [mem 0xd0000000-0xfdffffff] available for PCI devices
[    2.907175][    T0] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
[    2.914541][    T0] setup_percpu: NR_CPUS:256 nr_cpumask_bits:144 nr_cpu_ids:144 nr_node_ids:4
[    2.926109][   466 ecap f020df
[    9.134709][    T0] DMAR: DRHD base: 0x000000f5ffc000 flags: 0x0
[    9.140867][    T0] DMAR: dmar8: reg_base_addr f5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    9.149610][    T0] DMAR: DRHD base: 0x000000f7ffc000 flags: 0x0
[    9.155762][    T0] DMAR: dmar9: reg_base_addr f7ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    9.164491][    T0] DMAR: DRHD base: 0x000000f9ffc000 flags: 0x0
[    9.170645][    T0] DMAR: dmar10: reg_base_addr f9ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    9.179476][    T0] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[    9.185626][    T0] DMAR: dmar11: reg_base_addr fbffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    9.194442][    T0] DMAR: DRHD base: 0x000000dfffc000 flags: 0x0
[    9.200587][    T0] DMAR: dmar12: reg_base_addr dfffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    9.209418][    T0] DMAR: DRHD base: 0x000000e1ffc000 flags: 0x0
[    9.215551][    T0] DMAR: dmar13: reg_base_addr e1ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    9.224367][    T0] DMAR: DRHD base: 0x000000e3ffc83][    T0]  msi_domain_alloc+0x8e/0x280
[    9.615015][    T0]  __irq_domain_a8992cd
[    9.711906][    T0] R10: ffffffff85407d78 R11: fffffbfff18992cc R12: ffffffff8546ffc0
[    9.719761][    T0] R13: 0000000000000098 R14: ffff888106e63a40 R15: 0000000000000001
[    9.727617][    T0] FS:  0000000000000000(0000) GS:ffff8887df800000(0000) knlGS:0000000000000000
[    9.736431][    T0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    9.742892][    T0] CR2: ffffffffffffffd6 CR3: 0000001ba7814001 CR4: 00000000000606b0
[    9.750747][    T0] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    9.758601][    T0] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    9.766456][    T0] Kernel panic - not syncing: Fatal exception
[    9.772547][    T0] ---[ end Kernel panic - not syncing: Fatal exception ]---

The working boot (without those patches) looks like this:

[    1.913963][    T0] ACPI: X2APIC_NMI (uid[0xf4] high level lint[0x1])
[    1.913967][    T0] ACPI: X2APIC_NMI (uid[0xf5] high level lint[0x1])
[    1.913970][    T0] ACPI: X2APIC_NMI (uid[0xf6] high level lint[0x1])
[    1.913974][    T0] ACPI: X2APIC_NMI (uid[0xf7] high level lint[0x1])
[    1.914017][    T0] IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
[    1.914032][    T0] IOAPIC[1]: apic_id 9, version 32, address 0xfec01000, GSI 24-31
[    1.914039][    T0] IOAPIC[2]: apic_id 10, version 32, address 0xfec08000, GSI 32-39
[    1.914047][    T0] IOAPIC[3]: apic_id 11, version 32, address 0xfec10000, GSI 40-47
[    1.914054][    T0] IOAPIC[4]: apic_id 12, version 32, address 0xfec18000, GSI 48-55
[    1.914062][    T0] IOAPIC[5]: apic_id 15, version 32, address 0xfec20000, GSI 56-63
[    1.[    7.994567][    T0] mempolicy: Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[    8.006541][    T0] ACPI: Core revision 20200717
[    8.019713][    T0] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 79635855245 ns
[    8.029672][    T0] APIC: Switch to symmetric I/O mode setup
[    8.035354][    T0] DMAR: Host address width 46
[    8.039915][    T0] DMAR: DRHD base: 0x000000e5ffc000 flags: 0x0
[    8.046095][    T0] DMAR: dmar0: reg_base_addr e5ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    8.054840][    T0] DMAR: DRHD base: 0x000000e7ffc000 flags: 0x0
[    8.060997][    T0] DMAR: dmar1: reg_base_addr e7ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    8.069740][    T0] DMAR: DRHD base: 0x000000e9ffc000 flags: 0x0
[    8.075872][    T0] DMAR: dmar2: reg_base_addr e9ffc000 ver 1:0 cap 8d2078c106f0466 ecap f020df
[    8.084615][    T0] DMAR: DRHD base: 0x000000ebffc000 flags: 0x0
[    8.090761][    T0] DMAR: dmar3: reg_base_addr ebffc000 ver 1:0 cap 8d2078c106f0466 ecap fMAR-IR: Enabled IRQ remapping in x2apic mode
[    8.513491][    T0] ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
[    8.568289][    T0] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x2b3e459bf4c, max_idle_ns: 440795289890 ns
[    8.579576][    T0] Calibrating delay loop (skipped), value calculated using timer frequency.. 6000.00 BogoMIPS (lpj=30000000)
[    8.589574][    T0] pid_max: default: 147456 minimum: 1152
[    8.714025][    T0] efi: memattr: Entry attributes invalid: RO and XP bits both cleared
[    8.719577][    T0] efi: memattr: ! 0x0000a057a000-0x0000a05b4fff [Runtime Code       |RUN|  |  |  |  |  |  |  |   |  |  |  |  ]
[    8.775355][    T0] Dentry cache hash table entries: 8388608 (order: 14, 67108864 bytes, vmalloc)
[    8.798868][    T0] Inode-cache hash table entries: 4194304 (order: 13, 33554432 bytes, vmalloc)
[    8.811550][    T0] Mount-cache hash table entries: 131072 (order: 8, 1048576 bytes, vmalloc)
[    8.820076][    T0] Mountpoint-cache hash table entries: 131072 (order: 8, 1048576 bytes, vmalloc)
[    8.879327][    T0] mce: CPU0: Thermal mo[    8.996916][    T1] Performance Events: PEBS fmt3+, Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
[    8.999591][    T1] ... version:                4
[    9.004310][    T1] ... bit width:              48
[    9.009118][    T1] ... generic registers:      4
[    9.009574][    T1] ... value mask:             0000ffffffffffff
[    9.015601][    T1] ... max period:             00007fffffffffff
[    9.019574][    T1] ... fixed-purpose events:   3
[    9.024294][    T1] ... event mask:             000000070000000f
[    9.034357][    T1] rcu: Hierarchical SRCU implementation.
[    9.062516][    T5] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
There are quite a bunch of issues to solve:

  - X86 does not use the device::msi_domain pointer for historical reasons
    and due to XEN, which makes it impossible to create an architecture
    agnostic device MSI infrastructure.

  - X86 has it's own msi_alloc_info data type which is pointlessly
    different from the generic version and does not allow to share code.

  - The logic of composing MSI messages in an hierarchy is busted at the
    core level and of course some (x86) drivers depend on that.

  - A few minor shortcomings as usual

This series addresses that in several steps:

 1) Accidental bug fixes

      iommu/amd: Prevent NULL pointer dereference

 2) Janitoring

      x86/init: Remove unused init ops
      PCI: vmd: Dont abuse vector irqomain as parent
      x86/msi: Remove pointless vcpu_affinity callback

 3) Sanitizing the composition of MSI messages in a hierarchy
 
      genirq/chip: Use the first chip in irq_chip_compose_msi_msg()
      x86/msi: Move compose message callback where it belongs

 4) Simplification of the x86 specific interrupt allocation mechanism

      x86/irq: Rename X86_IRQ_ALLOC_TYPE_MSI* to reflect PCI dependency
      x86/irq: Add allocation type for parent domain retrieval
      iommu/vt-d: Consolidate irq domain getter
      iommu/amd: Consolidate irq domain getter
      iommu/irq_remapping: Consolidate irq domain lookup

 5) Consolidation of the X86 specific interrupt allocation mechanism to be as
close
    as possible to the generic MSI allocation mechanism which allows to get
rid
    of quite a bunch of x86'isms which are pointless

      x86/irq: Prepare consolidation of irq_alloc_info
      x86/msi: Consolidate HPET allocation
      x86/ioapic: Consolidate IOAPIC allocation
      x86/irq: Consolidate DMAR irq allocation
      x86/irq: Consolidate UV domain allocation
      PCI/MSI: Rework pci_msi_domain_calc_hwirq()
      x86/msi: Consolidate MSI allocation
      x86/msi: Use generic MSI domain ops

  6) x86 specific cleanups to remove the dependency on arch_*_msi_irqs()

      x86/irq: Move apic_post_init() invocation to one place
      x86/pci: Reducde #ifdeffery in PCI init code
      x86/irq: Initialize PCI/MSI domain at PCI init time
      irqdomain/msi: Provide DOMAIN_BUS_VMD_MSI
      PCI: vmd: Mark VMD irqdomain with DOMAIN_BUS_VMD_MSI
      PCI/MSI: Provide pci_dev_has_special_msi_domain() helper
      x86/xen: Make xen_msi_init() static and rename it to xen_hvm_msi_init()
      x86/xen: Rework MSI teardown
      x86/xen: Consolidate XEN-MSI init
      irqdomain/msi: Allow to override msi_domain_alloc/free_irqs()
      x86/xen: Wrap XEN MSI management into irqdomain
      iommm/vt-d: Store irq domain in struct device
      iommm/amd: Store irq domain in struct device
      x86/pci: Set default irq domain in pcibios_add_device()
      PCI/MSI: Make arch_.*_msi_irq[s] fallbacks selectable
      x86/irq: Cleanup the arch_*_msi_irqs() leftovers
      x86/irq: Make most MSI ops XEN private
      iommu/vt-d: Remove domain search for PCI/MSI[X]
      iommu/amd: Remove domain search for PCI/MSI

  7) X86 specific preparation for device MSI

      x86/irq: Add DEV_MSI allocation type
      x86/msi: Rename and rework pci_msi_prepare() to cover non-PCI MSI

  8) Generic device MSI infrastructure
      platform-msi: Provide default irq_chip:: Ack
      genirq/proc: Take buslock on affinity write
      genirq/msi: Provide and use msi_domain_set_default_info_flags()
      platform-msi: Add device MSI infrastructure
      irqdomain/msi: Provide msi_alloc/free_store() callbacks

  9) POC of IMS (Interrupt Message Storm) irq domain and irqchip
     implementations for both device array and queue storage.

      irqchip: Add IMS (Interrupt Message Storm) driver - NOT FOR MERGING

Changes vs. V1:

   - Addressed various review comments and addressed the 0day fallout.
     - Corrected the XEN logic (Jürgen)
     - Make the arch fallback in PCI/MSI opt-in not opt-out (Bjorn)

   - Fixed the compose MSI message inconsistency

   - Ensure that the necessary flags are set for device SMI

   - Make the irq bus logic work for affinity setting to prepare
     support for IMS storage in queue memory. It turned out to be
     less scary than I feared.

   - Remove leftovers in iommu/intel|amd

   - Reworked the IMS POC driver to cover queue storage so Jason can have a
     look whether that fits the needs of MLX devices.

The whole lot is also available from git:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git device-msi

This has been tested on Intel/AMD/KVM but lacks testing on:

    - HYPERV (-ENODEV)
    - VMD enabled systems (-ENODEV)
    - XEN (-ENOCLUE)
    - IMS (-ENODEV)

    - Any non-X86 code which might depend on the broken compose MSI message
      logic. Marc excpects not much fallout, but agrees that we need to fix
      it anyway.

#1 - #3 should be applied unconditionally for obvious reasons
#4 - #6 are wortwhile cleanups which should be done independent of device MSI

#7 - #8 look promising to cleanup the platform MSI implementation
     	independent of #8, but I neither had cycles nor the stomach to
     	tackle that.

#9	is obviously just for the folks interested in IMS

Thanks,

	tglx
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help