RE: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86


From: David Laight <hidden>
Date: 2020-07-15 14:38:35
Also in: linux-kernel-mentees, linux-pci, lkml, sparclinux

From: Oliver O'Halloran

Sent: 15 July 2020 05:19

On Wed, Jul 15, 2020 at 8:03 AM Arnd Bergmann wrote:

...

- config space accesses are very rare compared to memory
  space access and on the hardware side the error handling
  would be similar, but readl/writel don't return errors, they just
  access wrong registers or return 0xffffffff.
  arch/powerpc/kernel/eeh.c has a ton of extra code written to
  deal with it, but no other architectures do.

TBH the EEH MMIO hooks were probably a mistake to begin with. Errors
detected via MMIO are almost always asynchronous to the error itself
so you usually just wind up with a misleading stack trace rather than
any kind of useful synchronous error reporting. It seems like most
drivers don't bother checking for 0xFFs either and rely on the
asynchronous reporting via .error_detected() instead, so I have to
wonder what the point is. I've been thinking of removing the MMIO
hooks and using a background poller to check for errors on each PHB
periodically (assuming we don't have an EEH interrupt) instead. That
would remove the requirement for eeh_dev_check_failure() to be
interrupt safe too, so it might even let us fix all the godawful races
in EEH.
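
To make the poller idea concrete, here is a minimal userspace sketch of a single polling pass. It is not kernel code: struct phb_model, its fields and poll_phbs_once() are hypothetical names invented for illustration, and setting the 'reported' flag merely stands in for notifying drivers through .error_detected().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI host bridge (PHB); not a kernel structure. */
struct phb_model {
	bool hw_error_latched;	/* what the hardware would report if asked */
	bool reported;		/* set once the poller has notified drivers */
};

/*
 * One pass of the background poller: look at every PHB and report each
 * new error exactly once. Returns the number of newly reported errors.
 */
static size_t poll_phbs_once(struct phb_model *phbs, size_t count)
{
	size_t newly_reported = 0;

	assert(phbs != NULL || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (phbs[i].hw_error_latched && !phbs[i].reported) {
			/* Stands in for calling the drivers' .error_detected(). */
			phbs[i].reported = true;
			newly_reported++;
		}
	}
	return newly_reported;
}
```

Because the pass would run from a periodic worker rather than from inside an MMIO accessor, nothing in it would have to be safe to call from interrupt context.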

I've 'played' with PCIe error handling - without much success.
What might be useful is for a driver that has just read ~0u to
be able to ask 'has there been an error signalled for this device?'.
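
Roughly what I have in mind, as a userspace model rather than real driver code. struct dev_model, model_readl() and checked_read() are made-up names, and the error_signalled flag stands in for whatever the platform's error reporting would provide (in the kernel, pci_channel_offline() looks like the closest existing query).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a device register behind a possibly failed link. */
struct dev_model {
	uint32_t reg;		/* value the register really holds */
	bool error_signalled;	/* would come from the platform's error reporting */
};

enum read_status { READ_OK, READ_DEVICE_ERROR };

/* Like readl(): a failed access yields all-ones and no error code. */
static uint32_t model_readl(const struct dev_model *dev)
{
	return dev->error_signalled ? UINT32_MAX : dev->reg;
}

/*
 * Read the register and, only when the value is all-ones, ask whether an
 * error has been signalled, so a genuine 0xffffffff is not misreported.
 */
static enum read_status checked_read(const struct dev_model *dev, uint32_t *val)
{
	assert(dev != NULL && val != NULL);

	*val = model_readl(dev);
	if (*val == UINT32_MAX && dev->error_signalled)
		return READ_DEVICE_ERROR;
	return READ_OK;
}
```

The query only happens on the rare all-ones path, so the normal read path costs nothing extra.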

I got an error generated by doing an MMIO access that was inside
the address range forwarded to the slave, but outside any of its BARs.
(Two BARs of different sizes leaves a nice gap.)
This got reported up to the bridge nearest the slave (which supported
error handling), but not to the root bridge (which I don't think does).
ISTR a message about EEH being handled by the hardware (the machine
is up but dmesg is full of messages from a bouncing USB mouse).
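
For anyone wanting to reproduce the gap access mentioned above: BARs are power-of-two sized and naturally aligned, so a small BAR followed by a larger one leaves a hole inside the bridge's forwarding window. A small sketch of the check follows; struct range and both helpers are hypothetical, and any addresses fed to them are for illustration only, not taken from my system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* An address range [start, start + size). Hypothetical helper type. */
struct range {
	uint64_t start;
	uint64_t size;
};

static bool in_range(const struct range *r, uint64_t addr)
{
	return addr >= r->start && addr - r->start < r->size;
}

/*
 * True if addr is forwarded by the bridge (inside its window) but is not
 * claimed by any of the device's BARs, i.e. it falls into a gap.
 */
static bool addr_in_gap(const struct range *window,
			const struct range *bars, size_t nbars, uint64_t addr)
{
	assert(window != NULL && (bars != NULL || nbars == 0));

	if (!in_range(window, addr))
		return false;
	for (size_t i = 0; i < nbars; i++) {
		if (in_range(&bars[i], addr))
			return false;
	}
	return true;
}
```

With a 4KiB BAR at the start of a 1MiB window and a 64KiB BAR at the next 64KiB boundary, everything from offset 0x1000 to 0xffff is such a gap.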

With such partial error reporting, useful info can still be extracted.

Of course, what actually happens on a PCIe error is that the signal
gets routed to some 'board support logic' and then passed back into
the kernel as an NMI - which then crashes the kernel!
This even happens when the PCIe link goes down after we've done a
soft-remove of the device itself!
Rather makes updating the board's FPGA without a reboot tricky.

	David

