RE: Questions: Should kernel panic when PCIe fatal error occurs?

From: David Laight <hidden>
Date: 2023-09-25 08:07:52
Also in: linux-acpi, linux-pci, lkml

From: Shuai Xue

Sent: 25 September 2023 02:44

On 2023/9/21 21:20, David Laight wrote:

quoted

...
I've got a target to generate AER errors by generating read cycles
that are inside the address range that the bridge forwards but
outside of any BAR because there are 2 different sized BARs.
(Pretty easy to set up.)
On the system I was using they didn't get propagated all the way
to the root bridge - but were visible in the lower bridge.
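The failing condition described above (an address the bridge forwards but that no BAR claims) can be sketched as a range check. This is only an illustration: `struct range` and `forwarded_but_unclaimed()` are hypothetical names, not kernel interfaces, and the window and BAR sizes are whatever the caller supplies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of an address range [base, base + size). */
struct range {
    uint64_t base;
    uint64_t size;
};

static bool in_range(const struct range *r, uint64_t addr)
{
    return addr >= r->base && addr - r->base < r->size;
}

/*
 * True when the bridge would forward a read of addr downstream but no
 * BAR behind the bridge claims the address.
 */
static bool forwarded_but_unclaimed(const struct range *window,
                                    const struct range *bars, size_t nbars,
                                    uint64_t addr)
{
    if (!in_range(window, addr))
        return false;           /* bridge does not forward it at all */
    for (size_t i = 0; i < nbars; i++)
        if (in_range(&bars[i], addr))
            return false;       /* a BAR claims it: normal access */
    return true;                /* forwarded, but nothing claims it */
}
```

With two BARs of different sizes packed into one forwarding window, any address in the window past the end of the smaller BAR satisfies this check.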

So how did you observe it? If the error message does not propagate
to the root bridge, I think no AER interrupt will be triggered.

I looked at the internal registers (IIRC in PCIe config space)
of the intermediate bridge.
I don't think the root bridge on that system supported AER.
(I was testing the generation of AER indications by our fpga.)
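As a rough sketch of that kind of inspection, the code below decodes the error bits of the PCI Express Device Status register from a 256-byte snapshot of a device's config space (for example a dump of the intermediate bridge). It is an assumption that Device Status is the register of interest; the message does not say which registers were examined, and the AER capability itself lives in extended config space, which this sketch does not cover. The function and macro names are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define CFG_STATUS          0x06   /* PCI Status register */
#define CFG_STATUS_CAP_LIST 0x10   /* capability list present */
#define CFG_CAP_PTR         0x34   /* offset of first capability */
#define CAP_ID_PCIE         0x10   /* PCI Express capability ID */
#define PCIE_DEVSTA         0x0a   /* Device Status, relative to capability */
#define PCIE_DEVSTA_CED     0x0001 /* Correctable Error Detected */
#define PCIE_DEVSTA_NFED    0x0002 /* Non-Fatal Error Detected */
#define PCIE_DEVSTA_FED     0x0004 /* Fatal Error Detected */
#define PCIE_DEVSTA_URD     0x0008 /* Unsupported Request Detected */

/* Little-endian 16-bit read from the snapshot. */
static uint16_t cfg_read16(const uint8_t cfg[256], unsigned off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

/* Offset of the PCIe capability in the snapshot, or 0 if absent. */
static unsigned find_pcie_cap(const uint8_t cfg[256])
{
    if (!(cfg_read16(cfg, CFG_STATUS) & CFG_STATUS_CAP_LIST))
        return 0;
    unsigned pos = cfg[CFG_CAP_PTR] & 0xfc;
    /* ttl bounds the walk in case the list is malformed or loops. */
    for (int ttl = 48; ttl > 0 && pos >= 0x40; ttl--) {
        if (cfg[pos] == CAP_ID_PCIE)
            return pos;
        pos = cfg[pos + 1] & 0xfc;
    }
    return 0;
}

/* Error bits of Device Status, or -1 if there is no PCIe capability. */
static int pcie_error_bits(const uint8_t cfg[256])
{
    unsigned pos = find_pcie_cap(cfg);
    if (!pos || pos + PCIE_DEVSTA + 1 >= 256)
        return -1;
    return cfg_read16(cfg, pos + PCIE_DEVSTA) & 0x000f;
}
```

A non-zero result on the lower bridge, with nothing reported further upstream, would match the behaviour described here.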

quoted

It would be nice for a driver to be able to detect/clear such
a flag if it gets an unexpected ~0u read value.
(I'm not sure an error callback helps.)

IMHO, a general model is that an error detected at an endpoint should be
routed to the upstream port (for example, an RCiEP routes its error message
to the RCEC) so that the AER port service can handle the error; the device
driver only has to implement the error handler callback.

The problem is that any such callback is too late for something
triggered by a PCIe read.
The driver has to detect that the value is 'dubious' and wants
a method of detecting whether there was an associated AER (or other)
error.
If the AER indication is routed through some external entity (like
board management hardware) there will be additional latency that
means that the associated interrupt (even if an NMI) may not have
been processed when the driver code is trying to determine what
happened.
This can only be made worse by the interrupt coming in on a
different cpu.
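A minimal sketch of the synchronous check being asked for, assuming the driver can read the Device Status register of whichever device latched the error right after the suspect read. Nothing here is an existing kernel API: `classify_read()` and `devsta_clear_mask()` are hypothetical helpers, and the caller is assumed to fetch `devsta` itself.

```c
#include <stdint.h>

#define DEVSTA_ERR_MASK 0x000f /* CED | NFED | FED | URD in Device Status */

enum read_verdict {
    READ_OK,                /* value is not all ones */
    READ_ALL_ONES_NO_ERROR, /* all ones, but no error bit latched */
    READ_FAILED             /* all ones and an error bit is set */
};

/*
 * Classify a 32-bit read given the Device Status value sampled
 * immediately afterwards by the caller.
 */
static enum read_verdict classify_read(uint32_t val, uint16_t devsta)
{
    if (val != UINT32_MAX)
        return READ_OK;
    if (devsta & DEVSTA_ERR_MASK)
        return READ_FAILED;
    return READ_ALL_ONES_NO_ERROR;
}

/*
 * The error bits are write-1-to-clear, so writing this value back
 * clears only the bits that were observed.
 */
static uint16_t devsta_clear_mask(uint16_t devsta)
{
    return devsta & DEVSTA_ERR_MASK;
}
```

If the device has dropped off the bus, the status read itself returns all ones, which this sketch also reports as READ_FAILED.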

quoted

OTOH a 'nebs compliant' server routed any kind of PCIe link error
through to some 'system management' logic that then raised an NMI.
I'm not sure who thought an NMI was a good idea - they are pretty
impossible to handle in the kernel and too late to be of use to
the code performing the access.

I think it is the responsibility of the device to prevent the spread of
errors while reporting that errors have been detected. For example, drop
the current, (drain submit queue) and report error in completion record.

Eh?
I can generate two types of PCIe error:
- Read/write requests for addresses that aren't inside a BAR.
- Link failures that cause retraining and might need config
  space reconfiguring.

Both NMI and MSI are asynchronous interrupts.

Indeed, which makes neither of them suitable for any indication
relating to a bus cycle failure.

quoted

In any case we were getting one after 'echo 1 >xxx/remove' and
then taking the PCIe link down by reprogramming the fpga.
So the link going down was entirely expected, but there seemed
to be nothing we could do to stop the kernel crashing.

I'm sure 'nebs compliant' ought to contain some requirements for
resilience to hardware failures!

How did the kernel crash after a link down? Did the system detect a surprise
down error?

It was a couple of years ago...
IIRC the 'link down' caused the hub to generate an AER error.
The root hub forwarded it to some 'board management hardware/software'
that then raised an NMI.
The kernel crashed because of an unexpected NMI.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
