Re: Questions: Should kernel panic when PCIe fatal error occurs?
From: "Oliver O'Halloran" <oohall@gmail.com>
Date: 2023-09-25 03:54:24
Also in:
linux-acpi, linux-pci, lkml
On Fri, Sep 22, 2023 at 8:23 AM David Laight [off-list ref] wrote:
quoted
It would be nice if they worked the same, but I suspect that vendors may rely on the fact that CPER_SEV_FATAL forces a restart/panic as part of their system integrity story.
The file system errors created by a panic (especially an NMI panic) could easily be more problematic than a failed PCIe data transfer. Even a read that returned ~0u - which can be checked for. Panicking a system that is converting TDM telephony to RTP for the 911 emergency service because a PCIe cable/riser connecting one of the TDM boards has become loose doesn't seem ideal.
For kernel native AER the default reaction to errors is reset-and-reinit which probably isn't much better for your case. Sounds like you would want a knob to suppress everything except error reporting so you can handle it in userspace?
(Or because the TDM board's FPGA has decided it isn't going to respond to any accesses until the BARs are set up again...) The system can carry on with some TDM connections disabled - but that is ok because they are all duplicated in case a cable gets cut.
Well that's a relief :)
Oliver
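For readers following the thread, the "read that returned ~0u - which can be checked for" remark can be illustrated with a minimal user-space sketch. It assumes 32-bit register reads and that a non-responding device yields all ones. The names `reg_read_fn`, `read_possibly_failed`, `checked_read`, `dead_read` and `live_read` are hypothetical and do not come from the thread or from any driver; in-kernel code would more likely use the existing `PCI_POSSIBLE_ERROR()` macro from `include/linux/pci.h`. Because all ones can also be legitimate data, the sketch re-reads a second register that is assumed never to be all ones on a working device.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value a PCIe read typically returns when the device does not respond. */
#define ALL_ONES_32 UINT32_MAX

/* Hypothetical accessor type: reads a 32-bit device register at 'offset'. */
typedef uint32_t (*reg_read_fn)(void *ctx, uint32_t offset);

/* True if 'val' might be an error response rather than real data. */
static bool read_possibly_failed(uint32_t val)
{
    return val == ALL_ONES_32;
}

/*
 * Read 'offset'. If the result is all ones, re-read 'sanity_offset', a
 * register assumed (for this sketch) never to be all ones on a working
 * device. Returns false only if both reads look like error responses;
 * otherwise stores the value in '*out' and returns true.
 */
static bool checked_read(reg_read_fn rd, void *ctx, uint32_t offset,
                         uint32_t sanity_offset, uint32_t *out)
{
    uint32_t val = rd(ctx, offset);

    if (read_possibly_failed(val) &&
        read_possibly_failed(rd(ctx, sanity_offset)))
        return false;

    *out = val;
    return true;
}

/* Demo stand-ins for real MMIO: an unresponsive device and a healthy one. */
static uint32_t dead_read(void *ctx, uint32_t offset)
{
    (void)ctx;
    (void)offset;
    return ALL_ONES_32;
}

static uint32_t live_read(void *ctx, uint32_t offset)
{
    (void)ctx;
    /* Offset 0 plays the role of the sanity register in this demo. */
    return offset == 0 ? 0x12345678u : ALL_ONES_32;
}
```

With a check of this kind, a caller could mark the affected connection as down and carry on rather than treating the failed read as valid data, which is the behaviour David describes wanting for the duplicated TDM links.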