Thread (17 messages) 17 messages, 4 authors, 2024-06-14

Re: [PATCH v4 1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info

From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Date: 2024-06-06 15:06:52
Also in: linux-acpi, linux-cxl, linux-edac, linux-pci, lkml

On Thu,  9 May 2024 16:48:31 +0800
Zhenzhong Duan [off-list ref] wrote:
In some cases the detector of a Non-Fatal Error(NFE) is not the most
appropriate agent to determine the type of the error. For example,
when software performs a configuration read from a non-existent
device or Function, completer will send an ERR_NONFATAL Message.
On some platforms, ERR_NONFATAL results in a System Error, which
breaks normal software probing.

Advisory Non-Fatal Error(ANFE) is a special case that can be used
in above scenario. It is predominantly determined by the role of the
detecting agent (Requester, Completer, or Receiver) and the specific
error. In such cases, an agent with AER signals the NFE (if enabled)
by sending an ERR_COR Message as an advisory to software, instead of
sending ERR_NONFATAL.

When processing an ANFE, ideally both correctable error(CE) status and
uncorrectable error(UE) status should be cleared. However, there is no
way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
NFE will reproduce above mentioned issue, i.e., breaking softwore probing;
treating NFE as ANFE will make us ignoring some UEs which need active
recover operation. To avoid clearing UEs that are not ANFE by accident,
the most conservative route is taken here: If any of the NFE Detected
bits is set in Device Status, do not touch UE status, they should be
cleared later by the UE handler. Otherwise, a specific set of UEs that
may be raised as ANFE according to the PCIe specification will be cleared
if their corresponding severity is Non-Fatal.

To achieve above purpose, store UNCOR_STATUS bits that might be ANFE
in aer_err_info.anfe_status. So that those bits could be printed and
processed later.

Tested-by: Yudong Wang <redacted>
Co-developed-by: "Wang, Qingshun" <redacted>
Signed-off-by: "Wang, Qingshun" <redacted>
Signed-off-by: Zhenzhong Duan <redacted>
Not my most confident review ever as this is nasty and gives
me a headache but your description is good and I think the
implementation looks reasonable.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help