Re: [PATCH v4 1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info
From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Date: 2024-06-06 15:06:52
Also in:
linux-acpi, linux-cxl, linux-edac, linux-pci, lkml
On Thu, 9 May 2024 16:48:31 +0800 Zhenzhong Duan [off-list ref] wrote:
In some cases the detector of a Non-Fatal Error (NFE) is not the most appropriate agent to determine the type of the error. For example, when software performs a configuration read from a non-existent device or Function, the completer will send an ERR_NONFATAL Message. On some platforms, ERR_NONFATAL results in a System Error, which breaks normal software probing.

Advisory Non-Fatal Error (ANFE) is a special case that can be used in the above scenario. It is predominantly determined by the role of the detecting agent (Requester, Completer, or Receiver) and the specific error. In such cases, an agent with AER signals the NFE (if enabled) by sending an ERR_COR Message as an advisory to software, instead of sending ERR_NONFATAL.

When processing an ANFE, ideally both the correctable error (CE) status and the uncorrectable error (UE) status should be cleared. However, there is no way to fully identify the UE associated with an ANFE. Even worse, a Non-Fatal Error (NFE) may set the same UE status bit as an ANFE. Treating an ANFE as an NFE will reproduce the issue mentioned above, i.e., breaking software probing; treating an NFE as an ANFE will make us ignore some UEs that need an active recovery operation.

To avoid clearing UEs that are not ANFE by accident, the most conservative route is taken here: if any of the NFE Detected bits is set in Device Status, do not touch the UE status; those bits should be cleared later by the UE handler. Otherwise, a specific set of UEs that may be raised as ANFE according to the PCIe specification will be cleared if their corresponding severity is Non-Fatal.

To achieve this, store the UNCOR_STATUS bits that might be ANFE in aer_err_info.anfe_status, so that those bits can be printed and processed later.

Tested-by: Yudong Wang <redacted>
Co-developed-by: "Wang, Qingshun" <redacted>
Signed-off-by: "Wang, Qingshun" <redacted>
Signed-off-by: Zhenzhong Duan <redacted>
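For readers following the thread, the conservative rule in the description can be sketched as a small standalone filter. This is only an illustration, not the code from the patch: the function name, the macro names, and the particular set of bits in SKETCH_ANFE_UNC_MASK are assumptions made for this example, and the bit values are meant to follow the usual AER register layout. It also assumes that the Advisory Non-Fatal bit in the correctable status is what marks an ERR_COR as an ANFE, and that a set bit in the severity register means Fatal.

```c
#include <stdint.h>

/* Illustrative bit definitions (assumed for this sketch, not taken from the patch). */
#define SKETCH_DEVSTA_NFED      0x0002u      /* Device Status: Non-Fatal Error Detected */
#define SKETCH_COR_ADV_NFAT     0x00002000u  /* Correctable Status: Advisory Non-Fatal */
#define SKETCH_UNC_POISON_TLP   0x00001000u
#define SKETCH_UNC_COMP_TIME    0x00004000u
#define SKETCH_UNC_COMP_ABORT   0x00008000u
#define SKETCH_UNC_UNX_COMP     0x00010000u
#define SKETCH_UNC_UNSUP        0x00100000u

/* Hypothetical set of UEs that may be signalled as ANFE. */
#define SKETCH_ANFE_UNC_MASK (SKETCH_UNC_POISON_TLP | SKETCH_UNC_COMP_TIME | \
                              SKETCH_UNC_COMP_ABORT | SKETCH_UNC_UNX_COMP |  \
                              SKETCH_UNC_UNSUP)

/*
 * Return the UNCOR_STATUS bits that may safely be treated as ANFE,
 * or 0 if the UE status should be left for the UE handler.
 */
static uint32_t sketch_anfe_status(uint16_t devsta, uint32_t cor_status,
                                   uint32_t uncor_status, uint32_t uncor_severity)
{
	/* Not an advisory non-fatal error at all: nothing to record. */
	if (!(cor_status & SKETCH_COR_ADV_NFAT))
		return 0;

	/* An NFE was detected too: do not touch UE status. */
	if (devsta & SKETCH_DEVSTA_NFED)
		return 0;

	/* Keep only ANFE-capable bits whose severity is Non-Fatal (severity bit clear). */
	return uncor_status & ~uncor_severity & SKETCH_ANFE_UNC_MASK;
}
```

In this sketch the returned value plays the role that the description assigns to aer_err_info.anfe_status: a caller would store it, print it, and later clear exactly those bits.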
Not my most confident review ever, as this is nasty and gives me a headache, but your description is good and I think the implementation looks reasonable.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>