Re: [PATCH v4 22/23] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler
From: Bjorn Helgaas <helgaas@kernel.org>
Date: 2023-05-25 22:01:11
Also in:
linux-cxl, linux-pci, lkml
On Thu, May 25, 2023 at 11:29:58PM +0200, Robert Richter wrote:
> On 24.05.23 16:32:35, Bjorn Helgaas wrote:
> > On Tue, May 23, 2023 at 06:22:13PM -0500, Terry Bowman wrote:
> > > From: Robert Richter <redacted>
> > >
> > > In Restricted CXL Device (RCD) mode a CXL device is exposed as an RCiEP, but CXL downstream and upstream ports are not enumerated and not visible in the PCIe hierarchy. Protocol and link errors are sent to an RCEC.
> > >
> > > Restricted CXL host (RCH) downstream port-detected errors are signaled as internal AER errors, either Uncorrectable Internal Error (UIE) or Corrected Internal Error (CIE).
> >
> > From the parallelism with RCD above, I first thought that RCH devices were non-RCD mode and *were* enumerated as part of the PCIe hierarchy, but actually I suspect it's more like the following?
> >
> >   ... but CXL downstream and upstream ports are not enumerated and not visible in the PCIe hierarchy. Protocol and link errors from these non-enumerated ports are signaled as internal AER errors ... via a CXL RCEC.
>
> Exactly, except the RCEC is standard PCIe and also need not necessarily be on the same PCI bus as the CXL RCiEPs are.

So make it "RCEC" instead of "CXL RCEC", I guess? PCIe r6.0, sec 7.9.10.3, allows an RCEC to be associated with RCiEPs on different buses, so nothing to see there.
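
For readers unfamiliar with the association mentioned above, here is a rough userspace sketch of how an RCEC's Endpoint Association information can cover RCiEPs both on its own bus (a per-device bitmap) and on other buses (a bus number range). The struct and function names (rcec_ea_model, rcec_covers) are hypothetical and are not the kernel's data structures or API; this is an illustration under those assumptions, not the actual implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of an RCEC's endpoint association data.
 * This is NOT struct pci_dev or any kernel type. */
struct rcec_ea_model {
	uint8_t  bus;       /* bus number the RCEC itself is on */
	uint32_t bitmap;    /* bit N set: device N on the RCEC's bus is associated */
	uint8_t  nextbusn;  /* first additional associated bus */
	uint8_t  lastbusn;  /* last additional associated bus */
};

/* Return true if the RCiEP at (bus, dev) is associated with this RCEC. */
static bool rcec_covers(const struct rcec_ea_model *ea, uint8_t bus, uint8_t dev)
{
	/* Same bus as the RCEC: consult the per-device bitmap. */
	if (bus == ea->bus)
		return dev < 32 && (ea->bitmap & (UINT32_C(1) << dev)) != 0;

	/* Different bus: associated only if it falls inside the range.
	 * An empty range (nextbusn > lastbusn) matches nothing. */
	return bus >= ea->nextbusn && bus <= ea->lastbusn;
}
```

With bitmap bit 3 set and a range of 0x40..0x42, device 3 on the RCEC's own bus and any RCiEP on bus 0x41 would both be treated as associated, which is why the RCEC being on a different bus from the CXL RCiEPs is unremarkable.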

> > > The error source is the id of the RCEC.
> >
> > This seems odd; I assume this refers to the RCEC's AER Error Source Identification register, and the ERR_COR or ERR_FATAL/NONFATAL Source Identification would ordinarily be the Requester ID of the RCiEP that "sent" the Error Message. But you're saying it's actually the ID of the *RCEC*, not the RCiEP?
>
> Right, the downstream port has its own AER ext capability in non-config (io mapped) RCRB register range. Errors originating from there are signaled as internal AER errors via the RCEC *with* the RCEC's Requester ID. Code walks through all associated CXL endpoints, determines the dport and checks its AER. There is also an RDPAS structure defined in CXL but that is only a different way to provide the RCEC to dport association instead of using the RCEC's Endpoint Association Extended Capability. In the end we get all associated RCHs and check the AER of all their dports.
>
> The upstream port is signaled using the RCiEP's AER. CXL spec is strict here: "Upstream Port RCRB shall not implement the AER Extended Capability." The RCiEP's Requester ID is used then, and the AER capability is in the RCiEP's config space.
>
> CXL.cachemem errors are reported with the RCiEP as requester too. Status is in the CXL RAS cap and the UIE or CIE is set respectively in the AER status of the RCiEP.
>
> > We're going to call pci_aer_handle_error() as well, to handle the non-internal errors, and I'm pretty sure that path expects the RCiEP ID there.
> >
> > Whatever the answer, I'm not sure this sentence is actually relevant to this patch, since this patch doesn't read PCI_ERR_ROOT_ERR_SRC or look at struct aer_err_source.id.
>
> The source id is used in aer_process_err_devices() which finally calls handle_error_source() for the device with the requestor id. This is the place where cxl_rch_handle_error() checks if it is an RCEC that received an internal error and has cxl devices connected to it. Then, the request is forwarded to the cxl_mem handler which also needs to check the dport now.
>
> That is, pcie_walk_rcec() in cxl_rch_handle_error() is called with the RCEC's pci handle, cxl_rch_handle_error_iter() with the RCiEP's pci handle.

I'm still not sure this is relevant. Isn't that last sentence just the way we always use pcie_walk_rcec()?

If there's something *different* here about CXL, and it's important to this patch, sure. But I don't see that yet. Maybe a comment in the code if you think it's important to clarify something there.

Bjorn
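
As a reading aid for the flow discussed in this thread, the following is a minimal userspace model of the dispatch Robert describes: an internal error reported with the RCEC's ID triggers a walk over the associated RCiEPs, and for each CXL device the handler checks the downstream port's AER. All names here (model_dev, model_rch_handle_error, and so on) are hypothetical stand-ins and are not kernel types or functions; the real code paths are the ones named in the quoted text, and this sketch only illustrates their shape.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
enum model_type { MODEL_RCEC, MODEL_RCIEP };

struct model_dev {
	enum model_type type;
	bool is_cxl_mem;          /* RCiEP is a CXL device in RCD mode */
	bool dport_aer_pending;   /* status latched in the dport's RCRB AER */
	int dport_checks;         /* how often the dport AER was inspected */
	struct model_dev **assoc; /* RCEC only: associated RCiEPs */
	size_t nr_assoc;
};

/*
 * Model of the check done for the device named by the error source ID.
 * Only an RCEC that received an internal error fans out to its CXL
 * RCiEPs. Returns the number of dports found with pending AER status.
 */
static int model_rch_handle_error(struct model_dev *src, bool internal_error)
{
	int found = 0;

	if (src->type != MODEL_RCEC || !internal_error)
		return 0;

	/* Walk step: called with the RCEC, callback sees each RCiEP. */
	for (size_t i = 0; i < src->nr_assoc; i++) {
		struct model_dev *rciep = src->assoc[i];

		if (rciep->type != MODEL_RCIEP || !rciep->is_cxl_mem)
			continue;

		/* The CXL handler has to look at the dport's AER itself,
		 * because the error source ID only names the RCEC. */
		rciep->dport_checks++;
		if (rciep->dport_aer_pending)
			found++;
	}
	return found;
}
```

In this model the source ID alone cannot identify the failing downstream port, so every associated CXL RCiEP has its dport inspected; non-internal errors and non-RCEC sources fall through untouched, leaving them to the ordinary AER handling.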