Thread (6 messages) 6 messages, 4 authors, 2023-06-01

Re: [PATCH v4 22/23] PCI/AER: Forward RCH downstream port-detected errors to the CXL.mem dev handler

From: Bjorn Helgaas <helgaas@kernel.org>
Date: 2023-05-25 22:01:11
Also in: linux-cxl, linux-pci, lkml

On Thu, May 25, 2023 at 11:29:58PM +0200, Robert Richter wrote:
eOn 24.05.23 16:32:35, Bjorn Helgaas wrote:
quoted
On Tue, May 23, 2023 at 06:22:13PM -0500, Terry Bowman wrote:
quoted
From: Robert Richter <redacted>

In Restricted CXL Device (RCD) mode a CXL device is exposed as an
RCiEP, but CXL downstream and upstream ports are not enumerated and
not visible in the PCIe hierarchy. Protocol and link errors are sent
to an RCEC.

Restricted CXL host (RCH) downstream port-detected errors are signaled
as internal AER errors, either Uncorrectable Internal Error (UIE) or
Corrected Internal Errors (CIE). 
From the parallelism with RCD above, I first thought that RCH devices
were non-RCD mode and *were* enumerated as part of the PCIe hierarchy,
but actually I suspect it's more like the following?

  ... but CXL downstream and upstream ports are not enumerated and not
  visible in the PCIe hierarchy.

  Protocol and link errors from these non-enumerated ports are
  signaled as internal AER errors ... via a CXL RCEC.
Exactly, except the RCEC is standard PCIe and also must not
necessarily on the same PCI bus as the CXL RCiEPs are.
So make it "RCEC" instead of "CXL RCEC", I guess?  PCIe r6.0, sec
7.9.10.3, allows an RCEC to be associated with RCiEPs on different
buses, so nothing to see there.
quoted
quoted
The error source is the id of the RCEC.
This seems odd; I assume this refers to the RCEC's AER Error Source
Identification register, and the ERR_COR or ERR_FATAL/NONFATAL Source
Identification would ordinarily be the Requester ID of the RCiEP that
"sent" the Error Message.  But you're saying it's actually the ID of
the *RCEC*, not the RCiEP?
Right, the downstream port has its own AER ext capability in
non-config (io mapped) RCRB register range. Errors originating from
there are signaled as internal AER errors via the RCEC *with* the
RCEC's Requester ID. Code walks through all associated CXL endpoints,
determines the dport and checks its AER.

There is also an RDPAS structure defined in CXL but that is only a
different way to provide the RCEC to dport association instead of
using the RCEC's Endpoint Association Extended Capability. In the end
we get all associated RCHs and check the AER of all their dports.

The upstream port is signaled using the RCiEP's AER. CXL spec is
strict here: "Upstream Port RCRB shall not implement the AER Extended
Capability." The RCiEP's requestor ID is used then and its config
space the AER is in.

CXL.cachemem errors are reported with the RCiEP as requester
too. Status is in the CXL RAS cap and the UIE or CIE is set
respectively in the AER status of the RCiEP.
quoted
We're going to call pci_aer_handle_error() as well, to handle the
non-internal errors, and I'm pretty sure that path expects the RCiEP
ID there.

Whatever the answer, I'm not sure this sentence is actually relevant
to this patch, since this patch doesn't read PCI_ERR_ROOT_ERR_SRC or
look at struct aer_err_source.id.
The source id is used in aer_process_err_devices() which finally calls
handle_error_source() for the device with the requestor id. This is
the place where cxl_rch_handle_error() checks if it is an RCEC that
received an internal error and has cxl devices connected to it. Then,
the request is forwarded to the cxl_mem handler which also needs to
check the dport now. That is, pcie_walk_rcec() in
cxl_rch_handle_error() is called with the RCEC's pci handle,
cxl_rch_handle_error_iter() with the RCiEP's pci handle.
I'm still not sure this is relevant.  Isn't that last sentence just
the way we always use pcie_walk_rcec()?

If there's something *different* here about CXL, and it's important to
this patch, sure.  But I don't see that yet.  Maybe a comment in the
code if you think it's important to clarify something there.

Bjorn
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help