Thread: 12 messages, 6 authors, 2023-04-18

Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling

From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Date: 2023-04-14 11:55:51
Also in: linux-cxl, linux-pci, lkml

On Fri, 14 Apr 2023 13:21:37 +0200
Robert Richter [off-list ref] wrote:
On 13.04.23 15:52:36, Ira Weiny wrote:
quoted
Jonathan Cameron wrote:  
quoted
On Wed, 12 Apr 2023 16:29:01 -0500
Bjorn Helgaas [off-list ref] wrote:
  
quoted
On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote:  
quoted
From: Robert Richter <redacted>
  
quoted
quoted
quoted
quoted
+static int __cxl_unmask_internal_errors(struct pci_dev *rcec)
+{
+	int aer, rc;
+	u32 mask;
+
+	/*
+	 * Internal errors are masked by default, unmask RCEC's here
+	 * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
+	 * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
+	 */    
Unmasking internal errors doesn't have anything specific to do with
CXL, so I don't think it should have "cxl" in the function name.
Maybe something like "pci_aer_unmask_internal_errors()".  
This reminds me.  Not sure we resolved earlier discussion on changing
the system wide policy to turn these on 
https://lore.kernel.org/linux-cxl/20221229172731.GA611562@bhelgaas/
which needs pretty much the same thing.

Ira, I think you were picking this one up?
https://lore.kernel.org/linux-cxl/63e5fb533f304_13244829412@iweiny-mobl.notmuch/
After this discussion I posted an RFC to enable those errors.

https://lore.kernel.org/all/20230209-cxl-pci-aer-v1-1-f9a817fa4016@intel.com/
Ah. I'd forgotten that thread. Thanks!
quoted
Unfortunately the prevailing opinion was that this was unsafe.  And no one
piped up with a reason to pursue the alternative of a pci core call to enable
them as needed.

So I abandoned the work.

I think the direction things were headed was to have a call like:

int pci_enable_pci_internal_errors(struct pci_dev *dev)
{
	int pos_cap_err;
	u32 reg;

	if (!pcie_aer_is_native(dev))
		return -EIO;

	pos_cap_err = dev->aer_cap;

	/* Unmask correctable and uncorrectable (non-fatal) internal errors */
	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, &reg);
	reg &= ~PCI_ERR_COR_INTERNAL;
	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, reg);
	
	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, &reg);
	reg &= ~PCI_ERR_UNC_INTN;
	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, reg);
	
	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, &reg);
	reg &= ~PCI_ERR_UNC_INTN;
	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, reg);

	return 0;
}

... and call this from the cxl code where it is needed.  
The version I have ready after addressing Bjorn's comments is pretty
much the same, apart from error checking of the read/writes.
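
For readers following along, here is a rough user-space sketch of what
"the same, plus error checking of the read/writes" could look like. It is
not the version referred to above. struct mock_dev, the mock_* helpers, the
fail_at injection knob and the -ENODEV/-EIO return values are all
assumptions made for illustration; only the register offsets and bit
values are taken from include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Offsets and bits as in include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNCOR_MASK	0x08
#define PCI_ERR_UNCOR_SEVER	0x0c
#define PCI_ERR_COR_MASK	0x14
#define PCI_ERR_UNC_INTN	0x00400000u
#define PCI_ERR_COR_INTERNAL	0x00004000u

/* Hypothetical user-space stand-in for struct pci_dev. */
struct mock_dev {
	int aer_cap;		/* AER capability offset, 0 if absent */
	int fail_at;		/* fail the Nth config access, -1 for never */
	int accesses;		/* config accesses issued so far */
	uint8_t cfg[4096];	/* fake extended config space */
};

/* Count the access and fail it if it is the one selected by fail_at. */
static int mock_cfg_access(struct mock_dev *dev, int where)
{
	assert(where >= 0 && where + 4 <= (int)sizeof(dev->cfg));
	return dev->accesses++ == dev->fail_at ? -EIO : 0;
}

static int mock_read_dword(struct mock_dev *dev, int where, uint32_t *val)
{
	int rc = mock_cfg_access(dev, where);

	if (!rc)
		memcpy(val, &dev->cfg[where], sizeof(*val));
	return rc;
}

static int mock_write_dword(struct mock_dev *dev, int where, uint32_t val)
{
	int rc = mock_cfg_access(dev, where);

	if (!rc)
		memcpy(&dev->cfg[where], &val, sizeof(val));
	return rc;
}

/* Read-modify-write that clears @bits and reports the first failure. */
static int mock_clear_dword(struct mock_dev *dev, int where, uint32_t bits)
{
	uint32_t reg;
	int rc = mock_read_dword(dev, where, &reg);

	if (rc)
		return rc;
	return mock_write_dword(dev, where, reg & ~bits);
}

static int mock_unmask_internal_errors(struct mock_dev *dev)
{
	int aer = dev->aer_cap;
	int rc;

	if (!aer)
		return -ENODEV;

	rc = mock_clear_dword(dev, aer + PCI_ERR_COR_MASK, PCI_ERR_COR_INTERNAL);
	if (rc)
		return rc;
	/* Mark the uncorrectable internal error non-fatal before unmasking. */
	rc = mock_clear_dword(dev, aer + PCI_ERR_UNCOR_SEVER, PCI_ERR_UNC_INTN);
	if (rc)
		return rc;
	return mock_clear_dword(dev, aer + PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_INTN);
}
```

The sketch stops at the first failing access, so a failure partway through
can leave some registers already updated. Whether that is acceptable is a
policy question this sketch does not try to answer.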

Going by your proposed patch, you will need it in aer.c too, and we do
not need to export it.
I think for the other components we'll want to call it from cxl_pci_ras_unmask()
so an export is needed.

I also wonder if a more generic function would be better, as it seems likely
that similar code will be needed for errors other than this pair.
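
To make the "more generic function" idea concrete, one possible shape is
sketched below on a plain in-memory register copy. The struct
mock_aer_regs type and both mock_aer_* names are made up for this example.
Nothing here is an existing PCI core interface, and a real version would
go through config space accessors rather than a struct.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_ERR_UNC_INTN	0x00400000u	/* from pci_regs.h */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* from pci_regs.h */

/* Hypothetical in-memory copy of the three AER registers involved. */
struct mock_aer_regs {
	uint32_t cor_mask;
	uint32_t uncor_mask;
	uint32_t uncor_sever;
};

/*
 * Unmask an arbitrary set of correctable and uncorrectable errors.
 * The uncorrectable ones are also marked non-fatal, mirroring the
 * pci_enable_pci_internal_errors() snippet quoted earlier.
 */
static void mock_aer_unmask(struct mock_aer_regs *regs,
			    uint32_t cor_bits, uint32_t uncor_bits)
{
	assert(regs);
	regs->cor_mask &= ~cor_bits;
	regs->uncor_sever &= ~uncor_bits;
	regs->uncor_mask &= ~uncor_bits;
}

/* The internal-error pair then becomes a thin wrapper. */
static void mock_aer_unmask_internal(struct mock_aer_regs *regs)
{
	mock_aer_unmask(regs, PCI_ERR_COR_INTERNAL, PCI_ERR_UNC_INTN);
}
```

Passing the bit sets as parameters keeps the internal-error case to one
line, and other error types would only need a different pair of masks.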

This patch only enables it for (CXL) RCECs. You might want to extend
this for CXL endpoints (and ports?) then.
Definitely.  We have the same limitation you are seeing.  No errors
without turning this on.

Jonathan


quoted
Is this an acceptable direction?  Terry is welcome to steal the above from my
patch and throw it into the PCI core.

Looking at the current state of things I think cxl_pci_ras_unmask() may
actually be broken now without calling something like the above.  For that I
dropped the ball.  
Thanks,

-Robert
quoted
Ira  
  