Re: [PATCH 2/2] ACPI: extlog: Trace CPER PCI Express Error Section
From: Dan Williams <hidden>
Date: 2024-08-07 20:31:38
Also in:
linux-acpi, linux-pci, lkml
Dan Williams wrote:
[ add Boris ]
[ actually add Boris ] Boris, see below, thoughts on deprecating acpi_extlog...
Bjorn Helgaas wrote:quoted
On Mon, May 27, 2024 at 04:43:41PM +0200, Fabio M. De Francesco wrote:quoted
Currently, extlog_print() (ELOG) only reports CPER PCIe section (UEFI v2.10, Appendix N.2.7) to the kernel log via print_extlog_rcd(). Instead, the similar ghes_do_proc() (GHES) prints to kernel log and calls pci_print_aer() to report via the ftrace infrastructure. Add support to report the CPER PCIe Error section also via the ftrace infrastructure by calling pci_print_aer() to make ELOG act consistently with GHES. Cc: Dan Williams <redacted> Signed-off-by: Fabio M. De Francesco <redacted> --- drivers/acpi/acpi_extlog.c | 30 ++++++++++++++++++++++++++++++ drivers/pci/pcie/aer.c | 2 +- include/linux/aer.h | 13 +++++++++++-- 3 files changed, 42 insertions(+), 3 deletions(-)diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c index e025ae390737..007ce96f8672 100644 --- a/drivers/acpi/acpi_extlog.c +++ b/drivers/acpi/acpi_extlog.c@@ -131,6 +131,32 @@ static int print_extlog_rcd(const char *pfx, return 1; } +static void extlog_print_pcie(struct cper_sec_pcie *pcie_err, + int severity) +{ + struct aer_capability_regs *aer; + struct pci_dev *pdev; + unsigned int devfn; + unsigned int bus; + int aer_severity; + int domain; + + if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID && + pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) { + aer_severity = cper_severity_to_aer(severity); + aer = (struct aer_capability_regs *)pcie_err->aer_info; + domain = pcie_err->device_id.segment; + bus = pcie_err->device_id.bus; + devfn = PCI_DEVFN(pcie_err->device_id.device, + pcie_err->device_id.function); + pdev = pci_get_domain_bus_and_slot(domain, bus, devfn); + if (!pdev) + return; + pci_print_aer(pdev, aer_severity, aer); + pci_dev_put(pdev); + }I'm 100% in favor of making error reporting work and look the same across GHES and ELOG. But I do have to gripe a bit... It's already unfortunate that GHES and the native AER handling are separate paths that lead to the same place (__aer_print_error()). I'm sorry that we need to add a third path that again does fundamentally the same thing. The fact that they're separate means all the design, reviewing, testing, and maintenance effort is diluted, and error handling always gets too little love in the first place. I think this is a recipe for confusion. ghes_do_proc # GHES apei_estatus_for_each_section ... if (guid_equal(sec_type, &CPER_SEC_PCIE)) ghes_handle_aer cper_severity_to_aer aer_recover_queue kfifo_in_spinlocked(&aer_recover_ring) # add to queue aer_recover_work_func # another thread kfifo_get(&aer_recover_ring) # remove from queue pci_print_aer __aer_print_error <--- aer_irq # native AER kfifo_put(&aer_fifo) # add to queue aer_isr # another thread kfifo_get(&aer_fifo) # remove from queue ... aer_isr_one_error aer_process_err_devices aer_print_error __aer_print_error <--- extlog_print # extlog (x86 only) apei_estatus_for_each_section ... if (guid_equal(sec_type, &CPER_SEC_PCIE)) extlog_print_pcie cper_severity_to_aer pci_get_domain_bus_and_slot pci_print_aer __aer_print_error <--- And we also have CXL paths that lead to __aer_print_error(), although it seems like they at least start in the native AER (and maybe GHES?) path and branch out somewhere. My head is spinning. Do I *object* to this patch? No, not really; it's a trivial change in drivers/pci/, and Rafael can add my Acked-by: Bjorn Helgaas [off-list ref] as needed. But I am afraid we're making ourselves a maintenance headache.To be honest, I am too. Upon discovering that extlog_print() behaves differently than ghes_do_proc(), I had the snarky thought "great, can we now just go ahead and deprecate the extlog path, it's just a source of maintenance pain.". So *if*we keep acpi_extlog it then I definitely think it should be consistent with other CPER handlers (needs this patch). But, I am also open to entertaining "deprecate it". To me, the fact that ras_userspace_consumers() is only honored for acpi_extlog is clear evidence that the kernel has already painted itself into a confusing user ABI corner and maybe the proper path forward at this point is to cut acpi_extlog loose.