Re: [PATCH v3 1/2] PCI/AER: Fix missing uevent on recovery when a reset is requested
From: Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
Date: 2025-07-31 17:04:48
Also in:
linux-pci, linux-s390, lkml
On 7/31/25 6:01 AM, Lukas Wunner wrote:
quoted hunk ↗ jump to hunk
On Wed, Jul 30, 2025 at 10:24:07PM +0200, Lukas Wunner wrote:quoted
On Wed, Jul 30, 2025 at 10:01:50PM +0200, Lukas Wunner wrote:quoted
On Wed, Jul 30, 2025 at 01:20:57PM +0200, Niklas Schnelle wrote:quoted
Since commit 7b42d97e99d3 ("PCI/ERR: Always report current recovery status for udev") AER uses the result of error_detected() as parameter to pci_uevent_ers(). As pci_uevent_ers() however does not handle PCI_ERS_RESULT_NEED_RESET this results in a missing uevent for the beginning of recovery if drivers request a reset. Fix this by treating PCI_ERS_RESULT_NEED_RESET as beginning recovery.[...]quoted
+++ b/drivers/pci/pci-driver.c@@ -1592,6 +1592,7 @@ void pci_uevent_ers(struct pci_dev *pdev, enum pci_ers_result err_type) switch (err_type) { case PCI_ERS_RESULT_NONE: case PCI_ERS_RESULT_CAN_RECOVER: + case PCI_ERS_RESULT_NEED_RESET: envp[idx++] = "ERROR_EVENT=BEGIN_RECOVERY"; envp[idx++] = "DEVICE_ONLINE=0"; break;I note that PCI_ERS_RESULT_NO_AER_DRIVER is also missing in that switch/case statement. I guess for the patch to be complete, it needs to be added to the PCI_ERS_RESULT_DISCONNECT case. Do you agree?I realize now there's a bigger problem here: In pcie_do_recovery(), when control reaches the "failed:" label, a uevent is only signaled for the *bridge*. Shouldn't a uevent instead be signaled for every device *below* the bridge? (And possibly the bridge itself if it was the device reporting the error.)The small patch below should resolve this issue. Please let me know what you think.quoted
In that case you don't need to add PCI_ERS_RESULT_NO_AER_DRIVER to the switch/case statement because we wouldn't want to have multiple uevents reporting disconnect, so the one emitted below the "failed:" label would be sufficient.I'll send a separate Reviewed-by for your original patch as the small patch below should resolve my concern about PCI_ERS_RESULT_NO_AER_DRIVER.quoted
This all looks so broken that I'm starting to wonder if there's any user space application at all that takes advantage of these uevents?I'd still be interested to know which user space application you're using to track these uevents? Thanks, Lukas -- >8 --diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index e795e5ae..3a95aa2 100644 --- a/drivers/pci/pcie/err.c +++ b/drivers/pci/pcie/err.c@@ -165,6 +165,12 @@ static int report_resume(struct pci_dev *dev, void *data) return 0; } +static int report_disconnect(struct pci_dev *dev, void *data) +{ + pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT); + return 0; +}
Since you are notifying the user space, I am wondering whether the drivers should be notified about the recovery failure?
quoted hunk ↗ jump to hunk
+ /** * pci_walk_bridge - walk bridges potentially AER affected * @bridge: bridge which may be a Port, an RCEC, or an RCiEP@@ -272,7 +278,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev, failed: pci_walk_bridge(bridge, pci_pm_runtime_put, NULL); - pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT); + pci_walk_bridge(bridge, report_disconnect, NULL); pci_info(bridge, "device recovery failed\n");
-- Sathyanarayanan Kuppuswamy Linux Kernel Developer