Thread (30 messages) 30 messages, 3 authors, 2025-12-16

Re: [PATCH v6 3/5] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd

From: Shuai Xue <xueshuai@linux.alibaba.com>
Date: 2025-10-20 12:59:10
Also in: linux-pci, lkml


在 2025/10/20 18:10, Lukas Wunner 写道:
On Wed, Oct 15, 2025 at 10:41:57AM +0800, Shuai Xue wrote:
quoted
+++ b/drivers/pci/pcie/err.c
@@ -253,6 +254,16 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
  			pci_warn(bridge, "subordinate device reset failed\n");
  			goto failed;
  		}
+
+		/* Link recovered, report fatal errors of RCiEP or EP */
+		if (state == pci_channel_io_frozen &&
+		    (type == PCI_EXP_TYPE_ENDPOINT || type == PCI_EXP_TYPE_RC_END)) {
+			aer_add_error_device(&info, dev);
+			info.severity = AER_FATAL;
+			if (aer_get_device_error_info(&info, 0, true))
+				aer_print_error(&info, 0);
+			pci_dev_put(dev);
+		}
  	}
Hi, Lukas,
Where is the the pci_dev_get() to balance the pci_dev_put() here?
The corresponding pci_dev_get() is called in add_error_device(). Please
refer to commit 60271ab044a5 ("PCI/AER: Take reference on error
devices") which introduced this reference counting mechanism.
It feels awkward to leak AER-specific details into pcie_do_recovery().
That function is supposed to implement the flow described in
Documentation/PCI/pci-error-recovery.rst in a platform-agnostic way
so that powerpc (EEH) and s390 could conceivably take advantage of it.

Can you find a way to avoid this, e.g. report errors after
pcie_do_recovery() has concluded?
I understand your concern about keeping pcie_do_recovery()
platform-agnostic. I explored the possibility of reporting errors after
recovery concludes, but unfortunately, this approach isn't feasible due
to the recovery sequence. The issue is that most drivers'
pci_error_handlers implement .slot_reset() which internally calls
pci_restore_state() to restore the device's configuration space and
state. This function also clears the device's AER status registers:

   .slot_reset()
     => pci_restore_state()
       => pci_aer_clear_status()

Therefore, the only window to capture and report the original error
information is between link recovery (after reset_subordinates()) and
before .slot_reset() is called. Once .slot_reset() executes, the error
status is cleared and lost forever.
I'm also worried that errors are reported *during* recovery.
I imagine this looks confusing to a user.  The logged messages
should make it clear that these are errors that occurred *earlier*
and are reported belatedly.
You raise an excellent point about potential user confusion. The current
aer_print_error() interface doesn't indicate that these are historical
errors being reported belatedly. Would it be acceptable to add a
clarifying message before calling aer_print_error()? For example:

   pci_err(dev, "Reporting error that occurred before recovery:\n");

Alternatively, if you have suggestions for a better approach to make
this timing clear to users, I'd be happy to implement them.
Thanks,

Lukas
Thanks for valuable comments.

Best Regards,
Shuai
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help