Re: [PATCH v6 3/5] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd


From: Shuai Xue <xueshuai@linux.alibaba.com>
Date: 2025-10-24 06:44:06
Also in: linux-pci, lkml


On 2025/10/23 18:48, Lukas Wunner wrote:

On Mon, Oct 20, 2025 at 11:20:58PM +0800, Shuai Xue wrote:

quoted

On 2025/10/20 22:24, Lukas Wunner wrote:

quoted

On Mon, Oct 20, 2025 at 10:17:10PM +0800, Shuai Xue wrote:

quoted

     .slot_reset()
       => pci_restore_state()
         => pci_aer_clear_status()

This was added in 2015 by b07461a8e45b.  The commit claims that
the errors are stale and can be ignored.  It turns out they cannot.

So maybe pci_restore_state() should print information about the
errors before clearing them?

While that could work, we would lose the error severity information at

Wait, we've got that saved in pci_cap_saved_state, so we could restore
the severity register, report leftover errors, then clear those errors?

You're right that the severity register is also sticky, so we could
retrieve error severity directly from AER registers.
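The idea of recovering severity from the registers can be illustrated with a minimal user-space sketch. It is not kernel code: classify_leftover() and struct leftover_errors are hypothetical names, and the sketch assumes the caller has already read the AER Uncorrectable Error Status, Mask and Severity registers into plain integers. Whether masked bits should be skipped is also an assumption made here. The bit positions follow the PCIe AER register layout.

```c
#include <stdint.h>

/* Bit positions in the AER Uncorrectable Error Status/Mask/Severity
 * registers. The macro names are local to this sketch. */
#define UNC_DLP        (1u << 4)   /* Data Link Protocol Error */
#define UNC_POISON_TLP (1u << 12)  /* Poisoned TLP */
#define UNC_COMP_TIME  (1u << 14)  /* Completion Timeout */
#define UNC_MALF_TLP   (1u << 18)  /* Malformed TLP */

/* Hypothetical result type: leftover errors split by severity. */
struct leftover_errors {
	uint32_t fatal;
	uint32_t nonfatal;
};

/*
 * Split the status bits into fatal and non-fatal sets.
 * A severity bit of 1 marks the matching error as fatal.
 * Masked bits are dropped (an assumption of this sketch).
 */
static struct leftover_errors classify_leftover(uint32_t status,
						uint32_t mask,
						uint32_t severity)
{
	uint32_t pending = status & ~mask;
	struct leftover_errors e = {
		.fatal = pending & severity,
		.nonfatal = pending & ~severity,
	};

	return e;
}
```

With status = UNC_DLP | UNC_COMP_TIME and severity = UNC_DLP | UNC_MALF_TLP, the Data Link Protocol error lands in the fatal set and the Completion Timeout in the non-fatal set.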

However, I have concerns about implementing this approach:

[...]

quoted

3. Architectural consistency: As you noted earlier, "pci_restore_state()
is only supposed to restore state, as the name implies, and not clear
errors." Adding error reporting to this function would further violate
this principle - we'd be making it do even more than just restore state.

Would you prefer I implement this broader change, or shall we proceed
with the targeted helper function approach for now? The helper function
solves the immediate problem while keeping the changes focused on the
AER recovery path.

My opinion is that b07461a8e45b was wrong and that reported errors
should not be silently ignored.

Thanks for your input and for discussing the history of commit
b07461a8e45b. I understand its intention to ignore errors specifically
during enumeration. As far as I know, AdvNonFatalErr events can occur in
this phase and typically should be ignored to simplify handling.

What I'd prefer is that if
pci_restore_state() discovers unreported errors, it asks the AER driver
to report them.

We've already got a helper to do that:  aer_recover_queue()
It queues up an entry in AER's kfifo and asks AER to report it.

So far the function is only used by GHES.  GHES allocates the
aer_regs argument from ghes_estatus_pool using gen_pool_alloc().
Consequently aer_recover_work_func() uses ghes_estatus_pool_region_free()
to free the allocation.  That prevents using aer_recover_queue()
for anything else than GHES.  It would first be necessary to
refactor aer_recover_queue() + aer_recover_work_func() such that
it can cope with arbitrary allocations (e.g. kmalloc()).
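One possible shape for such a refactoring, shown as a self-contained user-space model rather than a kernel patch: each queued entry carries its own release callback, so the worker no longer needs to know which allocator produced the register snapshot. Every name below (recover_entry, recover_ring, ring_put(), ring_get(), recover_work(), pool_release()) is hypothetical, and the small ring buffer only stands in for the kfifo.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical queue entry: the producer says how to free regs. */
struct recover_entry {
	void *regs;                  /* snapshot of AER registers */
	void (*release)(void *regs); /* allocator-specific free */
};

#define RING_SIZE 8 /* power of two, so index wrap-around stays consistent */

/* Minimal single-threaded ring buffer standing in for the kfifo. */
struct recover_ring {
	struct recover_entry slot[RING_SIZE];
	unsigned int head, tail;
};

/* Returns 1 on success, 0 if the ring is full. */
static int ring_put(struct recover_ring *r, struct recover_entry e)
{
	if (r->head - r->tail == RING_SIZE)
		return 0;
	r->slot[r->head++ % RING_SIZE] = e;
	return 1;
}

/* Returns 1 and fills *e on success, 0 if the ring is empty. */
static int ring_get(struct recover_ring *r, struct recover_entry *e)
{
	if (r->head == r->tail)
		return 0;
	*e = r->slot[r->tail++ % RING_SIZE];
	return 1;
}

/* Stand-in for a pool-based free; it only counts calls. */
static int pool_frees;
static void pool_release(void *regs)
{
	(void)regs;
	pool_frees++;
}

/*
 * Worker: report each entry, then free it through the callback the
 * producer supplied. Returns the number of entries processed.
 */
static unsigned int recover_work(struct recover_ring *r,
				 void (*report)(const void *regs))
{
	struct recover_entry e;
	unsigned int n = 0;

	while (ring_get(r, &e)) {
		if (report)
			report(e.regs);
		if (e.release)
			e.release(e.regs);
		n++;
	}
	return n;
}
```

A pool-backed producer would queue { regs, pool_release }, while a heap-backed caller could queue { malloc(size), free }; the worker treats both the same way.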

I agree that aer_recover_queue() and aer_recover_work_func() offer a
generalized way to report errors.

However, I’d like to highlight some concerns regarding error discovery
during pci_restore_state():

- Errors During Enumeration via Hotplug: Errors such as AdvNonFatalErr
   seen during enumeration or hotplug are generally intended to be
   ignored, as handling them adds unnecessary complexity without
   practical benefits.

- Errors During Downstream Port Containment (DPC): When an error is
   detected and not masked, it is expected to propagate through the usual
   AER path, either reported directly to the OS or to the firmware.
   Finally, these errors should be cleared and reported in a single
   cohesive step.

For fatal errors missed during DPC, queuing additional work to report
them via aer_recover_queue() could introduce significant overhead.
Specifically, it may cause the bus and the device to be reset a second
time, which would unnecessarily disrupt system operation.

Do we really need such a heavyweight approach?

I would appreciate more feedback from the community on whether queuing
another recovery task for errors detected during pci_restore_state()

Thanks
Shuai
