Re: [PATCH v6 3/5] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd
From: Lukas Wunner <lukas@wunner.de>
Date: 2025-10-23 10:48:47
Also in:
linux-pci, lkml
On Mon, Oct 20, 2025 at 11:20:58PM +0800, Shuai Xue wrote:
2025/10/20 22:24, Lukas Wunner:quoted
On Mon, Oct 20, 2025 at 10:17:10PM +0800, Shuai Xue wrote:quoted
quoted
quoted
.slot_reset() => pci_restore_state() => pci_aer_clear_status()This was added in 2015 by b07461a8e45b. The commit claims that the errors are stale and can be ignored. It turns out they cannot. So maybe pci_restore_state() should print information about the errors before clearing them?While that could work, we would lose the error severity information atWait, we've got that saved in pci_cap_saved_state, so we could restore the severity register, report leftover errors, then clear those errors?You're right that the severity register is also sticky, so we could retrieve error severity directly from AER registers. However, I have concerns about implementing this approach:
[...]
3. Architectural consistency: As you noted earlier, "pci_restore_state() is only supposed to restore state, as the name implies, and not clear errors." Adding error reporting to this function would further violate this principle - we'd be making it do even more than just restore state. Would you prefer I implement this broader change, or shall we proceed with the targeted helper function approach for now? The helper function solves the immediate problem while keeping the changes focused on the AER recovery path.
My opinion is that b07461a8e45b was wrong and that reported errors should not be silently ignored. What I'd prefer is that if pci_restore_state() discovers unreported errors, it asks the AER driver to report them. We've already got a helper to do that: aer_recover_queue() It queues up an entry in AER's kfifo and asks AER to report it. So far the function is only used by GHES. GHES allocates the aer_regs argument from ghes_estatus_pool using gen_pool_alloc(). Consequently aer_recover_work_func() uses ghes_estatus_pool_region_free() to free the allocation. That prevents using aer_recover_queue() for anything else than GHES. It would first be necessary to refactor aer_recover_queue() + aer_recover_work_func() such that it can cope with arbitrary allocations (e.g. kmalloc()). Thanks, Lukas