Re: [PATCH] powerpc/eeh: Only dump stack once if an MMIO loop is detected
From: Sam Bobroff <hidden>
Date: 2019-10-16 03:46:45
On Wed, Oct 16, 2019 at 12:25:36PM +1100, Oliver O'Halloran wrote:
Many drivers don't check for errors when they get a 0xFFs response from an MMIO load. As a result, after an EEH event occurs a driver can get stuck in a polling loop unless it has some kind of internal timeout logic.

Currently EEH tries to detect and report stuck drivers by dumping a stack trace after eeh_dev_check_failure() is called EEH_MAX_FAILS times on an already frozen PE. The value of EEH_MAX_FAILS was chosen so that a dump would occur every few seconds if the driver was spinning in a loop. This results in a lot of spurious stack traces in the kernel log.

Fix this by limiting it to printing one stack trace for each PE freeze. If the driver is truly stuck the kernel's hung task detector is better suited to reporting the probelm anyway.
problem
Cc: Sam Bobroff <redacted>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Looks good to me (especially because if it's stuck in a loop the stack trace is going to be pretty much the same every time).

I tested it by recovering a device that uses the mlx5_core driver.

Reviewed-by: Sam Bobroff <redacted>
Tested-by: Sam Bobroff <redacted>
---
 arch/powerpc/kernel/eeh.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index bc8a551013be..c35069294ecf 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -503,7 +503,7 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
 	rc = 1;
 	if (pe->state & EEH_PE_ISOLATED) {
 		pe->check_count++;
-		if (pe->check_count % EEH_MAX_FAILS == 0) {
+		if (pe->check_count == EEH_MAX_FAILS) {
 			dn = pci_device_to_OF_node(dev);
 			if (dn)
 				location = of_get_property(dn, "ibm,loc-code",
-- 
2.21.0
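
For readers following the thread, the effect of the one-line change can be sketched outside the kernel. This is only an illustration of the two conditions in the hunk, not the kernel code: SKETCH_MAX_FAILS is a hypothetical small threshold chosen for readability, the helper names are made up for this sketch, and it assumes the per-PE counter starts from zero at each freeze.

```c
#include <stdbool.h>

/* Hypothetical threshold for illustration only; not the kernel's EEH_MAX_FAILS. */
#define SKETCH_MAX_FAILS 5

/* Old condition: true on every multiple of the threshold. */
bool should_dump_old(unsigned int check_count)
{
	return check_count % SKETCH_MAX_FAILS == 0;
}

/* New condition: true only when the threshold is first reached. */
bool should_dump_new(unsigned int check_count)
{
	return check_count == SKETCH_MAX_FAILS;
}

/*
 * Simulate `checks` consecutive failed checks against one frozen PE and
 * return how many stack dumps the given condition would trigger.
 * Assumption: the counter is zero when the freeze begins.
 */
unsigned int count_dumps(unsigned int checks, bool (*should_dump)(unsigned int))
{
	unsigned int check_count = 0;
	unsigned int dumps = 0;
	unsigned int i;

	for (i = 0; i < checks; i++) {
		check_count++;
		if (should_dump(check_count))
			dumps++;
	}
	return dumps;
}
```

Under these assumptions, a driver that keeps polling makes the old condition fire again each time the count passes another multiple of the threshold, while the new condition fires at most once per freeze, which is the behaviour the commit message describes.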