Re: [PATCH 12/14] powerpc/eeh: Add debugfs interface to run an EEH check
From: "Oliver O'Halloran" <oohall@gmail.com>
Date: 2019-09-17 03:40:23
On Tue, Sep 17, 2019 at 1:16 PM Sam Bobroff [off-list ref] wrote:
On Tue, Sep 03, 2019 at 08:16:03PM +1000, Oliver O'Halloran wrote:
Detecting a frozen EEH PE usually occurs when an MMIO load returns a 0xFFs response. When performing EEH testing using the EEH error injection feature available on some platforms there is no simple way to kick off the kernel's recovery process, since any accesses from userspace (usually /dev/mem) will bypass the MMIO helpers in the kernel which check if a 0xFF response is due to an EEH freeze or not.

If a device contains a 0xFF byte in its config space it's possible to trigger the recovery process via a config space read from userspace, but this is not a reliable method. If a driver is bound to the device and in use it will frequently trigger the MMIO check, but this is also inconsistent.

To solve these problems this patch adds a debugfs file called "eeh_dev_check" which accepts a <domain>:<bus>:<dev>.<fn> string and runs eeh_dev_check_failure() on it. This is the same check that's done when the kernel gets a 0xFF result from a config or MMIO read, with the added benefit that it can be reliably triggered from userspace.

Signed-off-by: Oliver O'Halloran <oohall@gmail.com>

Looks good, and I tested it with the next patch and it seems to work. But I think you should make it clear that this does not work with the hardware "EEH error injection" facility accessible via debugfs in err_injct (that doesn't seem clear to me from the commit message).
It's not intended to be a separate mechanism in the long term. I'm planning on converting this interface to make use of the platform-defined error injection mechanism once I can figure out how to use the PAPR ones reliably. The idea is to use this as a generic "cause an EEH to happen on this device" interface for userspace which we can use in test scripts and the like.
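
For anyone wanting to drive this from a test program rather than a shell, here is a rough userspace sketch of what the commit message describes: build a <domain>:<bus>:<dev>.<fn> string and write it to the "eeh_dev_check" file. The path below is an assumption (debugfs mounted at /sys/kernel/debug, with the file under the powerpc directory), as is the choice of zero-padded hex fields with no trailing newline. The helper names format_pci_addr() and eeh_dev_check() are made up for this example and are not part of the patch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Assumed location of the file; adjust to wherever debugfs is mounted. */
#define EEH_DEV_CHECK_PATH "/sys/kernel/debug/powerpc/eeh_dev_check"

/*
 * Format a PCI address as <domain>:<bus>:<dev>.<fn> using hex fields.
 * Returns 0 on success, -1 if a field is out of range or buf is too small.
 */
int format_pci_addr(char *buf, size_t len, unsigned int domain,
                    unsigned int bus, unsigned int dev, unsigned int fn)
{
    int n;

    assert(buf != NULL);

    /* PCI allows 256 buses, 32 devices per bus, 8 functions per device. */
    if (bus > 0xff || dev > 0x1f || fn > 0x7)
        return -1;

    n = snprintf(buf, len, "%04x:%02x:%02x.%x", domain, bus, dev, fn);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Write the address string to the given file to request an EEH check.
 * Returns 0 if the whole string was written, -1 otherwise.
 */
int eeh_dev_check(const char *path, const char *addr)
{
    size_t want = strlen(addr);
    ssize_t done;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;

    done = write(fd, addr, want);
    close(fd);

    return (done == (ssize_t)want) ? 0 : -1;
}
```

A test script would call format_pci_addr() for the device under test and pass the result to eeh_dev_check(EEH_DEV_CHECK_PATH, addr), most likely as root since debugfs is normally root-only. Taking the path as a parameter keeps the formatting logic testable on machines that don't have the file.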