Re: [PATCH v2 2/8] powerpc/eeh: More relexed hotplug criterion
From: Gavin Shan <hidden>
Date: 2015-10-13 04:29:30
On Tue, Oct 13, 2015 at 01:48:54PM +1100, Daniel Axtens wrote:
Gavin Shan [off-list ref] writes:quoted
Danienl, The issue is tracked by IBM's bugzilla 127612 reported from Nvida private GPU drivers. I tried to find the source code from upstream kernel, but failed.OK. So I've read the internal bug, and I'm going to do my best to summarise without including confidential info. 1) A PHB with 2 devices is fenced via error injection. 2) The error_detected() callback is run on both devices. One returns CAN_RECOVER, the other returns NONE. We then fall through to partial-hotplug handling. (BTW this isn't documented in Documentation/PCI/pci-error-recovery.txt, so at some point this should be fixed!)
No hotplug is triggered when EEH core receives CAN_RECOVER. It seems the bug brought confusion instead of helping to explain the situation as I intended to. I was intended to say: there has driver which implements part of the EEH callbacks to collect diag-data.
Partial hotplug is detected by the presence of an err_handler, not by storing the result of error_detected. Would it be better to store the result from eeh_report_error in the eeh_dev structure, rather than by looking at more elements of the err_handler structure?
I don't see the benefit to do that. In eeh_report_error(), the specific error handlers still need to be checked and the result (from the check) is temporary, and not worthy to store that in eeh_dev. The current code looks good.
More generally, drivers using error_detect and then returning NONE as a way to get data and then not participate in EEH is a hack, and it's not surprising it's breaking in odd ways, especially with partial hotplug.
I think you're talking about the situation reported from the bug. It's CAN_RECOVER instead of NONE returned from error_detected(). With the CAN_RECOVER, the driver hopes the EEH core to enable the IO path so that it can collect diag-data from IO space at late point.
Partial hotplug is pretty hacky to begin with, and a driver being able to opt out of EEH selectively is a useful feature, so we probably want to redesign the state machine to handle them both better. That would be a long term project.
Thanks, Gavin
quoted
quoted
quoted
Signed-off-by: Gavin Shan <redacted> --- arch/powerpc/kernel/eeh_driver.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-)diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c index 3a626ed..32178a4 100644 --- a/arch/powerpc/kernel/eeh_driver.c +++ b/arch/powerpc/kernel/eeh_driver.c@@ -416,7 +416,10 @@ static void *eeh_rmv_device(void *data, void *userdata) driver = eeh_pcid_get(dev); if (driver) { eeh_pcid_put(dev); - if (driver->err_handler) + if (driver->err_handler && + driver->err_handler->error_detected && + driver->err_handler->slot_reset && + driver->err_handler->resume) return NULL; }-- 2.1.0 _______________________________________________ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev