Re: pci error recovery procedure
From: Zhang, Yanmin <hidden>
Date: 2006-09-07 01:58:14
Also in: lkml
On Thu, 2006-09-07 at 04:01, Linas Vepstas wrote:
> On Wed, Sep 06, 2006 at 09:26:56AM +0800, Zhang, Yanmin wrote:
> > > > The error_detected of the drivers in the latest kernel who support
> > > > err handlers always returns PCI_ERS_RESULT_NEED_RESET. They are
> > > > typical examples.
> > >
> > > Just because the current drivers do it this way does not mean that
> > > this is the best way to do things.
> >
> > If it's not the best way, why did you choose to reset slot for
> > e1000/e100/ipr error handlers? They are typical widely-used devices.
> > To make it easier to add error handlers?
>
> I did it that way just to get going, get something working. I do not
> have hardware specs for any of these devices, and do not have much of
> an idea of what they are capable of;
Yes, it's difficult for people who are not the driver developers to add fine-grained error handlers.
> the recovery code I wrote is of a "brute force, hit it with a hammer"
> nature. Driver writers who know their hardware well, and are interested
> in a more refined approach, are encouraged to actually use a more
> refined approach.
I guess almost no driver developer is happy to spend lots of time adding refined steps. They would rather focus on the normal code path (for a feeling of achievement? :) ). In addition, if they use fine-grained steps in error handlers, all these steps might have to be rewritten when the device spec is upgraded. Fine-grained steps in error handlers are also more difficult to debug. It's impossible for you to develop error handlers for all device drivers.

The error handlers look a little like suspend/resume. Of course, they are more complicated. If we could keep them as simple as suspend/resume, they would be more welcome.

PCI errors shouldn't happen frequently. And when one happens, I think it is mostly on an endpoint device rather than a bridge. If we choose to always reset the slot when it happens, performance could be degraded, but not by much. This is only my deduction; I haven't tested it on a machine with hundreds of devices.
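
For illustration, the coarse-grained "always reset the slot" flow discussed above can be sketched in user-space C. This is a minimal mock under stated assumptions, not code taken from e1000, e100 or ipr. PCI_ERS_RESULT_NEED_RESET, PCI_ERS_RESULT_RECOVERED and the error_detected/slot_reset/resume callback names come from the kernel's PCI error recovery interface; struct mock_dev, struct mock_error_handlers, recover_device() and the simplified enum are hypothetical stand-ins so the example builds outside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's result codes (assumption: only the
 * names mirror the kernel; the numeric values here are arbitrary). */
enum pci_ers_result {
    PCI_ERS_RESULT_NEED_RESET,
    PCI_ERS_RESULT_RECOVERED,
    PCI_ERS_RESULT_DISCONNECT
};

/* Hypothetical device state, just enough to observe the recovery steps. */
struct mock_dev {
    int running;      /* 1 while normal I/O is allowed */
    int reset_count;  /* number of times slot_reset re-initialized us */
};

/* Hypothetical callback table, shaped like a driver's error handlers. */
struct mock_error_handlers {
    enum pci_ers_result (*error_detected)(struct mock_dev *dev);
    enum pci_ers_result (*slot_reset)(struct mock_dev *dev);
    void (*resume)(struct mock_dev *dev);
};

/* Brute-force policy: stop I/O and always ask for a slot reset. */
static enum pci_ers_result mock_error_detected(struct mock_dev *dev)
{
    assert(dev != NULL);
    dev->running = 0;
    return PCI_ERS_RESULT_NEED_RESET;
}

/* After the slot reset, re-initialize the device from scratch. */
static enum pci_ers_result mock_slot_reset(struct mock_dev *dev)
{
    assert(dev != NULL);
    dev->reset_count++;
    return PCI_ERS_RESULT_RECOVERED;
}

/* Restart normal operation, much like a resume-from-suspend path. */
static void mock_resume(struct mock_dev *dev)
{
    assert(dev != NULL);
    dev->running = 1;
}

const struct mock_error_handlers mock_err_handlers = {
    .error_detected = mock_error_detected,
    .slot_reset     = mock_slot_reset,
    .resume         = mock_resume,
};

/* Hypothetical platform-side sequence: notify, reset, resume.
 * Returns 0 if the device came back, -1 otherwise. */
int recover_device(const struct mock_error_handlers *h, struct mock_dev *dev)
{
    assert(h != NULL && dev != NULL);
    if (h->error_detected(dev) != PCI_ERS_RESULT_NEED_RESET)
        return -1;
    /* A real platform would reset the slot hardware at this point. */
    if (h->slot_reset(dev) != PCI_ERS_RESULT_RECOVERED)
        return -1;
    h->resume(dev);
    return 0;
}
```

Calling recover_device(&mock_err_handlers, &dev) on a running device walks through all three callbacks in order. The driver side stays about as small as a suspend/resume pair, at the cost of a full re-initialization on every error.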