Thread: 32 messages, 4 authors, 2014-05-27

Re: [PATCH v6 2/3] drivers/vfio: EEH support for VFIO PCI device

From: Gavin Shan <hidden>
Date: 2014-05-24 02:06:25

On Fri, May 23, 2014 at 08:29:59AM -0600, Alex Williamson wrote:
On Fri, 2014-05-23 at 14:37 +1000, Gavin Shan wrote:
On Thu, May 22, 2014 at 09:10:53PM -0600, Alex Williamson wrote:
On Thu, 2014-05-22 at 18:23 +1000, Gavin Shan wrote:
.../...
> No, sorry, I mean how does the user get information about the error?
> The interface we have here is:
> a) find that something bad has happened
> b) kick it into working again
> c) continue
>
> How does the user figure out what happened and if it makes sense to
> attempt to recover?  Where does the user learn that their disk is on
> fire?
When a config or IO read returns all 0xFF's, the user should check the
device (PE) state with the ioctl command VFIO_EEH_PE_GET_STATE. If the
device (PE) has been put into the "frozen" state, that confirms the device
(the "disk" you mentioned) is on fire. The user should then kick off
recovery, which includes:

- User stops all operations (config, IO, DMA) on the device, because any
  PCI traffic to a "frozen" device is dropped at the software or hardware
  level. Also, we don't expect DMA traffic during recovery; otherwise
  we will hit recursive errors and the recovery will fail.
- VFIO_EEH_PE_SET_OPTION to enable the I/O path (the "DMA" path remains
  frozen). EEH_VFIO_PE_CONFIGURE to reconfigure the affected PCI bridges,
  then retrieve the error log.
- VFIO_EEH_PE_RESET to reset the affected device (PE). EEH_VFIO_PE_CONFIGURE
  to restore BARs.
- User resumes the device to restart PCI traffic, and the device is back in
  a functional state.
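
To make the ordering of those steps concrete, here is a rough sketch that
models the sequence against a mock PE in plain C. It is only an
illustration: the state names, struct mock_pe and the pe_*() helpers are
all hypothetical stand-ins for the ioctl commands named above, not code
from the patch, and nothing here touches a real VFIO fd.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical PE states; stand-ins for what GET_STATE would report. */
enum pe_state { PE_NORMAL, PE_FROZEN, PE_IO_ENABLED, PE_RESET_DONE };

/* Mock PE. In real use each helper below would be an ioctl() on a VFIO fd. */
struct mock_pe {
    enum pe_state state;
    bool bridges_configured;
    bool bars_restored;
};

static enum pe_state pe_get_state(const struct mock_pe *pe)
{
    return pe->state;
}

/* Models SET_OPTION: enable the I/O path; DMA stays frozen. */
static int pe_enable_io(struct mock_pe *pe)
{
    if (pe->state != PE_FROZEN)
        return -1;
    pe->state = PE_IO_ENABLED;
    return 0;
}

/* Models CONFIGURE: bridges before reset, BARs after reset. */
static int pe_configure(struct mock_pe *pe)
{
    if (pe->state == PE_IO_ENABLED) {
        pe->bridges_configured = true;
        return 0;
    }
    if (pe->state == PE_RESET_DONE) {
        pe->bars_restored = true;
        return 0;
    }
    return -1;
}

/* Models RESET of the affected PE. */
static int pe_reset(struct mock_pe *pe)
{
    if (pe->state != PE_IO_ENABLED)
        return -1;
    pe->state = PE_RESET_DONE;
    return 0;
}

/*
 * Returns 0 if the PE is usable afterwards, -1 if a step failed.
 * The caller is assumed to have stopped config/IO/DMA already.
 */
int recover_pe(struct mock_pe *pe, unsigned int read_value)
{
    assert(pe);
    if (read_value != 0xFFFFFFFFu)
        return 0;               /* read looks sane, nothing to do */
    if (pe_get_state(pe) != PE_FROZEN)
        return 0;               /* all-ones was genuine data */
    if (pe_enable_io(pe))
        return -1;
    if (pe_configure(pe))       /* bridges; error log would be read here */
        return -1;
    if (pe_reset(pe))
        return -1;
    if (pe_configure(pe))       /* restore BARs */
        return -1;
    pe->state = PE_NORMAL;      /* user resumes PCI traffic */
    return 0;
}
```

Note that an all-ones read alone does not trigger recovery in this sketch;
the PE state check is what confirms the freeze.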

.../...
> No, I prefer to stay consistent with the rest of the VFIO API and use
> argsz + flags.
Here's a recap of my previous reply: I have several cases for ioctl().

- ioctl(fd, cmd, NULL):   I don't need any input info.
- ioctl(fd, cmd, &data):  I need input info.

For all of these cases, should I simply have a data struct that includes "argsz+flags"?

For the return value from ioctl(), can we simply have an additional field in the
above data struct to carry it? "0" is information I have to return in
some of the cases.
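
As a strawman for that question, something like the following is what I
have in mind. The struct layout, the field names and the
eeh_pe_get_state() handler are all made up for illustration (they are
not from the patch), and the handler is a userspace mock of the
kernel-side argsz/flags checks rather than a real ioctl path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: argsz + flags first, then input and output fields. */
struct eeh_pe_op {
    uint32_t argsz;   /* size of the struct as filled in by the caller */
    uint32_t flags;   /* no flags defined in this sketch; must be zero */
    uint32_t option;  /* input, only used by commands that need one */
    uint32_t result;  /* output, so that "0" can be a meaningful value */
};

/*
 * Mock of the kernel side of ioctl(fd, cmd, &data).
 * current_state stands in for whatever the platform would report.
 * Returns 0 on success or a negative errno; the payload goes in op->result.
 */
int eeh_pe_get_state(struct eeh_pe_op *op, uint32_t current_state)
{
    /* Smallest struct this handler understands. */
    size_t minsz = offsetof(struct eeh_pe_op, result) + sizeof(op->result);

    if (!op || op->argsz < minsz)
        return -EINVAL;
    if (op->flags)
        return -EINVAL;     /* unknown flags are rejected */

    op->result = current_state;
    return 0;
}
```

With this shape the ioctl return value only says whether the call worked,
and a state of "0" comes back in the result field without being confused
with success.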


.../...
> As agraf noted, I'm asking why reset and configure are separate when
> they seem to be used together.
OK, to recap: they're 2 separate steps of error recovery as
defined in the PAPR spec. They also correspond to 2 separate RTAS calls,
so I don't think we can put them together.

Thanks,
Gavin