Thread: 32 messages, 4 authors, 2014-05-27

Re: [PATCH v6 2/3] drivers/vfio: EEH support for VFIO PCI device

From: Alex Williamson <hidden>
Date: 2014-05-23 14:37:05

On Fri, 2014-05-23 at 15:00 +1000, Benjamin Herrenschmidt wrote:
On Fri, 2014-05-23 at 14:37 +1000, Gavin Shan wrote:
There's no notification, the user needs to observe the return value and
poll?  Should we be enabling an eventfd to notify the user of the state
change?
Yes. The user needs to monitor the return value. We should have one notification,
but it's for later as we discussed :-)
 ../..
How does the guest learn about the error?  Does it need to?
When the guest detects 0xFF's from reading PCI config space or IO, it's going
to check the device (PE) state. If the device (PE) has been put into frozen
state, the recovery will be started.
Quick recap for Alex W (we discussed that with Alex G).

While a notification looks like a worthwhile addition in the long run, it
is not sufficient on its own and nothing uses it today, so I prefer that we
keep it as something to add later, for two main reasons:

 - First, the kernel itself isn't always notified. For example, if we implement
on top of an RTAS backend (PR KVM under pHyp) or if we are on top of PowerNV but
the error is a PHB "fence" (the entire PCI Host bridge gets fenced out in hardware
due to an internal error), then we get no notification. Only polling of the
hardware or firmware will tell us. Since we don't want to have a polling timer
in the kernel, that means that the userspace client of VFIO (or alternatively
the KVM guest) is the one that polls.

 - Second, this is how our primary user expects it: The primary (and only initial)
user of this will be qemu/KVM for PAPR guests and they don't have a notification
mechanism. Instead they query the EEH state after detecting an all 1's return from
MMIO or config space. This is how PAPR specifies it so we are just implementing the
spec here :-)

Because of these, I think we shouldn't worry too much about notification at
this stage.
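
As an illustration only, the poll-on-all-1's flow described above could be sketched as follows. This is a self-contained simulation, not code from the patch: `struct fake_dev`, `dev_read32()`, `query_pe_state()`, `recover_pe()` and `read_checked()` are hypothetical names, and the "recovery" step is a placeholder that merely clears the frozen flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical PE states; not taken from the patch under review. */
enum pe_state { PE_STATE_NORMAL, PE_STATE_FROZEN };

/* Simulated device standing in for real hardware plus firmware. */
struct fake_dev {
	enum pe_state state;	/* what a state query would report */
	uint32_t reg;		/* register value while the PE is healthy */
};

/* A frozen PE returns all 1's on reads, as described in the thread. */
static uint32_t dev_read32(const struct fake_dev *dev)
{
	return dev->state == PE_STATE_FROZEN ? UINT32_MAX : dev->reg;
}

/* Stand-in for the user-initiated EEH state query (assumed interface). */
static enum pe_state query_pe_state(const struct fake_dev *dev)
{
	return dev->state;
}

/* Placeholder for recovery: the simulation only unfreezes the PE. */
static void recover_pe(struct fake_dev *dev)
{
	dev->state = PE_STATE_NORMAL;
}

/*
 * Read a register and query the PE state only when the value is all 1's.
 * Returns true if *val can be trusted. Returns false if the PE was found
 * frozen; recovery has then been run and the caller should retry.
 */
static bool read_checked(struct fake_dev *dev, uint32_t *val, bool *recovered)
{
	*recovered = false;
	*val = dev_read32(dev);

	if (*val != UINT32_MAX)
		return true;	/* common case: no state query needed */

	if (query_pe_state(dev) != PE_STATE_FROZEN)
		return true;	/* register genuinely holds all 1's */

	recover_pe(dev);
	*recovered = true;
	return false;
}
```

The point of the sketch is that the state query sits off the fast path: it is issued only after a suspicious read, which is why no kernel-side polling timer or notification is needed for this use case.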
Ok, I was asking more about an error log that indicates what error
occurred to freeze the hardware so that the user can make a more
educated guess whether recovery is an option.  Given that you have cases
where there may be no notification and your guest/user already handles
this, the plan to start with polling makes sense.  Thanks,

Alex