Thread (21 messages), 3 authors, 2006-09-12

Re: pci error recovery procedure

From: Zhang, Yanmin <hidden>
Date: 2006-09-05 02:34:09
Also in: lkml

On Mon, 2006-09-04 at 17:03, Benjamin Herrenschmidt wrote:
>> As you know, all functions of a device share the same bus number and 5-bit device number.
>> They differ only in the 3-bit function number, so we can deduce whether functions are in
>> the same device (slot).
> Until you have a P2P bridge ...
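
The bus/device/function layout mentioned above can be sketched in plain user-space C. This is only an illustration: `struct bdf`, `slot_of`, `func_of`, `make_devfn` and `same_slot` are hypothetical names, and the packing (5-bit device number in bits 7:3, 3-bit function number in bits 2:0) is assumed to match the kernel's PCI_SLOT()/PCI_FUNC() macros. As the reply points out, the comparison stops being sufficient once the card carries a P2P bridge, because devices behind the bridge sit on a different bus number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI address: bus number plus packed devfn. */
struct bdf {
    uint8_t bus;
    uint8_t devfn;  /* bits 7:3 = device (slot), bits 2:0 = function */
};

/* Extract the 5-bit device (slot) number. */
static unsigned slot_of(uint8_t devfn) { return (devfn >> 3) & 0x1f; }

/* Extract the 3-bit function number. */
static unsigned func_of(uint8_t devfn) { return devfn & 0x07; }

/* Pack a slot/function pair into one devfn byte. */
static uint8_t make_devfn(unsigned slot, unsigned func)
{
    assert(slot < 32 && func < 8);
    return (uint8_t)((slot << 3) | func);
}

/* Two functions belong to the same device if bus and slot both match.
 * This does not see through a P2P bridge on the card. */
static bool same_slot(struct bdf a, struct bdf b)
{
    return a.bus == b.bus && slot_of(a.devfn) == slot_of(b.devfn);
}
```

For example, bus 2, slot 4, functions 0 and 3 compare as the same slot, while bus 3, slot 4, function 0 does not.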
>> Thanks. Now I understand why you specified mmio_enabled and slot_reset. They just
>> map to the pSeries platform hardware operation steps. I know little about pSeries hardware,
>> but is it possible to merge such hardware steps from a software point of view?
> One of the ideas we had when defining those steps is to be precise
> enough to let drivers that _can_ deal with those fine-grained pSeries
> steps implement them, but also have the fallback to slot reset whenever
> possible.
>
> Now, if in practice, after actually implementing this in a number of
> drivers, we see that slot reset is the only ever used path, then we
> might want to simplify things a bit. I didn't want to impose that
> restriction in the initial design though.
Thanks for your explanation. Now is the time to delete mmio_enabled
and merge slot_reset with resume.
> It's my understanding that doing no slot reset (hardware reset) but just
> re-enabling MMIO, DMA and clearing pending error status in the PCI
> config space is, as far as the driver is concerned, almost functionally
> equivalent to a PCIe link reset. That is, the link reset might not (or
> will not) actually reset the hardware beyond the PCIe link layer.
I agree.
> Thus we could simplify the split between link reset / hard reset. The
> former is an attempt at recovery with only resetting the PCI path to the
> device, which on PCIe becomes a link reset, and on old PCI, just
> clearing of the various error bits along the path (and on pSeries,
> re-enabling MMIO and DMA access). However, there is still the problem
> that if you do that, on pSeries at least, you really want to 1- enable
> MMIO, 2- soft reset the card using MMIO, that is make sure all pending
> DMA is stopped, and 3- re-enable DMA. While if we collapse that into a
> single 'link reset' type of operation, we'll end up re-enabling MMIO and
> DMA before the driver has a chance to stop pending DMA's
Is that the only reason to have multiple steps?

1) Here link reset and hard reset are hardware operations, not the
link_reset and slot_reset callbacks in pci_error_handlers.

2) The error_detected callback notifies drivers that there are PCI errors.
Drivers shouldn't do any I/O in error_detected.

3) If both the link and the slot are reset after all error_detected callbacks
have been called, the device should return to its initial state and all DMA
should be stopped automatically. Why does the driver still need a chance to
stop DMA? In the latest kernel, the error_detected callbacks of the drivers
that support error handlers always return PCI_ERS_RESULT_NEED_RESET. They are
typical examples.
> and thus
> increase the chance that we crap out due to a pending DMA on the chip.
>
> Ben.
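
The ordering concern in the quoted text (1- enable MMIO, 2- soft reset over MMIO, 3- re-enable DMA) can be illustrated with a toy user-space model. Everything here is hypothetical: `struct toy_dev` and the `platform_*`, `driver_soft_reset` and `recover_*` functions are invented for the sketch and are not kernel or pSeries firmware interfaces. The model only covers the path with no hardware reset; it assumes the card keeps a stale transfer queued across the error, which is exactly the assumption that point 3 above questions for the case where the slot really is reset.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a card after a bus error (all names hypothetical). */
struct toy_dev {
    bool mmio_on;      /* platform allows MMIO to the card */
    bool dma_on;       /* platform allows DMA from the card */
    bool dma_pending;  /* transfer still queued from before the error */
    int  stray_dma;    /* transfers that fired outside driver control */
};

static void platform_enable_mmio(struct toy_dev *d) { d->mmio_on = true; }

static void platform_enable_dma(struct toy_dev *d)
{
    d->dma_on = true;
    if (d->dma_pending) {       /* stale transfer fires once DMA is unfrozen */
        d->stray_dma++;
        d->dma_pending = false;
    }
}

/* Driver soft reset over MMIO: requires MMIO, cancels queued DMA. */
static void driver_soft_reset(struct toy_dev *d)
{
    assert(d->mmio_on);
    d->dma_pending = false;
}

/* Staged recovery: the driver quiesces the card before DMA comes back. */
static int recover_staged(struct toy_dev *d)
{
    platform_enable_mmio(d);
    driver_soft_reset(d);
    platform_enable_dma(d);
    return d->stray_dma;
}

/* Collapsed recovery: MMIO and DMA come back together, driver runs after. */
static int recover_collapsed(struct toy_dev *d)
{
    platform_enable_mmio(d);
    platform_enable_dma(d);
    driver_soft_reset(d);
    return d->stray_dma;
}
```

Starting from a device with `dma_pending` set, the staged path reports no stray transfer, while the collapsed path reports one.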
The above discussion is only about whether mmio_enabled is needed.
As for slot_reset, I think it could be merged with resume, because platforms
do nothing between calling slot_reset and resume. Any comments?
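
The reset path under discussion can be sketched as a small user-space dispatcher. This is a simplified sketch, not the kernel code: `enum ers_result`, `struct toy_handlers`, `struct toy_drv` and the `toy_*` functions are hypothetical stand-ins loosely modelled on pci_error_handlers, and the sketch handles a single driver only. It records the order of calls so the gap between slot_reset and resume is visible.

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-in for the recovery result codes (names illustrative). */
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED };

/* Hypothetical driver state: a trace of which steps ran, in order. */
struct toy_drv { char trace[16]; };

/* Hypothetical subset of the error-handler callbacks. */
struct toy_handlers {
    enum ers_result (*error_detected)(struct toy_drv *drv);
    enum ers_result (*slot_reset)(struct toy_drv *drv);
    void (*resume)(struct toy_drv *drv);
};

/* Append one step marker to the trace, never overflowing the buffer. */
static void note(struct toy_drv *t, const char *s)
{
    strncat(t->trace, s, sizeof(t->trace) - strlen(t->trace) - 1);
}

/* No I/O here: just ask for a reset, as the drivers cited above do. */
static enum ers_result toy_error_detected(struct toy_drv *t) { note(t, "D"); return ERS_NEED_RESET; }
/* Re-initialise the device after the hardware reset. */
static enum ers_result toy_slot_reset(struct toy_drv *t) { note(t, "S"); return ERS_RECOVERED; }
/* Restart normal operation. */
static void toy_resume(struct toy_drv *t) { note(t, "R"); }

/* Stand-in for the platform's hardware reset of the slot. */
static void platform_hw_reset(struct toy_drv *t) { note(t, "h"); }

static enum ers_result toy_recover(const struct toy_handlers *h, struct toy_drv *t)
{
    enum ers_result r = h->error_detected(t);
    if (r != ERS_NEED_RESET)
        return r;
    platform_hw_reset(t);
    r = h->slot_reset(t);
    if (r != ERS_RECOVERED)
        return r;
    /* In this sketch the platform does nothing between the two callbacks. */
    h->resume(t);
    return ERS_RECOVERED;
}
```

With the three toy callbacks installed, the recorded trace is "DhSR": error_detected, hardware reset, slot_reset, resume. The only thing separating the last two steps here is the check of slot_reset's return value.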

Yanmin