Re: [PATCH][v2] iommu: arm-smmu-v3: Copy SMMU table for kdump kernel | linux-arm-kernel

quoted

Hi Bjorn,

On Thu, Jun 4, 2020 at 5:32 AM Bjorn Helgaas [off-list ref] wrote:
On Wed, Jun 03, 2020 at 11:12:48PM +0530, Prabhakar Kushwaha wrote:
On Sat, May 30, 2020 at 1:03 AM Bjorn Helgaas [off-list ref] wrote:
On Fri, May 29, 2020 at 07:48:10PM +0530, Prabhakar Kushwaha wrote:
On Thu, May 28, 2020 at 1:48 AM Bjorn Helgaas [off-list ref] wrote:
On Wed, May 27, 2020 at 05:14:39PM +0530, Prabhakar Kushwaha wrote:
On Fri, May 22, 2020 at 4:19 AM Bjorn Helgaas [off-list ref] wrote:
On Thu, May 21, 2020 at 09:28:20AM +0530, Prabhakar Kushwaha wrote:
On Wed, May 20, 2020 at 4:52 AM Bjorn Helgaas [off-list ref] wrote:
On Thu, May 14, 2020 at 12:47:02PM +0530, Prabhakar Kushwaha wrote:
On Wed, May 13, 2020 at 3:33 AM Bjorn Helgaas [off-list ref] wrote:
On Mon, May 11, 2020 at 07:46:06PM -0700, Prabhakar Kushwaha wrote:
An SMMU Stream table is created by the primary kernel. This table is
used by the SMMU to perform address translations for device-originated
transactions. Any crash (if happened) launches the kdump kernel which
re-creates the SMMU Stream table. New transactions will be translated
via this new table..

There are scenarios, where devices are still having old pending
transactions (configured in the primary kernel). These transactions
come in-between Stream table creation and device-driver probe.
As new stream table does not have entry for older transactions,
it will be aborted by SMMU.

Similar observations were found with PCIe-Intel 82576 Gigabit
Network card. It sends old Memory Read transaction in kdump kernel.
Transactions configured for older Stream table entries, that do not
exist any longer in the new table, will cause a PCIe Completion Abort.
That sounds like exactly what we want, doesn't it?

Or do you *want* DMA from the previous kernel to complete?  That will
read or scribble on something, but maybe that's not terrible as long
as it's not memory used by the kdump kernel.
Yes, Abort should happen. But it should happen in context of driver.
But current abort is happening because of SMMU and no driver/pcie
setup present at this moment.
I don't understand what you mean by "in context of driver."  The whole
problem is that we can't control *when* the abort happens, so it may
happen in *any* context.  It may happen when a NIC receives a packet
or at some other unpredictable time.

Solution of this issue should be at 2 place
a) SMMU level: I still believe, this patch has potential to overcome
issue till finally driver's probe takeover.
b) Device level: Even if something goes wrong. Driver/device should
able to recover.

Returned PCIe completion abort further leads to AER Errors from APEI
Generic Hardware Error Source (GHES) with completion timeout.
A network device hang is observed even after continuous
reset/recovery from driver, Hence device is no more usable.
The fact that the device is no longer usable is definitely a problem.
But in principle we *should* be able to recover from these errors.  If
we could recover and reliably use the device after the error, that
seems like it would be a more robust solution that having to add
special cases in every IOMMU driver.

If you have details about this sort of error, I'd like to try to fix
it because we want to recover from that sort of error in normal
(non-crash) situations as well.
Completion abort case should be gracefully handled.  And device should
always remain usable.

There are 2 scenario which I am testing with Ethernet card PCIe-Intel
82576 Gigabit Network card.

I)  Crash testing using kdump root file system: De-facto scenario
    -  kdump file system does not have Ethernet driver
    -  A lot of AER prints [1], making it impossible to work on shell
of kdump root file system.
In this case, I think report_error_detected() is deciding that because
the device has no driver, we can't do anything.  The flow is like
this:

  aer_recover_work_func               # aer_recover_work
    kfifo_get(aer_recover_ring, entry)
    dev = pci_get_domain_bus_and_slot
    cper_print_aer(dev, ...)
      pci_err("AER: aer_status:")
      pci_err("AER:   [14] CmpltTO")
      pci_err("AER: aer_layer=")
    if (AER_NONFATAL)
      pcie_do_recovery(dev, pci_channel_io_normal)
        status = CAN_RECOVER
        pci_walk_bus(report_normal_detected)
          report_error_detected
            if (!dev->driver)
              vote = NO_AER_DRIVER
              pci_info("can't recover (no error_detected callback)")
            *result = merge_result(*, NO_AER_DRIVER)
            # always NO_AER_DRIVER
        status is now NO_AER_DRIVER

So pcie_do_recovery() does not call .report_mmio_enabled() or .slot_reset(),
and status is not RECOVERED, so it skips .resume().

I don't remember the history there, but if a device has no driver and
the device generates errors, it seems like we ought to be able to
reset it.
But how to reset the device considering there is no driver.
Hypothetically, this case should be taken care by PCIe subsystem to
perform reset at PCIe level.
I don't understand your question.  The PCI core (not the device
driver) already does the reset.  When pcie_do_recovery() calls
reset_link(), all devices on the other side of the link are reset.

We should be able to field one (or a few) AER errors, reset the
device, and you should be able to use the shell in the kdump kernel.
here kdump shell is usable only problem is a "lot of AER Errors". One
cannot see what they are typing.
Right, that's what I expect.  If the PCI core resets the device, you
should get just a few AER errors, and they should stop after the
device is reset.

    -  Note kdump shell allows to use makedumpfile, vmcore-dmesg applications.

II) Crash testing using default root file system: Specific case to
test Ethernet driver in second kernel
   -  Default root file system have Ethernet driver
   -  AER error comes even before the driver probe starts.
   -  Driver does reset Ethernet card as part of probe but no success.
   -  AER also tries to recover. but no success.  [2]
   -  I also tries to remove AER errors by using "pci=noaer" bootargs
and commenting ghes_handle_aer() from GHES driver..
          than different set of errors come which also never able to recover [3]
Please suggest your view on this case. Here driver is preset.
(driver/net/ethernet/intel/igb/igb_main.c)
In this case AER errors starts even before driver probe starts.
After probe, driver does the device reset with no success and even AER
recovery does not work.
This case should be the same as the one above.  If we can change the
PCI core so it can reset the device when there's no driver,  that would
apply to case I (where there will never be a driver) and to case II
(where there is no driver now, but a driver will probe the device
later).
Does this means change are required in PCI core.
Yes, I am suggesting that the PCI core does not do the right thing
here.

I tried following changes in pcie_do_recovery() but it did not help.
Same error as before.

-- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
        pci_info(dev, "broadcast resume message\n");
        pci_walk_bus(bus, report_resume, &status);
@@ -203,7 +207,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
        return status;

 failed:
        pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
+       pci_reset_function(dev);
+       pci_aer_clear_device_status(dev);
+       pci_aer_clear_nonfatal_status(dev);
Did you confirm that this resets the devices in question (0000:09:00.0
and 0000:09:00.1, I think), and what reset mechanism this uses (FLR,
PM, etc)?
Earlier reset  was happening with P2P bridge(0000:00:09.0) this the
reason no effect. After making following changes,  both devices are
now getting reset.
Both devices are using FLR.

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 117c0a2b2ba4..26b908f55aef 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -66,6 +66,20 @@ static int report_error_detected(struct pci_dev *dev,
                if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
                        vote = PCI_ERS_RESULT_NO_AER_DRIVER;
                        pci_info(dev, "can't recover (no
error_detected callback)\n");
+
+                       pci_save_state(dev);
+                       pci_cfg_access_lock(dev);
+
+                       /* Quiesce the device completely */
+                       pci_write_config_word(dev, PCI_COMMAND,
+                             PCI_COMMAND_INTX_DISABLE);
+                       if (!__pci_reset_function_locked(dev)) {
+                               vote = PCI_ERS_RESULT_RECOVERED;
+                               pci_info(dev, "recovered via pci level
reset\n");
+                       }
Why do we need to save the state and quiesce the device?  The reset
should disable interrupts anyway.  In this particular case where
there's no driver, I don't think we should have to restore the state.
We maybe should *remove* the device and re-enumerate it after the
reset, but the state from before the reset should be irrelevant.
I tried pci_reset_function_locked without save/restore then I got the
synchronous abort during igb_probe (case 2 i.e. with driver). This is
100% reproducible.
looks like pci_reset_function_locked is causing PCI configuration
space random. Same is mentioned here
https://www.kernel.org/doc/html/latest/driver-api/pci/pci.html
That documentation is poorly worded.  A reset doesn't make the
contents of config space "random," but of course it sets config space
registers to their initialization values, including things like the
device BARs.  After a reset, the device BARs are zero, so it won't
respond at the address we expect, and I'm sure that's what's causing
the external abort.

So I guess we *do* need to save the state before the reset and restore
it (either that or enumerate the device from scratch just like we
would if it had been hot-added).  I'm not really thrilled with trying
to save the state after the device has already reported an error.  I'd
rather do it earlier, maybe during enumeration, like in
pci_init_capabilities().  But I don't understand all the subtleties of
dev->state_saved, so that requires some legwork.
I tried moving pci_save_state earlier. All observations are the same
as mentioned in earlier discussions.

Some modifications are required in pci_restore_state() as by default
it makes dev->state_saved = false after restore. .
So the next AER causes the earlier mentioned
crash(igb_get_invariants_82575 --> igb_rd32).  It is because
pci_restore_state() returns without restoring any state.

Code changes are below [1]

I don't think we should set INTX_DISABLE; the reset will make whatever
we do with it irrelevant anyway.
Yes.. It is not required.

Remind me why the pci_cfg_access_lock()?
I thought of the race conditions between AER (save/restore) and
igb_probe. So I added this.
It is not required as lock is inherently "taken care" in both AER (bus
walk) and igb_probe by the framework.

[1]
root@localhost$ git diff
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 595fcf59843f..35396eb4fd9e 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1537,11 +1537,7 @@ static void pci_restore_rebar_state(struct pci_dev *pdev)
        }
 }

-/**
- * pci_restore_state - Restore the saved state of a PCI device
- * @dev: PCI device that we're dealing with
- */
-void pci_restore_state(struct pci_dev *dev)
+void __pci_restore_state(struct pci_dev *dev, int retain_state)
 {
        if (!dev->state_saved)
                return;
@@ -1572,10 +1568,26 @@ void pci_restore_state(struct pci_dev *dev)
        pci_enable_acs(dev);
        pci_restore_iov_state(dev);

-       dev->state_saved = false;
+       if (!retain_state)
+               dev->state_saved = false;
+}
+
+/**
+ * pci_restore_state - Restore the saved state of a PCI device
+ * @dev: PCI device that we're dealing with
+ */
+void pci_restore_state(struct pci_dev *dev)
+{
+       __pci_restore_state(dev, 0);
 }
 EXPORT_SYMBOL(pci_restore_state);

+void pci_restore_retain_state(struct pci_dev *dev)
+{
+       __pci_restore_state(dev, 1);
+}
+EXPORT_SYMBOL(pci_restore_retain_state);
+
 struct pci_saved_state {
        u32 config_space[16];
        struct pci_cap_saved_data cap[0];
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 14bb8f54723e..621eaa34bf9f 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -66,6 +66,13 @@ static int report_error_detected(struct pci_dev *dev,
                if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
                        vote = PCI_ERS_RESULT_NO_AER_DRIVER;
                        pci_info(dev, "can't recover (no
error_detected callback)\n");
+
+                       if (!__pci_reset_function_locked(dev)) {
+                               vote = PCI_ERS_RESULT_RECOVERED;
+                               pci_info(dev, "Recovered via pci level
reset\n");
+                       }
+
+                       pci_restore_retain_state(dev);
                } else {
                        vote = PCI_ERS_RESULT_NONE;
                }
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 77b8a145c39b..af4e27c95421 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2448,6 +2448,8 @@ void pci_device_add(struct pci_dev *dev, struct
pci_bus *bus)

        pci_init_capabilities(dev);

+       pci_save_state(dev);
+
        /*
         * Add the device to our list of discovered devices
         * and the bus list for fixup functions, etc.
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 83ce1cdf5676..42ab7ef850b7 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1234,6 +1234,7 @@ void pci_unmap_rom(struct pci_dev *pdev, void
__iomem *rom);

 /* Power management related routines */
 int pci_save_state(struct pci_dev *dev);
+void pci_restore_retain_state(struct pci_dev *dev);
 void pci_restore_state(struct pci_dev *dev);
 struct pci_saved_state *pci_store_saved_state(struct pci_dev *dev);
 int pci_load_saved_state(struct pci_dev *dev,

--pk

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help