Re: [PATCH 3/4] powerpc/eeh: Remove workaround from eeh_add_device_late()
From: Sam Bobroff <hidden>
Date: 2020-04-15 06:47:04
On Wed, Apr 08, 2020 at 04:53:36PM +1000, Oliver O'Halloran wrote:
On Wed, Apr 8, 2020 at 4:22 PM Sam Bobroff [off-list ref] wrote:quoted
On Fri, Apr 03, 2020 at 05:08:32PM +1100, Oliver O'Halloran wrote:quoted
On Mon, 2020-03-30 at 15:56 +1100, Sam Bobroff wrote:quoted
When EEH device state was released asynchronously by the device release handler, it was possible for an outstanding reference to prevent it's release and it was necessary to work around that if a device was re-discovered at the same PCI location.I think this is a bit misleading. The main situation where you'll hit this hack is when recovering a device with a driver that doesn't implement the error handling callbacks. In that case the device is removed, reset, then re-probed by the PCI core, but we assume it's the same physical device so the eeh_device state remains active. If you actually changed the underlying device I suspect something bad would happen.I'm not sure I understand. Isn't the case you're talking about caught by the earlier check (just above the patch)? if (edev->pdev == dev) { eeh_edev_dbg(edev, "Device already referenced!\n"); return; }No, in the case I'm talking about the pci_dev is torn down and freed(). After the PE is reset we re-probe the device and create a new pci_dev. If the release of the old pci_dev is delayed we need the hack this patch is removing.
Oh, yes, that is the case I was intending to change here. But I must be missing something, isn't it also the case that's changed by patch 2/4? What I intended was, after patch 2, eeh_remove_device() is called from the bus notifier so it happens imediately when recovery calls pci_stop_and_remove_bus_device(). Once it returns, edev->pdev has already been set to NULL by eeh_remove_device() so this case can't be hit anymore, and we should clean it up (this patch). (There is a slight difference in the way EEH_PE_KEEP is handled between the code removed here and the body of eeh_remove_device(), but checking and explaining that is already on my list for v2.) (I did test recovery on an unaware device and didn't hit the WARN_ON_ONCE().)
The check above should probably be a WARN_ON() since we should never be re-running the EEH probe on the same device. I think there is a case where that can happen, but I don't remember the details.
Yeah, I also certainly see the "Device already referenced!" message while debugging, and it would be good to track down.
Oliver
Attachments
- signature.asc [application/pgp-signature] 488 bytes