Re: ordering of call to unbind() in usbnet_disconnect
From: Andrew Lunn <andrew@lunn.ch>
Date: 2022-03-26 12:44:25
On Sat, Mar 26, 2022 at 01:25:52PM +0100, Lukas Wunner wrote:
On Tue, Mar 15, 2022 at 12:38:41PM +0100, Oleksij Rempel wrote:quoted
On Tue, Mar 15, 2022 at 09:32:34AM +0100, Lukas Wunner wrote:quoted
quoted
quoted
quoted
quoted
On Thu, Mar 10, 2022 at 12:25:08PM +0100, Oliver Neukum wrote:quoted
I got bug reports that 2c9d6c2b871d ("usbnet: run unbind() before unregister_netdev()") is causing regressions.Is it illegal to disconnect a PHY from an unregistered, but not yet freed net_device? Oleksij, the commit message of 2c9d6c2b871d says that disconnecting the PHY "fails" in that situation. Please elaborate what the failure looked like. Did you get a stacktrace?Oleksij, I cannot reproduce your stacktrace (included in full length below). I've tested with kernel 5.13 (since the stacktrace was with 5.13-rc3) with all your (and other people's) asix patches applied on top, except for 2c9d6c2b871d. Tried unplugging an AX88772A multiple times, never got a stacktrace. I've also walked down the code paths from usbnet_disconnect() and cannot see how the stacktrace could occur. Normally an unregistering netdev is removed from the linkwatch event list (lweventlist) via this call stack: usbnet_disconnect() unregister_netdev() rtnl_unlock() netdev_run_todo() netdev_wait_allrefs() linkwatch_forget_dev() linkwatch_do_dev() For the stacktrace to occur, the netdev would have to be subsequently re-added to the linkwatch event list via linkwatch_fire_event().
Hi Lukas What you might be missing is a call to phy_error()
That is called, among other places, from netif_carrier_off(). However,
netif_carrier_off() is already called *before* linkwatch_forget_dev()
when unregister_netdev() stops the netdev before unregistering it:
usbnet_disconnect()
unregister_netdev()
unregister_netdevice()
unregister_netdevice_queue(dev, NULL)
unregister_netdevice_many()
dev_close_many()
__dev_close_many()
usbnet_stop()
ax88772_stop()
phy_stop() # state = PHY_HALTED
phy_state_machine()
I'm guessing somewhere around here:
If it calls into the PHY driver, and the PHY calls for an MDIO bus
transaction, and that returns an error, -ENODEV or -EIO for example,
because the USB device has gone away, and that results in a call to
phy_error().
void phy_error(struct phy_device *phydev)
{
WARN_ON(1);
mutex_lock(&phydev->lock);
phydev->state = PHY_HALTED;
mutex_unlock(&phydev->lock);
phy_trigger_machine(phydev);
}
That will trigger the PHY state machine to run again, asynchronously.
The end of phy_stop() says:
/* Cannot call flush_scheduled_work() here as desired because
* of rtnl_lock(), but PHY_HALTED shall guarantee irq handler
* will not reenable interrupts.
*/
so it looks like the state machine will run again, and potentially use
netdev.
If the MDIO bus driver is no longer returning ENODEV, than we should
avoid this.
Andrew