Re: ordering of call to unbind() in usbnet_disconnect
From: Lukas Wunner <lukas@wunner.de>
Date: 2022-03-26 13:01:53
On Sat, Mar 26, 2022 at 01:44:12PM +0100, Andrew Lunn wrote:
On Sat, Mar 26, 2022 at 01:25:52PM +0100, Lukas Wunner wrote:quoted
Oleksij, I cannot reproduce your stacktrace (included in full length below). I've tested with kernel 5.13 (since the stacktrace was with 5.13-rc3) with all your (and other people's) asix patches applied on top, except for 2c9d6c2b871d. Tried unplugging an AX88772A multiple times, never got a stacktrace. I've also walked down the code paths from usbnet_disconnect() and cannot see how the stacktrace could occur. Normally an unregistering netdev is removed from the linkwatch event list (lweventlist) via this call stack: usbnet_disconnect() unregister_netdev() rtnl_unlock() netdev_run_todo() netdev_wait_allrefs() linkwatch_forget_dev() linkwatch_do_dev() For the stacktrace to occur, the netdev would have to be subsequently re-added to the linkwatch event list via linkwatch_fire_event().What you might be missing is a call to phy_error()
But phy_error() has a WARN_ON(1) right at its top. So it produces a stacktrace itself. That stacktrace is nowhere to be seen in the dmesg output Oleksij posted. Hence it can't be caused by phy_error(). Also, recall that unregister_netdev() stops the netdev before unregistering it. That in turn causes an invocation of phy_stop() via ax88772_stop(). phy_stop() already puts the PHY into PHY_HALTED state and resets phydev->link = 0. So a subsequent phy_error() cannot result in a call to phy_link_down() (which would indeed trigger a dangerous linkwatch_fire_event()).
quoted
That is called, among other places, from netif_carrier_off(). However, netif_carrier_off() is already called *before* linkwatch_forget_dev() when unregister_netdev() stops the netdev before unregistering it: usbnet_disconnect() unregister_netdev() unregister_netdevice() unregister_netdevice_queue(dev, NULL) unregister_netdevice_many() dev_close_many() __dev_close_many() usbnet_stop() ax88772_stop() phy_stop() # state = PHY_HALTED phy_state_machine()I'm guessing somewhere around here: If it calls into the PHY driver, and the PHY calls for an MDIO bus transaction, and that returns an error, -ENODEV or -EIO for example, because the USB device has gone away, and that results in a call to phy_error().
Oleksij amended phy_state_machine() to bail out if err == -ENODEV
with commit 06edf1a940be ("net: phy: do not print dump stack if device
was removed"). The commit skips the phy_error() on -ENODEV, which
makes a lot of sense.
Thanks,
Lukas