Re: ordering of call to unbind() in usbnet_disconnect

From: Andrew Lunn <andrew@lunn.ch>
Date: 2022-03-26 12:44:25

On Sat, Mar 26, 2022 at 01:25:52PM +0100, Lukas Wunner wrote:

On Tue, Mar 15, 2022 at 12:38:41PM +0100, Oleksij Rempel wrote:

quoted

On Tue, Mar 15, 2022 at 09:32:34AM +0100, Lukas Wunner wrote:

quoted

On Thu, Mar 10, 2022 at 12:25:08PM +0100, Oliver Neukum wrote:

quoted

I got bug reports that 2c9d6c2b871d ("usbnet: run unbind() before
unregister_netdev()") is causing regressions.

Is it illegal to disconnect a PHY from an unregistered, but not yet freed
net_device?

Oleksij, the commit message of 2c9d6c2b871d says that disconnecting the
PHY "fails" in that situation.  Please elaborate what the failure looked
like.  Did you get a stacktrace?

Oleksij, I cannot reproduce your stacktrace (included in full length below).
I've tested with kernel 5.13 (since the stacktrace was with 5.13-rc3)
with all your (and other people's) asix patches applied on top,
except for 2c9d6c2b871d.  Tried unplugging an AX88772A multiple times,
never got a stacktrace.

I've also walked down the code paths from usbnet_disconnect() and cannot
see how the stacktrace could occur.

Normally an unregistering netdev is removed from the linkwatch event list
(lweventlist) via this call stack:

          usbnet_disconnect()
            unregister_netdev()
              rtnl_unlock()
                netdev_run_todo()
                  netdev_wait_allrefs()
                    linkwatch_forget_dev()
                      linkwatch_do_dev()

For the stacktrace to occur, the netdev would have to be subsequently
re-added to the linkwatch event list via linkwatch_fire_event().

Hi Lukas

What you might be missing is a call to phy_error()

That is called, among other places, from netif_carrier_off().  However,
netif_carrier_off() is already called *before* linkwatch_forget_dev()
when unregister_netdev() stops the netdev before unregistering it:

          usbnet_disconnect()
            unregister_netdev()
              unregister_netdevice()
                unregister_netdevice_queue(dev, NULL)
                  unregister_netdevice_many()
                    dev_close_many()
                      __dev_close_many()
                        usbnet_stop()
                          ax88772_stop()
                            phy_stop() # state = PHY_HALTED
                              phy_state_machine()

I'm guessing somewhere around here:

If it calls into the PHY driver, and the PHY calls for an MDIO bus
transaction, and that returns an error, -ENODEV or -EIO for example,
because the USB device has gone away, and that results in a call to
phy_error().

void phy_error(struct phy_device *phydev)
{
        WARN_ON(1);

        mutex_lock(&phydev->lock);
        phydev->state = PHY_HALTED;
        mutex_unlock(&phydev->lock);

        phy_trigger_machine(phydev);
}

That will trigger the PHY state machine to run again, asynchronously.

The end of phy_stop() says:

        /* Cannot call flush_scheduled_work() here as desired because
         * of rtnl_lock(), but PHY_HALTED shall guarantee irq handler
         * will not reenable interrupts.
         */

so it looks like the state machine will run again, and potentially use
netdev.

If the MDIO bus driver is no longer returning ENODEV, than we should
avoid this.

      Andrew

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help