Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex
From: Hans de Goede <hidden>
Date: 2024-06-01 20:05:59
Also in:
linux-leds, lkml, regressions
Hi All, On 5/31/24 12:22 PM, Hans de Goede wrote:
Hi, On 5/31/24 11:50 AM, Hans de Goede wrote:quoted
Hi, On 5/31/24 10:39 AM, Linux regression tracking (Thorsten Leemhuis) wrote:quoted
[adding the LED folks and the regressions list to the list of recipients] Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting for once, to make this easily accessible to everyone. Lee, Pavel, could you look into below regression report please? Thread starts here: https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/ (local) Another report with somewhat similar symptom can be found here: https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ (local) See also Russell's analysis of that report below (many many thx for that, much appreciated Russel!). To my untrained eyes all of this sounds a lot like we still have a 6.9 regression related to the LED code somewhere. Reminder, we had earlier trouble, but that was avoided through other measures: * 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") / https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@candelatech.com/ (local) * c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") / https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/ (local) * 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal") That iwlwifi commit even calls it self "work around". The developer that submitted it bisected the problem to a LED merge, but sadly that was the end of it. :-/I actually have been looking at a ledtrig-netdev lockdep warning yesterday which I believe is the same thing. I'll include the lockdep trace below. According to lockdep there indeed is a ABBA (ish) cyclic deadlock with the rtnl mutex vs led-triggers related locks. I believe that this problem may be a pre-existing problem but this now actually gets hit in kernels >= 6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is given, make hw_control trigger the default trigger"). Before that commit the "netdev" trigger would not be bound / set as phy LEDs trigger by default. +Cc Heiner Kallweit who authored that commit. The netdev trigger typically is not needed because the PHY LEDs are typically under hw-control and the netdev trigger even tries to leave things that way so setting it as the active trigger for the LED class device is basically a no-op. I guess the goal of that commit is correctly have the triggers file content reflect that the LED is controlled by a netdev and to allow changing the hw-control mode without the user first needing to set netdev as trigger before being able to change the mode. But there is a price to this, besides the locking problem this also causes the ledtrig-netdev module to load on pretty much everyones systems (when build as a module) even though 99.999% of our users likely does not need this at all... Given this price and the troubles this is causing I think it might be best to revert 66601a29bb23. There might still be a locking issue when setting the trigger to netdev manually (I'll check and follow up) but this should fix the regression users are hitting since typically users do not set the trigger manually.Ok, I can confirm that the lockdep warning is gone for me with 66601a29bb23 reverted. Unfortunately it does still happen after a "modprobe ledtrig_netdev" (to add it to the list of available triggers) and then setting the trigger for /sys/class/leds/enp42s0-0::lan to netdev manually. Still reverting 66601a29bb23 should avoid the problem getting triggered and this would seem like a safe fix especially for the 6.9 series and then the necessary time can be taken to fix the actual underlying locking issue which 66601a29bb23 exposes.
I recently wrote a new input-events LED trigger: https://lore.kernel.org/linux-leds/20240531135910.168965-2-hdegoede@redhat.com/ (local) and I just found out this has a very similar deadlock problem. It seems there it a generic problem with LEDs or LED triggers getting registered by subsystems while holding a global lock from that subsystem vs the activate / deactivate method of the trigger calling functions of that same subsystem which require that same global lock. I came up with a fix but that just fixed the activate() path leaving a similar deadlock in place in the deactivate path: https://lore.kernel.org/linux-leds/20240601195528.48308-1-hdegoede@redhat.com/ (local) https://lore.kernel.org/linux-leds/20240601195528.48308-3-hdegoede@redhat.com/ (local) So this seems to be a non trivial problem to solve. For the new input-events trigger I plan to solve this by switching to a single input_handler which will be registered at module_init() time instead of having a handler per LED and registering that handler at activate() time. But I have no idea how to solve this for the netdev trigger I just wanted to share my experience with the input-events trigger in case it is useful. Regards, Hans