Thread (21 messages) 21 messages, 7 authors, 2024-06-24

Re: Hung tasks due to a AB-BA deadlock between the leds_list_lock rwsem and the rtnl mutex

From: Hans de Goede <hidden>
Date: 2024-06-01 20:05:59
Also in: linux-leds, lkml, regressions

Hi All,

On 5/31/24 12:22 PM, Hans de Goede wrote:
Hi,

On 5/31/24 11:50 AM, Hans de Goede wrote:
quoted
Hi,

On 5/31/24 10:39 AM, Linux regression tracking (Thorsten Leemhuis) wrote:
quoted
[adding the LED folks and the regressions list to the list of recipients]

Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
for once, to make this easily accessible to everyone.

Lee, Pavel, could you look into below regression report please? Thread
starts here:
https://lore.kernel.org/all/9d189ec329cfe68ed68699f314e191a10d4b5eda.camel@sapience.com/ (local)

Another report with somewhat similar symptom can be found here:
https://lore.kernel.org/lkml/e441605c-eaf2-4c2d-872b-d8e541f4cf60@gmail.com/ (local)

See also Russell's analysis of that report below (many many thx for
that, much appreciated Russel!).

To my untrained eyes all of this sounds a lot like we still have a 6.9
regression related to the LED code somewhere. Reminder, we had earlier
trouble, but that was avoided through other measures:

* 3d913719df14c2 ("wifi: iwlwifi: Use request_module_nowait") /
https://lore.kernel.org/lkml/30f757e3-73c5-5473-c1f8-328bab98fd7d@candelatech.com/ (local)

* c04d1b9ecce565 ("igc: Fix LED-related deadlock on driver unbind") /
https://lore.kernel.org/all/ZhRD3cOtz5i-61PB@mail-itl/ (local)

* 19fa4f2a85d777 ("r8169: fix LED-related deadlock on module removal")

That iwlwifi commit even calls it self "work around". The developer that
submitted it bisected the problem to a LED merge, but sadly that was the
end of it. :-/
I actually have been looking at a ledtrig-netdev lockdep warning yesterday
which I believe is the same thing. I'll include the lockdep trace below.

According to lockdep there indeed is a ABBA (ish) cyclic deadlock with
the rtnl mutex vs led-triggers related locks. I believe that this problem
may be a pre-existing problem but this now actually gets hit in kernels >=
6.9 because of commit 66601a29bb23 ("leds: class: If no default trigger is
given, make hw_control trigger the default trigger"). Before that commit
the "netdev" trigger would not be bound / set as phy LEDs trigger by default.

+Cc Heiner Kallweit who authored that commit.

The netdev trigger typically is not needed because the PHY LEDs are typically
under hw-control and the netdev trigger even tries to leave things that way
so setting it as the active trigger for the LED class device is basically
a no-op. I guess the goal of that commit is correctly have the triggers
file content reflect that the LED is controlled by a netdev and to allow
changing the hw-control mode without the user first needing to set netdev
as trigger before being able to change the mode.

But there is a price to this, besides the locking problem this also
causes the ledtrig-netdev module to load on pretty much everyones
systems (when build as a module) even though 99.999% of our users
likely does not need this at all...

Given this price and the troubles this is causing I think it might be best
to revert 66601a29bb23. There might still be a locking issue when setting
the trigger to netdev manually (I'll check and follow up) but this should
fix the regression users are hitting since typically users do not set
the trigger manually.
Ok, I can confirm that the lockdep warning is gone for me with 66601a29bb23
reverted. Unfortunately it does still happen after a "modprobe ledtrig_netdev"
(to add it to the list of available triggers) and then setting the trigger
for /sys/class/leds/enp42s0-0::lan to netdev manually.

Still reverting 66601a29bb23 should avoid the problem getting triggered
and this would seem like a safe fix especially for the 6.9 series and
then the necessary time can be taken to fix the actual underlying locking
issue which 66601a29bb23 exposes.
I recently wrote a new input-events LED trigger:
https://lore.kernel.org/linux-leds/20240531135910.168965-2-hdegoede@redhat.com/ (local)

and I just found out this has a very similar deadlock problem. It seems
there it a generic problem with LEDs or LED triggers getting registered
by subsystems while holding a global lock from that subsystem vs
the activate / deactivate method of the trigger calling functions of that
same subsystem which require that same global lock.

I came up with a fix but that just fixed the activate() path leaving
a similar deadlock in place in the deactivate path:

https://lore.kernel.org/linux-leds/20240601195528.48308-1-hdegoede@redhat.com/ (local)
https://lore.kernel.org/linux-leds/20240601195528.48308-3-hdegoede@redhat.com/ (local)

So this seems to be a non trivial problem to solve. For the new input-events
trigger I plan to solve this by switching to a single input_handler which will
be registered at module_init() time instead of having a handler per LED and
registering that handler at activate() time.

But I have no idea how to solve this for the netdev trigger I just wanted
to share my experience with the input-events trigger in case it is useful.

Regards,

Hans

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help