Re: [PATCH v2] device property: Make modifications of fwnode "flags" thread safe

From: Doug Anderson <dianders@chromium.org>
Date: 2026-03-21 03:16:55
Also in: driver-core, imx, linux-acpi, linux-arm-kernel, linux-devicetree, linux-i2c, linux-spi, lkml, stable

Hi,

On Thu, Mar 19, 2026 at 10:52 AM Doug Anderson [off-list ref] wrote:

Hi,

On Thu, Mar 19, 2026 at 10:25 AM Saravana Kannan [off-list ref] wrote:

quoted

On Tue, Mar 17, 2026 at 9:04 AM Douglas Anderson [off-list ref] wrote:

quoted

In various places in the kernel, we modify the fwnode "flags" member
by doing either:
  fwnode->flags |= SOME_FLAG;
  fwnode->flags &= ~SOME_FLAG;

This type of modification is not thread-safe. If two threads are both
mucking with the flags at the same time then one can clobber the
other.

While flags are often modified while under the "fwnode_link_lock",
this is not universally true.

Create some accessor functions for setting, clearing, and testing the
FWNODE flags and move all users to these accessor functions. New
accessor functions use set_bit() and clear_bit(), which are
thread-safe.

Cc: stable@vger.kernel.org
Fixes: c2c724c868c4 ("driver core: Add fw_devlink_parse_fwtree()")
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
While this patch is not known for sure to fix any specific issues, it
seems possible that it could fix some rare problems. I'm currently
trying to track down a hard-to-reproduce heisenbug and one (currently
unproven) theory I had was that the fwnode flags could be getting
messed up like this. Even if turns out not to fix my heisenbug,
though, this seems like a worthwhile change to take.

Reviewed-by: Saravana Kannan <saravanak@kernel.org>

Thanks for the review!

quoted

Thanks Doug. Hope this isn't the cause of the hisenbug. If you report
it here, I might be able to take a look at it too (no promises).

I don't _think_ it fixes my bug, but I'm still not 100% sure because
the bug can take a day or so to reproduce and it appears to only
reproduce on official kernels built by the builder. :( This makes it
hard to say anything for certain and also hard for me to inject extra
debug logic.

Just in case anyone out there was wracking their brains based on my
description of the bug...

I've made progress in getting the issue to reproduce even with debug
information added. With that, I've found that
device_links_driver_bound() is getting called where `dev->fwnode->dev`
is NULL. That prevents it from running the ever-important
__fw_devlink_pickup_dangling_consumers().

I can see that device_add() has started, but it just hasn't made it to
the `dev->fwnode->dev = dev;` line yet. My printout next to that line
shows up _after_ my printout in device_links_driver_bound().

So obviously something can happen to cause the device to probe before
the call to bus_probe_device().

OK, I managed to get a stack crawl for when `dev->fwnode->dev ==
NULL`. It looks like this (FWIW, it's a 6.6 kernel but issue also
reproduces on our 6.12 kernel, and I see no reason it wouldn't
reproduce on mainline):

Call trace:
dump_backtrace+0xe8/0x108
show_stack+0x18/0x28
dump_stack_lvl+0x50/0x6c
dump_stack+0x18/0x24
device_links_driver_bound+0xa4/0x4b4
driver_bound+0x48/0x1c4
really_probe+0x244/0x374
__driver_probe_device+0xa0/0x12c
driver_probe_device+0x3c/0x218
__driver_attach+0x110/0x1ec
bus_for_each_dev+0x104/0x160
driver_attach+0x24/0x34
bus_add_driver+0x154/0x270
driver_register+0x68/0x104
__platform_driver_register+0x24/0x34
init_module+0x20/0xfe4 [max77779_pmic_pinctrl
e09198e651272bc5df70245355346d6eb1ba3a8f]
do_one_initcall+0xdc/0x360
do_init_module+0x58/0x23c
load_module+0xffc/0x1130
__arm64_sys_finit_module+0x260/0x300
invoke_syscall+0x58/0x114

It looks like what happens is that immediately after device_add()
calls bus_add_device() there's a possibility of another thread
inserting the module holding the device driver. That means that the
driver can start probing much earlier than we expect.

I've posted an RFC patch to fix this. If folks are interested, please review it:

https://lore.kernel.org/r/20260320200656.RFC.1.Id750b0fbcc94f23ed04b7aecabcead688d0d8c17@changeid (local)

Given the stack crawl I got, I'm fairly certain that this will fix the
problem, but I'll also let reboot tests run over the weekend to
confirm.

-Doug

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help