Re: [PATCH v2] device property: Make modifications of fwnode "flags" thread safe
From: Doug Anderson <dianders@chromium.org>
Date: 2026-03-21 03:16:55
Also in:
driver-core, imx, linux-acpi, linux-arm-kernel, linux-devicetree, linux-i2c, linux-spi, lkml, stable
Hi, On Thu, Mar 19, 2026 at 10:52 AM Doug Anderson [off-list ref] wrote:
Hi, On Thu, Mar 19, 2026 at 10:25 AM Saravana Kannan [off-list ref] wrote:quoted
On Tue, Mar 17, 2026 at 9:04 AM Douglas Anderson [off-list ref] wrote:quoted
In various places in the kernel, we modify the fwnode "flags" member by doing either: fwnode->flags |= SOME_FLAG; fwnode->flags &= ~SOME_FLAG; This type of modification is not thread-safe. If two threads are both mucking with the flags at the same time then one can clobber the other. While flags are often modified while under the "fwnode_link_lock", this is not universally true. Create some accessor functions for setting, clearing, and testing the FWNODE flags and move all users to these accessor functions. New accessor functions use set_bit() and clear_bit(), which are thread-safe. Cc: stable@vger.kernel.org Fixes: c2c724c868c4 ("driver core: Add fw_devlink_parse_fwtree()") Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: Mark Brown <broonie@kernel.org> Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Douglas Anderson <dianders@chromium.org> --- While this patch is not known for sure to fix any specific issues, it seems possible that it could fix some rare problems. I'm currently trying to track down a hard-to-reproduce heisenbug and one (currently unproven) theory I had was that the fwnode flags could be getting messed up like this. Even if turns out not to fix my heisenbug, though, this seems like a worthwhile change to take.Reviewed-by: Saravana Kannan <saravanak@kernel.org>Thanks for the review!quoted
Thanks Doug. Hope this isn't the cause of the hisenbug. If you report it here, I might be able to take a look at it too (no promises).I don't _think_ it fixes my bug, but I'm still not 100% sure because the bug can take a day or so to reproduce and it appears to only reproduce on official kernels built by the builder. :( This makes it hard to say anything for certain and also hard for me to inject extra debug logic.
Just in case anyone out there was wracking their brains based on my description of the bug... I've made progress in getting the issue to reproduce even with debug information added. With that, I've found that device_links_driver_bound() is getting called where `dev->fwnode->dev` is NULL. That prevents it from running the ever-important __fw_devlink_pickup_dangling_consumers(). I can see that device_add() has started, but it just hasn't made it to the `dev->fwnode->dev = dev;` line yet. My printout next to that line shows up _after_ my printout in device_links_driver_bound(). So obviously something can happen to cause the device to probe before the call to bus_probe_device(). OK, I managed to get a stack crawl for when `dev->fwnode->dev == NULL`. It looks like this (FWIW, it's a 6.6 kernel but issue also reproduces on our 6.12 kernel, and I see no reason it wouldn't reproduce on mainline): Call trace: dump_backtrace+0xe8/0x108 show_stack+0x18/0x28 dump_stack_lvl+0x50/0x6c dump_stack+0x18/0x24 device_links_driver_bound+0xa4/0x4b4 driver_bound+0x48/0x1c4 really_probe+0x244/0x374 __driver_probe_device+0xa0/0x12c driver_probe_device+0x3c/0x218 __driver_attach+0x110/0x1ec bus_for_each_dev+0x104/0x160 driver_attach+0x24/0x34 bus_add_driver+0x154/0x270 driver_register+0x68/0x104 __platform_driver_register+0x24/0x34 init_module+0x20/0xfe4 [max77779_pmic_pinctrl e09198e651272bc5df70245355346d6eb1ba3a8f] do_one_initcall+0xdc/0x360 do_init_module+0x58/0x23c load_module+0xffc/0x1130 __arm64_sys_finit_module+0x260/0x300 invoke_syscall+0x58/0x114 It looks like what happens is that immediately after device_add() calls bus_add_device() there's a possibility of another thread inserting the module holding the device driver. That means that the driver can start probing much earlier than we expect. I've posted an RFC patch to fix this. If folks are interested, please review it: https://lore.kernel.org/r/20260320200656.RFC.1.Id750b0fbcc94f23ed04b7aecabcead688d0d8c17@changeid (local) Given the stack crawl I got, I'm fairly certain that this will fix the problem, but I'll also let reboot tests run over the weekend to confirm. -Doug