Re: ARM board lockups/hangs triggered by locks and mutexes
From: Rafał Miłecki <zajec5@gmail.com>
Date: 2023-08-02 07:02:23
Also in: linux-arm-kernel, linux-clk, lkml
On 2.08.2023 00:25, Florian Fainelli wrote:
> Hi Rafal,
>
> On 8/1/23 15:10, Rafał Miłecki wrote:
>> Hi,
>>
>> Years ago I added support for Broadcom's BCM53573 SoCs. We released
>> firmwares based on Linux 4.4 (and later on 4.14) that worked almost
>> fine. There was one little issue we couldn't debug or fix: random
>> hangs and reboots. They were too rare to deal with (most devices
>> worked fine for weeks or months).
>>
>> Recently I updated my stable kernel 5.4 and I started experiencing
>> stability issues on my own! After some uptime (usually from 0 to 20
>> minutes of close to zero activity) serial console hangs. I can't type
>> anything and I stop getting any messages. I've to wait about a minute
>> for watchdog to kick in and reboot device.
>>
>> #####
>>
>> I took that great chance and decided to track the regression.
>>
>> Linux 5.4 stable branch worked stable up to the release v5.4.197.
>> Starting with v5.4.198 I started experiencing those stability issues.
>> I bisected it down to the commit 4460066eb248 ("ipv6: fix locking
>> issues with loops over idev->addr_list"):
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=4460066eb2480b9e203c73755e12e2efc820a27e
>>
>> With above commit reverted I was able to use stable 5.4 branch up to
>> the release v5.4.207. Starting with v5.4.208 it got unstable again.
>> I bisected it down to:
>> commit d0d583484d2e ("locking/refcount: Consolidate implementations of refcount_t")
>> commit dab787c73f6e ("locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions")
>> commit 0d3182fbe689 ("locking/refcount: Move saturation warnings out of line")
>> commit 809554147d60 ("locking/refcount: Improve performance of generic REFCOUNT_FULL code")
>> commit 9c9269977f03 ("locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header")
>> commit 04bff7d7b808 ("locking/refcount: Remove unused refcount_*_checked() variants")
>> commit 513b19a43bec ("locking/refcount: Ensure integer operands are treated as signed")
>> commit 68b4ee68e8c8 ("locking/refcount: Define constants for saturation and max refcount values")
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=d0d583484d2ed9f5903edbbfa7e2a68f78b950b0
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=dab787c73f6e38d8e7ed3c1e683385e8f0fe28a2
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=0d3182fbe689e3808c03b6cde6be98237f9e0a4a
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=809554147d609163cfbaf815c443c575b538a7ef
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=9c9269977f03ab9c448c8b71581a951e0eb4fb7b
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=04bff7d7b8081c4bb2e8171be31d33df297eee5b
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=513b19a43becee5f7af6d283bb9d3d241a8a21a8
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.4.y&id=68b4ee68e8c8800cf8d6b61cc74b4031a0742a4c
>> (I didn't actually check above commits individually).
>>
>> Reverting above locking/refcount commits worked fine for few
>> releases: up to the v5.4.219. Starting with v5.4.220 I got hangs
>> again. I bisected that down to the commit 131287ff833d ("once: add
>> DO_ONCE_SLOW() for sleepable contexts").
>>
>> Reverting that extra commit from v5.4.238 allows me to run Linux for
>> hours again (currently 3 devices x 6 hours and counting).
>>
>> So I need in total 10+1 reverts from 5.4 branch to get a stable
>> kernel.
>>
>> #####
>>
>> I'm clueless at this point.
>> Is that possible kernel has some locking bug I can hit only using
>> this specific SoC?
>>
>> BCM53573s have a single ARM Cortex-A7 CPU running at 900 MHz. The
>> only unusual thing about this hw I can think of is a slow arch timer
>> running at 36,8 kHz.
>
> From the look of it, it seems like the CPU might have bugs with
> atomics? Your log indicates that your Cortex-A7 is r0p5 which is
> described to be susceptible to ARM_ERRATA_814220, do you have it
> enabled by any chance, if not, can you enable it and see if makes any
> difference?

I had it disabled. Unfortunately CONFIG_ARM_ERRATA_814220=y doesn't help.
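
For anyone repeating this test on their own build, here is a rough sketch of how one might confirm whether the erratum option actually ended up in the final .config. It is only an illustration: the helper name config_option_state and the sample config text are made up for this example, and it assumes the usual .config conventions (an "OPTION=value" line when set, a "# OPTION is not set" comment when disabled).

```python
import re


def config_option_state(config_text, option):
    """Return the value of `option` in a kernel .config text.

    Returns the text after '=' (e.g. 'y' or 'm') when the option is set,
    and 'n' when it is explicitly disabled or missing altogether.
    `config_option_state` is a hypothetical helper, not a kernel tool.
    """
    set_re = re.compile(r"^%s=(.*)$" % re.escape(option))
    for line in config_text.splitlines():
        line = line.strip()
        match = set_re.match(line)
        if match:
            return match.group(1)
        if line == "# %s is not set" % option:
            return "n"
    return "n"


# Made-up sample config fragments, for illustration only.
before = "# CONFIG_ARM_ERRATA_814220 is not set\n"
after = "CONFIG_ARM_ERRATA_814220=y\n"

print(config_option_state(before, "CONFIG_ARM_ERRATA_814220"))
print(config_option_state(after, "CONFIG_ARM_ERRATA_814220"))
```

The same function can be fed the contents of the build tree's .config, or of /proc/config.gz after decompression if the kernel was built with CONFIG_IKCONFIG_PROC.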