Re: ARM board lockups/hangs triggered by locks and mutexes
From: Rafał Miłecki <zajec5@gmail.com>
Date: 2023-08-18 20:24:49
Also in:
linux-arm-kernel, linux-clk, lkml
On 14.08.2023 11:04, Geert Uytterhoeven wrote:
> Hi Rafal,
>
> On Mon, Aug 7, 2023 at 1:11 PM Rafał Miłecki [off-list ref] wrote:
>> On 4.08.2023 13:07, Rafał Miłecki wrote:
>>> I triple checked that. Dropping a single unused function breaks
>>> kernel / device stability on BCM53573! AFAIK the only thing the diff
>>> below actually affects is the location of symbols (I actually
>>> verified that by comparing System.map before and after - over 22'000
>>> relocated symbols). Can some unfortunate location of symbols cause
>>> those hangs/lockups?
>>
>> I performed another experiment. First I dropped mtd_check_of_node() to
>> bring the kernel back to the stable state. Then I started adding
>> useless code to mtdchar_unlocked_ioctl(). I ended up adding just
>> enough to make sure all post-mtd symbols in System.map got the same
>> offset as in the case of backporting mtd_check_of_node(). I started
>> experiencing lockups/hangs again. I repeated the same test by adding
>> dumb code to brcm_nvram_probe() and verifying the offsets of the
>> symbols following brcm_nvram_probe. I believe this confirms that this
>> problem is about the offset or alignment of some specific symbol(s).
>> The remaining question is which symbols, and how to fix or work
>> around that.
>
> I had similar experiences on other ARM platforms many years ago:
> bisection led to something completely bogus, and it turned out adding a
> single line of innocent code made the system lock up or crash
> unexpectedly. It was definitely related to alignment, as adding the
> right extra amount of innocent code would fix the problem. Until some
> later change changing alignment again... I never found the real cause,
> but the problems went away over time. I am not sure I did enable all
> required errata config options, so I may have missed some...
I already experienced some weird performance variations on Broadcom's Northstar platform that were related to symbol layout & cache hit/miss ratio. For that reason I use -falign-functions=32 for OpenWrt's whole "bcm53xx" target (it covers Northstar and BCM53573). So this aspect should already be ruled out in my case.
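
For anyone wanting to repeat the System.map comparison described in the quoted messages, here is a minimal sketch of how it could be scripted. It is not the tooling actually used in this thread. The helper names (parse_system_map, relocated_symbols, misaligned_functions) are made up for illustration, and it assumes the usual "<hex address> <type> <name>" line format. It also treats every t/T symbol as a function, which is only an approximation, since assembly labels in the text section are not subject to -falign-functions.

```python
def parse_system_map(text):
    """Parse System.map content into {name: (address, type)}.

    Assumes one "<hex address> <type> <name>" entry per line.
    Static symbols can share a name; this sketch keeps the last one seen.
    """
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        addr, sym_type, name = parts
        symbols[name] = (int(addr, 16), sym_type)
    return symbols


def relocated_symbols(before, after):
    """Return {name: (old_addr, new_addr)} for symbols whose address changed."""
    common = before.keys() & after.keys()
    return {
        name: (before[name][0], after[name][0])
        for name in common
        if before[name][0] != after[name][0]
    }


def misaligned_functions(symbols, alignment=32):
    """Return sorted names of text symbols (type t/T) not on an `alignment` boundary."""
    return sorted(
        name
        for name, (addr, sym_type) in symbols.items()
        if sym_type in ("t", "T") and addr % alignment != 0
    )
```

Feeding it the contents of the two System.map files (stable and unstable build) gives the set of moved symbols, and misaligned_functions() shows which text symbols do not sit on a 32-byte boundary despite -falign-functions=32.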