Re: Random reboots on ODROID-N2+
From: Robin Murphy <robin.murphy@arm.com>
Date: 2021-07-23 16:16:02
Also in: linux-amlogic
On 2021-07-23 16:56, Stefan Agner wrote:
Hi Byron, Hi Robin,

Very interesting findings!

On 2021-07-23 17:36, Robin Murphy wrote:
On 2021-07-23 15:25, Byron Stanoszek wrote:
On Tue, 22 Jun 2021, Stefan Agner wrote:
On 2021-05-17 11:14, Stefan Agner wrote:
Hi,

We are currently testing a new release using Linux 5.10.33. I have since received several reports of random reboots every couple of days. Unfortunately the log (journald) doesn't show anything, just a hard cut at some point. After running a serial console on several instances, I was able to catch this stack trace:

[202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
[202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
[202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
[202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
[202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
<snip>

We do see those crashes at a similar frequency with Linux 5.12:

[129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError

It seems to be load and/or hardware dependent, since we see it on some devices quite frequently (every few days), and on others it takes multiple weeks. Of course the ones where we see it frequently are the ones in production :). I am currently trying different stress-ng and other loads to accelerate the crash rate before then trying to git bisect it.

I have an Odroid-N2+ and was able to track this problem down. The problem is related to the following dmesg line that reads "failed to reserve memory" below:

Machine model: Hardkernel ODROID-N2Plus
memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB

In my 5.9 builds that line isn't present, and it seems all logs I stored from 5.10 builds have the change already and show this line.
memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
OF: reserved mem: node linux,cma compatible matching fail
memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
...

A subsequent "cat /proc/iomem" shows that this memory region is still reserved and the system appears to operate normally, until eventually the SError Interrupt comes in under heavy memory/page-cache usage.

The difference with later kernels is that now the memory at 0x5000000-0x52fffff is registered under the "System RAM" memory area, whereas previous kernels had dropped it from "System RAM".

The culprit is this new code introduced in Linux 5.12, in this function in drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():

It seems that patch also got backported, so that is why I see it with 5.10 as well.
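To check on your own board whether the secmon range ended up inside "System RAM", the /proc/iomem inspection described above can be scripted. The following is only a minimal sketch: the helper names are made up for illustration, it looks at top-level entries only, and it assumes the usual "start-end : name" line layout. Note that /proc/iomem shows zeroed addresses to unprivileged readers, so it needs to be read as root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: parse one top-level /proc/iomem line of the form
 *   "start-end : name"
 * and report whether it is a "System RAM" range that contains addr.
 * Indented lines are child resources and are ignored here.
 */
bool iomem_line_is_ram_containing(const char *line, unsigned long long addr)
{
	unsigned long long start, end;
	char name[64];

	if (line[0] == ' ')
		return false;
	if (sscanf(line, "%llx-%llx : %63[^\n]", &start, &end, name) != 3)
		return false;
	return strcmp(name, "System RAM") == 0 && addr >= start && addr <= end;
}

/* Scan a whole stream in /proc/iomem format for a matching RAM range. */
bool addr_in_system_ram(FILE *f, unsigned long long addr)
{
	char line[256];

	assert(f != NULL);
	while (fgets(line, sizeof(line), f))
		if (iomem_line_is_ram_containing(line, addr))
			return true;
	return false;
}
```

Calling addr_in_system_ram() on an fopen("/proc/iomem", "r") stream with 0x5000000 should return true on an affected kernel and false on one where the no-map marking took effect, if the description above holds for your setup.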
int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
					phys_addr_t size, bool nomap)
{
	if (nomap) {
		/*
		 * If the memory is already reserved (by another region), we
		 * should not allow it to be marked nomap.
		 */
		if (memblock_is_region_reserved(base, size))   <------
			return -EBUSY;                         <------

		return memblock_mark_nomap(base, size);
	}
	return memblock_reserve(base, size);
}

"nomap" is true, due to this text being present in the FDT:

	reserved-memory {
		ranges;
		secmon_reserved: secmon@5000000 {
			reg = <0x0 0x05000000 0x0 0x300000>;
			no-map;
		};
		...

But memblock_is_region_reserved() is returning true because there is already an entry for 0x5000000-0x52fffff in the memory map, which is already marked reserved, at the time the __reserved_mem_reserve_reg() function is called. (Perhaps this is being set reserved by u-boot? -- I did not research that far.)

This function is defined as:

bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
{
	return memblock_overlaps_region(&memblock.reserved, base, size);
}

Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing reserved region "0x5000000-0x52fffff", the function returns true.

If I comment out the "if (memblock_is_region_reserved(base, size))" code and allow it to mark the region no-map, then the memory area is properly removed from the "System RAM" area and the crashing stops.

I've had the system up and running for 15 days now under heavy load without any crashes, using just the following patch as a workaround:

--- linux-5.13.0/drivers/of/fdt.c.bak	2021-07-07 00:22:58.000000000 -0400
+++ linux-5.13.0/drivers/of/fdt.c	2021-07-07 00:23:08.000000000 -0400
@@ -1157,13 +1157,6 @@
 					phys_addr_t size, bool nomap)
 {
 	if (nomap) {
-		/*
-		 * If the memory is already reserved (by another region), we
-		 * should not allow it to be marked nomap.
-		 */
-		if (memblock_is_region_reserved(base, size))
-			return -EBUSY;
-
 		return memblock_mark_nomap(base, size);
 	}
 	return memblock_reserve(base, size);

The above patch applies to later versions of Linux 5.10.x through 5.12.x as well.

Even though it is probably not the correct solution, I'll give this a try on my end just to verify we are indeed experiencing the same problem and the change fixes it for me too.
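To make the overlap reasoning above concrete, here is a small userspace model of the check. It is a sketch under stated assumptions, not the kernel code: struct region and both function names are invented for illustration, ranges are treated as half-open [base, base + size), and address wrap-around is ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one reserved region: [base, base + size). */
struct region {
	uint64_t base;
	uint64_t size;
};

/* Two half-open ranges overlap iff each one starts before the other ends. */
bool ranges_overlap(uint64_t base1, uint64_t size1,
		    uint64_t base2, uint64_t size2)
{
	return base1 < base2 + size2 && base2 < base1 + size1;
}

/* Model of the check: does [base, base + size) touch any reserved region? */
bool model_is_region_reserved(const struct region *reserved, int n,
			      uint64_t base, uint64_t size)
{
	assert(n >= 0);
	for (int i = 0; i < n; i++)
		if (ranges_overlap(base, size,
				   reserved[i].base, reserved[i].size))
			return true;
	return false;
}
```

With a single reserved entry { 0x05000000, 0x300000 }, asking about the identical range reports an overlap, which is the situation in which the no-map request is refused with -EBUSY. A range starting at 0x05300000 does not overlap, because the end of each range is exclusive.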
Perhaps a more proper fix is to allow the no-map to still proceed, in the case that the existing reserved region is identical (same start/end) to the region getting marked no-map.

If U-Boot is marking regions with the wrong type/attributes in the EFI memory map, then the best thing to do would be to fix that. I see a fairly recent commit which looks suspiciously relevant:

https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004

It seems that this patch went into U-Boot 2021.04 which I am using, so that (alone) seems not to fix the mapping.
Booting with "efi=debug" should (among other things) print the memory map at boot if you want to double-check that that is the source of the mismatch. Our EFI code should be perfectly capable of setting the memblock flag if the region *is* described appropriately, see reserve_regions() in drivers/firmware/efi/efi-init.c.

Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:

[ 0.000000] Machine model: Hardkernel ODROID-N2Plus
[ 0.000000] efi: Getting UEFI parameters from /chosen in DT:
[ 0.000000] efi: UEFI not found.
[ 0.000000] OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB

So it seems UEFI is not in play here?
Ah, OK, in that case I guess the question remains: why does early_init_dt_reserve_memory_arch() think the region is already reserved? My instinctive assumption was an EFI memory map being present; seeing that U-Boot does indeed reflect DT reservations there *and* has had a likely-looking bug recently was then just overwhelmingly suggestive :)

Robin.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel