Thread: 17 messages, 6 authors, 2021-07-26

Re: Random reboots on ODROID-N2+

From: Robin Murphy <robin.murphy@arm.com>
Date: 2021-07-23 16:16:02
Also in: linux-amlogic

On 2021-07-23 16:56, Stefan Agner wrote:
Hi Byron, Hi Robin,

Very interesting findings!

On 2021-07-23 17:36, Robin Murphy wrote:
On 2021-07-23 15:25, Byron Stanoszek wrote:
On Tue, 22 Jun 2021, Stefan Agner wrote:
On 2021-05-17 11:14, Stefan Agner wrote:
Hi,

We are currently testing a new release using Linux 5.10.33. Since then I've
received several reports of random reboots every couple of days.
Unfortunately the log (journald) doesn't show anything, just a hard cut
at some point.

After running serial console on several instances, I was able to catch
this stack trace:

[202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
[202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
[202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
[202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
[202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
<snip>

We do see those crashes in similar frequency with Linux 5.12:

[129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError

It seems load and/or hardware dependent since we see it on some devices
quite frequently (every few days), and on others it takes multiple weeks.
Of course the ones where we see it frequently are the ones in production :).

I am currently trying different stress-ng and other loads to accelerate
the crash rate before then trying to git bisect it.
I have an Odroid-N2+ and was able to track this problem down. The problem is
related to the following dmesg line that reads "failed to reserve memory"
below:

Machine model: Hardkernel ODROID-N2Plus
memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
In my 5.9 builds that line isn't present, and it seems all logs I stored
from 5.10 builds have the change already and show this line.
memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
OF: reserved mem: node linux,cma compatible matching fail
memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
...

A subsequent "cat /proc/iomem" shows that this memory region is still reserved
and the system appears to operate normally, until eventually the SError
Interrupt comes in under heavy memory/page-cache usage. The difference with
later kernels is that now the memory at 0x5000000-0x52fffff is registered under
the "System RAM" memory area, whereas previous kernels had dropped it from
"System RAM".

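One way to check for the symptom described above is to scan /proc/iomem
for a "System RAM" entry that still covers the secmon range. The following
is only a minimal userspace sketch: the helper name range_in_system_ram()
is hypothetical, and it assumes the usual "start-end : name" line layout.
Note that /proc/iomem prints zeroed addresses to unprivileged users, so the
real file has to be read as root.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel or libc API): returns true if any
 * "System RAM" line in an iomem-style listing fully covers [start, end].
 * Nested entries such as "  05000000-052fffff : reserved" are parsed too,
 * but only lines named exactly "System RAM" are considered.
 */
static bool range_in_system_ram(FILE *f, unsigned long long start,
                                unsigned long long end)
{
    char line[256];

    while (fgets(line, sizeof(line), f)) {
        unsigned long long lo, hi;
        char name[128];

        /* %llx skips leading indentation; name runs to end of line. */
        if (sscanf(line, "%llx-%llx : %127[^\n]", &lo, &hi, name) != 3)
            continue;
        if (strcmp(name, "System RAM") == 0 && lo <= start && end <= hi)
            return true;
    }
    return false;
}
```

On an affected kernel one would expect this to return true for
0x05000000-0x052fffff, and false once the region has been marked no-map
and dropped from "System RAM".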
The culprit is this new code introduced in Linux 5.12, in this function in
drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
It seems that patch also got backported, so that is why I see it with
5.10 as well.
int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
                                          phys_addr_t size, bool nomap)
{
          if (nomap) {
                  /*
                   * If the memory is already reserved (by another region), we
                   * should not allow it to be marked nomap.
                   */
                  if (memblock_is_region_reserved(base, size))  <------
                          return -EBUSY;                        <------

                  return memblock_mark_nomap(base, size);
          }
          return memblock_reserve(base, size);
}

"nomap" is true, due to this text being present in the FDT:

     reserved-memory {
       ranges;
       secmon_reserved: secmon@5000000 {
         reg = <0x0 0x05000000 0x0 0x300000>;
         no-map;
       };
       ...

But memblock_is_region_reserved() is returning true because there is already an
entry for 0x5000000-0x52fffff in the memory map, which is already marked
reserved, at the time the __reserved_mem_reserve_reg() function is called.
(Perhaps this is being set reserved by u-boot? -- I did not research that far.)

This function is defined as:

bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
{
          return memblock_overlaps_region(&memblock.reserved, base, size);
}

Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
reserved region "0x5000000-0x52fffff", the function returns true.
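
To illustrate why an identical region counts as "overlapping", here is a
simplified userspace model of the half-open interval test. This is a sketch
only, not the kernel source: struct region, addrs_overlap() and
overlaps_any() are names made up for the example, and the real memblock
code also caps the size against overflow first.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of the reserved list. */
struct region {
    uint64_t base;
    uint64_t size;
};

/* Two half-open ranges [base, base + size) intersect. */
static bool addrs_overlap(uint64_t base1, uint64_t size1,
                          uint64_t base2, uint64_t size2)
{
    return base1 < base2 + size2 && base2 < base1 + size1;
}

/* True if [base, base + size) intersects any entry in the list. */
static bool overlaps_any(const struct region *regions, size_t cnt,
                         uint64_t base, uint64_t size)
{
    for (size_t i = 0; i < cnt; i++)
        if (addrs_overlap(base, size, regions[i].base, regions[i].size))
            return true;
    return false;
}
```

With the two reservations from the dmesg output above in the list
(0x08210000 + 0x9a0000 and 0x05000000 + 0x300000), a query for
base 0x05000000, size 0x300000 reports an overlap, while a directly
adjacent range starting at 0x05300000 does not.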

If I comment out the "if (memblock_is_region_reserved(base, size))" code and
allow it to mark the region no-map, then the memory area is properly removed
from the "System RAM" area and the crashing stops.

I've had the system up and running for 15 days now under heavy load without any
crashes, using just the following patch as workaround:

--- linux-5.13.0/drivers/of/fdt.c.bak    2021-07-07 00:22:58.000000000 -0400
+++ linux-5.13.0/drivers/of/fdt.c    2021-07-07 00:23:08.000000000 -0400
@@ -1157,13 +1157,6 @@
                       phys_addr_t size, bool nomap)
   {
       if (nomap) {
-        /*
-         * If the memory is already reserved (by another region), we
-         * should not allow it to be marked nomap.
-         */
-        if (memblock_is_region_reserved(base, size))
-            return -EBUSY;
-
           return memblock_mark_nomap(base, size);
       }
       return memblock_reserve(base, size);

The above patch applies to later versions of Linux 5.10.x through 5.12.x as
well.
Even though it is probably not the correct solution, I'll give this a try
on my end just to verify that we are indeed experiencing the same problem
and that the change fixes it for me too.
Perhaps a more proper fix is to allow the no-map to still proceed, in the case
that the existing reserved region is identical (same start/end) to the region
getting marked no-map.
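
The idea in the paragraph above could be modelled roughly as follows. This
is a userspace sketch of the suggestion only, not a tested kernel patch and
not necessarily how the issue should be fixed upstream; struct region and
check_nomap() are hypothetical names introduced for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of the reserved list. */
struct region {
    uint64_t base;
    uint64_t size;
};

/*
 * Returns 0 if [base, base + size) may be marked no-map, -EBUSY otherwise.
 * An existing reservation with exactly the same base and size is tolerated;
 * any other overlap is still rejected.
 */
static int check_nomap(const struct region *reserved, size_t cnt,
                       uint64_t base, uint64_t size)
{
    for (size_t i = 0; i < cnt; i++) {
        const struct region *r = &reserved[i];

        /* No intersection with this entry: keep looking. */
        if (!(base < r->base + r->size && r->base < base + size))
            continue;
        /* Identical start and size: allow the no-map to proceed. */
        if (r->base == base && r->size == size)
            continue;
        return -EBUSY;
    }
    return 0;
}
```

In this model the secmon case (an existing reservation at 0x05000000 of
size 0x300000, and a no-map request for the same range) returns 0, while a
request that only partially overlaps that reservation still gets -EBUSY.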
If U-Boot is marking regions with the wrong type/attributes in the EFI
memory map, then the best thing to do would be to fix that. I see a
fairly recent commit which looks suspiciously relevant:

https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004
It seems that this patch went into U-Boot 2021.04 which I am using, so
that (alone) seems not to fix the mapping.
Booting with "efi=debug" should (among other things) print the memory
map at boot if you want to double-check that that is the source of the
mismatch. Our EFI code should be perfectly capable of setting the
memblock flag if the region *is* described appropriately, see
reserve_regions() in drivers/firmware/efi/efi-init.c.
Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
[    0.000000] Machine model: Hardkernel ODROID-N2Plus
[    0.000000] efi: Getting UEFI parameters from /chosen in DT:
[    0.000000] efi: UEFI not found.
[    0.000000] OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB

So it seems UEFI is not in play here?
Ah, OK, in that case I guess the question remains why does 
early_init_dt_reserve_memory_arch() think the region is already 
reserved? My instinctive assumption was an EFI memory map being present; 
seeing that U-Boot does indeed reflect DT reservations there *and* has 
had a likely-looking bug recently was then just overwhelmingly suggestive :)

Robin.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel