Thread: 10 messages, 4 authors, 2022-08-01

Re: bcm2711_thermal: Kernel panic - not syncing: Asynchronous SError Interrupt

From: Nicolas Saenz Julienne <hidden>
Date: 2021-02-10 16:57:25
Also in: linux-pm, lkml

Hi Robin,

On Wed, 2021-02-10 at 16:25 +0000, Robin Murphy wrote:
On 2021-02-10 13:15, Nicolas Saenz Julienne wrote:
quoted
[ Add Robin, Catalin and Florian in case they want to chime in ]

Hi Juerg, thanks for the report!

On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:
quoted
Trying to dump the BCM2711 registers kills the kernel:

# cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
0-efc
# cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers

[   62.857661] SError Interrupt on CPU1, code 0xbf000002 -- SError
So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
SError,' hence IIUC the rest of the error code is meaningless to anyone outside
of Broadcom/RPi.
It's imp-def from the architecture's PoV, but the implementation in this 
case is Cortex-A72, where 0x000002 means an attributable, containable 
Slave Error:

https://developer.arm.com/documentation/100095/0003/system-control/aarch64-register-descriptions/exception-syndrome-register--el1-and-el3?lang=en

In other words, the thing at the other end of an interconnect 
transaction said "no" :)

(The fact that Cortex-A72 gets too far ahead of itself to take it as a 
synchronous external abort is a mild annoyance, but hey...)
Thanks to both of you for the clarifications! Reading Arm documentation is a
skill of its own.
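For readers following along, the decoding discussed above can be sketched in a few lines of C. This is only an illustration: the field positions (EC in bits [31:26], IDS in bit 24) follow the architectural ESR_ELx layout, while the macro and function names are made up for this example and are not kernel identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* ESR_ELx layout: EC in bits [31:26], IL in bit 25, ISS in bits [24:0]. */
#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      0x3fu
#define ESR_EC_SERROR    0x2fu        /* exception class: SError interrupt */
#define ESR_IDS          (1u << 24)   /* set: syndrome is implementation defined */
#define ESR_IMPDEF_MASK  0x00ffffffu  /* ISS bits [23:0], meaningful per CPU when IDS is set */

/* Hypothetical helper: extract the exception class. */
unsigned int esr_ec(uint32_t esr)
{
    return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
}

/* Hypothetical helper: is the syndrome implementation defined? */
bool esr_ids(uint32_t esr)
{
    return (esr & ESR_IDS) != 0;
}

/* Hypothetical helper: the implementation-defined part of the syndrome. */
uint32_t esr_impdef_syndrome(uint32_t esr)
{
    return esr & ESR_IMPDEF_MASK;
}
```

For the reported code 0xbf000002 this yields EC 0x2f (SError), IDS set, and an implementation-defined syndrome of 0x000002, which is the value to look up in the Cortex-A72 documentation linked above.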
quoted
The regmap is created through the following syscon device:

	avs_monitor: avs-monitor@7d5d2000 {
		compatible = "brcm,bcm2711-avs-monitor",
			     "syscon", "simple-mfd";
		reg = <0x7d5d2000 0xf00>;

		thermal: thermal {
			compatible = "brcm,bcm2711-thermal";
			#thermal-sensor-cells = <0>;
		};
	};

I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
full of addresses that trigger this same error. Also note that as per Florian's
comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
tell, at least 0x7d5d22b0 seems to be faulty too.

Any ideas/comments? My guess is that those addresses are somehow marked as
secure and accessible only to VC4 (the RPi4's co-processor). Ultimately,
the solution is to narrow the register range exposed by avs-monitor to whatever
bcm2711-thermal needs (which is ATM a single 32-bit register).
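The narrowing idea can be modelled in plain C as a sketch. It mimics the concept behind a "readable register" filter (the kernel's regmap has a readable_reg callback in struct regmap_config for this purpose), but it is a userspace model rather than driver code. The offset 0x200 is an assumption for illustration (it is the start of the AVS_RO_REGISTERS_0 block quoted above), and all names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AVS_WINDOW_SIZE     0xf00u  /* size from the reg property quoted above */
#define TEMP_STATUS_OFFSET  0x200u  /* ASSUMPTION: the single register the thermal driver reads */
#define REG_STRIDE          4u      /* 32-bit registers */

/* Hypothetical filter: only the one register we actually need is readable. */
bool avs_readable_reg(unsigned int reg)
{
    return reg == TEMP_STATUS_OFFSET;
}

/*
 * Walk the window the way a register dump would, but record only the
 * offsets the filter allows. Returns the number of readable registers
 * found; at most max offsets are stored in out.
 */
size_t avs_dump_offsets(unsigned int *out, size_t max)
{
    size_t n = 0;

    for (unsigned int reg = 0; reg < AVS_WINDOW_SIZE; reg += REG_STRIDE) {
        if (!avs_readable_reg(reg))
            continue;           /* never touch offsets that may fault */
        if (n < max)
            out[n] = reg;
        n++;
    }
    return n;
}
```

With such a filter a dump touches one offset instead of walking all 0xf00 bytes, so offsets like 0x2b0 are never accessed.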
When a peripheral decodes a region of address space, nobody says it has 
to accept accesses to *every* address in that space; registers may be 
sparsely populated, and although some devices might be "nice" and make 
unused areas behave as RAZ/WI, others may throw slave errors if you poke 
at the wrong places. As you note, in a TrustZone-aware device some 
registers may only exist in one or other of the Secure/Non-Secure 
address spaces.

Even when there is a defined register at a given address, it still 
doesn't necessarily accept all possible types of access; it wouldn't be 
particularly friendly, but a device *could* have, say, some registers 
that support 32-bit accesses and others that only support 16-bit 
accesses, and thus throw slave errors if you do the wrong thing in the 
wrong place.

It really all depends on the device itself.
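To make the behaviours described above concrete, here is a toy model of a device's read path. Everything in it is invented for illustration (register offsets, values, names); it describes no real hardware and only shows how one address window can mix width-restricted registers, "nice" RAZ holes, and holes that answer with a slave error.

```c
#include <stdint.h>

enum bus_resp { BUS_OKAY, BUS_SLVERR };

/*
 * Hypothetical device:
 *   0x00        register that accepts 32-bit reads only
 *   0x10        register that accepts 16-bit reads only
 *   other <0x80 unused, but "nice": reads as zero (RAZ)
 *   >= 0x80     unused and unfriendly: slave error
 */
enum bus_resp toy_dev_read(uint32_t offset, unsigned int width_bits, uint32_t *val)
{
    switch (offset) {
    case 0x00:
        if (width_bits != 32)
            return BUS_SLVERR;  /* wrong access size for this register */
        *val = 0x12345678u;
        return BUS_OKAY;
    case 0x10:
        if (width_bits != 16)
            return BUS_SLVERR;  /* wrong access size for this register */
        *val = 0xbeefu;
        return BUS_OKAY;
    default:
        if (offset < 0x80) {
            *val = 0;           /* RAZ hole */
            return BUS_OKAY;
        }
        return BUS_SLVERR;      /* the device says "no" */
    }
}
```

In this model, a blind 32-bit walk over the window succeeds at 0x00, fails at 0x10, and fails again from 0x80 onwards, even though every address is decoded by the same device.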
All in all, assuming there is no special device quirk to apply, my feeling is
that we should just let the error be. As you hint, the firmware is not to blame
here, and debugfs is a 'best effort, zero guarantees' interface after all.

Regards,
Nicolas