Thread: 4 messages, 2 authors, 2021-10-18

Re: [PATCH V2] ACPI / APEI: restore interrupt before panic in sdei flow

From: James Morse <james.morse@arm.com>
Date: 2021-10-18 17:21:34
Also in: linux-acpi, lkml

Hi Liguang,

On 14/10/2021 15:18, 乱石 wrote:
On 2021/10/14 1:44, James Morse wrote:
On 12/10/2021 15:29, Liguang Zhang wrote:
When the HEST ACPI table configures the Hardware Error Notification type as
Software Delegated Exception (0x0B) for RAS events, OS RAS interacts with
ATF via the SDEI mechanism. On a firmware-first system, the OS is notified by
an SDEI call from ATF.
If a fatal RAS error occurred, panic() was called in sdei_asm_handle()
without ehf_deactivate_priority() being executed, which leaves interrupts masked.
So far the story is:
Firmware generated an SDEI event (a kind of software NMI) because of a firmware
interrupt, but it hasn't completely handled the interrupt.

If interrupts are masked, the system hangs in the kdump flow like this:

arm-smmu-v3 arm-smmu-v3.3.auto: allocated 65536 entries for cmdq
arm-smmu-v3 arm-smmu-v3.3.auto: allocated 32768 entries for evtq
arm-smmu-v3 arm-smmu-v3.3.auto: allocated 65536 entries for priq
arm-smmu-v3 arm-smmu-v3.3.auto: SMMU currently enabled! Resetting...
How and why do firmware interrupts affect the IOMMU?
[...]
Could you debug why firmware interrupts being active prevent the SMMU from being reset? As
far as I can tell, those should be totally independent.
If ehf_deactivate_priority() is not executed, the pmr_el1 register is not restored to
>0x80, which leaves non-secure interrupts masked. arm_smmu_device_probe() eventually
calls usleep_range(), which is based on hrtimer. Because non-secure timer interrupts are
masked, usleep_range() never returns.
Aha! So nothing to do with the SMMU at all.

Your firmware has 'disabled' the interrupt by moving the CPU's priority mask so that no
interrupts at all can be taken.
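
To make the masking concrete, here is a minimal sketch of the GIC priority-mask
rule as I understand it: an interrupt is only signalled when its priority value
is numerically lower than the mask. Everything below is a toy model, not kernel
or TF-A code. The function names and the example priority values (0xa0, 0xf0)
are assumptions for illustration; only the 0x80 threshold comes from the
discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy model of the GIC priority mask. Lower numeric value means higher
 * priority, and an interrupt is signalled only if its priority value is
 * strictly lower than the current mask.
 */
bool irq_can_be_taken(uint8_t irq_prio, uint8_t pmr)
{
	return irq_prio < pmr;
}

/*
 * Hypothetical stand-in for a timer-backed sleep such as usleep_range():
 * the sleeper only wakes if the timer interrupt can actually be delivered.
 */
bool timer_sleep_completes(uint8_t timer_prio, uint8_t pmr)
{
	return irq_can_be_taken(timer_prio, pmr);
}

/* Walks through the two situations described in the thread. */
void pmr_demo(void)
{
	const uint8_t timer_prio = 0xa0; /* assumed example priority */

	/* Mask left at 0x80: the timer interrupt is filtered out. */
	assert(!timer_sleep_completes(timer_prio, 0x80));

	/* Mask restored above the timer's priority: the sleep can finish. */
	assert(timer_sleep_completes(timer_prio, 0xf0));
}
```

With the mask left at 0x80, nothing with a priority value of 0x80 or above gets
through, which matches the hang seen in the SMMU probe.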

I still think this is best fixed in firmware.

Papering over the problem here is not enough as the handler may encounter memory
corruption, take an exception, and panic() from some other part of the kernel. It's RAS -
we know something has gone wrong before we get to this point.

The OS needs to be able to call panic() at any point in time.


Your firmware should not deny the normal-world interrupts like this.
Please either complete the interrupt handling before calling into the normal world,
or disable it if you need the interrupt to not fire again. If the device that triggers the
interrupt doesn't have a disable, there are hardware registers in the GIC to do this.
(I don't know how TFA works here, it may be a bug in the upstream code)
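
As a rough illustration of the "disable it in the GIC" option: for shared
peripheral interrupts, the distributor has write-one-to-clear enable registers
(GICD_ICENABLER<n>, starting at offset 0x180, 32 interrupts per register). The
sketch below only computes which register and bit a given INTID maps to. The
struct and function names are made up for this example, and whether TF-A would
use this path for the interrupt in question is an assumption on my part. On
GICv3, SGIs and PPIs are handled through the redistributor instead, which this
does not cover.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of GICD_ICENABLER0 within the distributor register frame. */
#define GICD_ICENABLER_BASE 0x180u

/* Hypothetical helper type: one register write that disables one INTID. */
struct icenabler_write {
	uint32_t offset; /* byte offset from the distributor base */
	uint32_t value;  /* write-one-to-clear bit for the INTID */
};

/*
 * Compute the GICD_ICENABLER<n> write that disables a given SPI.
 * Each register covers 32 interrupt IDs, one bit per ID.
 */
struct icenabler_write gicd_disable_spi(uint32_t intid)
{
	struct icenabler_write w;

	assert(intid >= 32); /* SPIs start at INTID 32 */

	w.offset = GICD_ICENABLER_BASE + 4u * (intid / 32u);
	w.value = 1u << (intid % 32u);
	return w;
}
```

For example, INTID 42 lands in the second register (offset 0x184) at bit 10.
Disabling the source this way leaves the priority mask alone, so the normal
world can still take its own interrupts.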



Thanks,

James

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel