[RFC] [PATCH] arm64: survive after access to unimplemented register
From: mark.rutland@arm.com (Mark Rutland)
Date: 2016-03-31 13:12:51
Also in:
lkml
On Thu, Mar 31, 2016 at 03:28:59PM +0300, Yury Norov wrote:
Hi Mark, On Thu, Mar 31, 2016 at 11:05:48AM +0100, Mark Rutland wrote:quoted
On Thu, Mar 31, 2016 at 05:27:03AM +0300, Yury Norov wrote:quoted
Not all vendors implement all the system registers ARM specifies.The ID registers in question are precisely documented in the ARM ARM (see table C5-6 in ARM DDI 0487A.i). Specifically, the ID space ID_AA64MMFR2_EL1 now falls in to is listed as RAZ. Any deviation from this is an erratum, and needs to be handled as such (e.g. listing in silicon-errata.txt). Does the issue affect ThunderX natively?Yes, Thunder is involved, but I cannot tell more due to NDA. And this error is not in silicon-errata.txt. I'll ask permission to share more details.
Ok. Regardless of how this is solved, we need to know the details of the erratum (and need an entry in silicon-errata.txt).
quoted
quoted
So access them causes undefined instruction abort and then kernel panic. There are 3 ways to handle it we can figure out: - use conditional compilation and erratas; - use kernel patching; - inline fixups resolving the abort. Last option is more robust as it does not require additional efforts to support targers. It is looking reasonable because in many cases optional registers should be RAZ if not implemented. Special cases may be handled by underlying __read_cpuid() when needed.I don't think we should do this if the only affected implementations are software emulators which can be patched (and have already been, in the case of QEMU). In future it's very likely that early assembly code (potentially in hypervisor context) will need to access ID registers which are currently reserved/RAZ, and it will be rather painful to fix up accesses to this.So we will not fix. This one fixes el1 only, and don't pretend for more.
At some point, it's practically guaranteed that we will have to access reserved/RAZ ID registers in other cases, so we _will_ need workarounds that cater for those sooner or later. We need to consider how we can handle those, in case it implies constraints on our solution elsewhere, or requires a more complex, but more general solution (which we can implement part of today). For example: * The sanity checks code will perform many back-to-back register accesses. Trapping lots of these could be expensive, so not performing the MRS at all when known to be unsafe may be preferable. * Some registers may be read in a hot/critical path, or potentially in a context where we cannot handle trapping (e.g. early boot code or parts of KVM). In some cases, patching may be preferable to an MRS that only gets executed depending on a branch condition. Before we can do any of this, we need to know the conditions of the erratum, however.
quoted
Additionally, this workaround will silently mask other bugs in this area (e.g. if registers like ID_AA64MMFR0_EL1 were to trap for some reason on an implementation), which doesn't seem good.We can mask it less silently, for example, print message to dmesg. Initially I was thinking about erratas as well, but Arnd suggested this approach, and now think it's better. From consumer point of view, it's much better to have a warning line in dmesg, instead of bricked device, after another kernel or driver update.
Having some warning is certainly better, though I think we need to scream _very loudly_ for cases we do not expect, as non-fatal warnings are easily/often ignored, and can later turn out to be more critical than previously believed. Thanks, Mark.