Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault
From: Dave Hansen <hidden>
Date: 2021-07-08 16:58:57
Also in: kvm, linux-crypto, linux-efi, linux-mm, lkml, platform-driver-x86
On 7/8/21 9:48 AM, Brijesh Singh wrote:
> On 7/8/21 10:30 AM, Dave Hansen wrote:
>>> The reason for iterating through the 2MB region is: if the faulting address is not assigned in the RMP table, and the page table walk level is 2MB, then one of the entries within the large page is the root cause of the fault. Since we don't know which entry, I dump all the non-zero entries.
>> Logically you can figure this out though, right? Why throw 511 entries at the console when we *know* they're useless?
> Logically it's going to be tricky to figure out which exact entry caused the fault, hence I dump any non-zero entry. I understand it may dump some useless entries.
What's tricky about it? Sure, there's a possibility that more than one entry could contribute to a fault. But, you always know *IF* an entry could contribute to a fault. I'm fine if you run through the logic, don't find a known reason (specific RMP entry) for the fault, and dump the whole table in that case. But, unconditionally polluting the kernel log with noise isn't very nice for debugging.
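To make the suggestion concrete, here is a minimal user-space sketch of the "targeted dump first, full dump only as a fallback" idea. Everything in it is an assumption for illustration: `struct rmp_entry_model`, the `MODEL_ASSIGNED` bit position, `could_cause_fault()` and `dump_rmp_region()` are hypothetical names and do not reflect the real RMP entry layout or the code in the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRIES_PER_2MB 512                 /* 2 MiB / 4 KiB pages */
#define MODEL_ASSIGNED  (UINT64_C(1) << 0)  /* hypothetical bit position */

/* Hypothetical stand-in for a 16-byte RMP entry; not the real layout. */
struct rmp_entry_model {
    uint64_t lo;
    uint64_t hi;
};

/* Assumption: in this model, only an assigned entry is a "known reason". */
static int could_cause_fault(const struct rmp_entry_model *e)
{
    return (e->lo & MODEL_ASSIGNED) != 0;
}

static void print_entry(size_t idx, const struct rmp_entry_model *e)
{
    printf("RMP[%zu]: [0x%016" PRIx64 " 0x%016" PRIx64 "]\n",
           idx, e->hi, e->lo);
}

/*
 * Pass 1: print only the entries that could have caused the fault.
 * Pass 2 (fallback): if none was found, print every non-zero entry.
 * Returns the number of entries printed.
 */
static int dump_rmp_region(const struct rmp_entry_model *region)
{
    int printed = 0;
    size_t i;

    assert(region != NULL);

    for (i = 0; i < ENTRIES_PER_2MB; i++) {
        if (could_cause_fault(&region[i])) {
            print_entry(i, &region[i]);
            printed++;
        }
    }
    if (printed)
        return printed;

    /* No known reason found: fall back to dumping all non-zero entries. */
    for (i = 0; i < ENTRIES_PER_2MB; i++) {
        if (region[i].lo || region[i].hi) {
            print_entry(i, &region[i]);
            printed++;
        }
    }
    return printed;
}
```

With one assigned entry and two other non-zero entries in the region, this prints only the assigned one; the noisy full dump appears only when no entry matches a known reason.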
>>> There are two cases which we need to consider:
>>> 1) the faulting page is guest private (aka assigned)
>>> 2) the faulting page is a hypervisor page (aka shared)
>>> We will primarily be seeing #1. In this case, we know it's an assigned page, and we can decode the fields. #2 will happen in rare conditions,
>> What rare conditions?
> One such condition is the RMP "in-use" bit being set; see patch 20/40. After applying the patch we should not see the "in-use" bit set. If we run into similar issues, a full RMP dump will greatly help debugging.
OK... so dump the "in-use" bit here if you see it.
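As a rough illustration of what "dump the in-use bit" could look like, the sketch below names the flag in the output instead of leaving it buried in a raw hex value. The bit positions `MODEL_ASSIGNED` and `MODEL_IN_USE` and the helper `format_rmp_entry()` are hypothetical; the thread itself notes that the in-use bit was not yet documented, so its real position is not assumed here.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MODEL_ASSIGNED (UINT64_C(1) << 0)  /* hypothetical bit position */
#define MODEL_IN_USE   (UINT64_C(1) << 1)  /* hypothetical bit position */

/*
 * Format one model entry with its flags spelled out, followed by the raw
 * values. Returns what snprintf() returns: the length the full string
 * needs, excluding the terminating NUL.
 */
static int format_rmp_entry(char *buf, size_t len, uint64_t lo, uint64_t hi)
{
    assert(buf != NULL && len > 0);

    return snprintf(buf, len,
                    "RMP entry: assigned=%d in-use=%d "
                    "raw=[0x%016" PRIx64 " 0x%016" PRIx64 "]",
                    (lo & MODEL_ASSIGNED) != 0,
                    (lo & MODEL_IN_USE) != 0,
                    hi, lo);
}
```

Keeping the raw values alongside the decoded flags preserves whatever the decoder does not understand, while a reader scanning the log can still spot `in-use=1` directly.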
>>> if it happens, one of the undocumented bits in the RMP entry can provide us some useful information, hence we dump the raw values.
>> You're saying that there are things that can cause RMP faults that aren't documented? That's rather nasty for your users, don't you think?
> The "in-use" bit in the RMP entry caught me off guard. The AMD APM does say that hardware sets the in-use bit, but it *never* explained in detail how to check whether the fault was due to the in-use bit in the RMP table. As I said, the documentation folks will be updating the RMP entry to document the in-use bit. I hope we will not see any other undocumented surprises; I am keeping my fingers crossed :)
Oh, ok. That sounds fine. Documentation is out of date all the time.