Re: [PATCH Part2 RFC v4 10/40] x86/fault: Add support to handle the RMP fault for user address
From: Dave Hansen <hidden>
Date: 2021-07-08 16:19:33
Also in:
kvm, linux-coco, linux-crypto, linux-mm, lkml, platform-driver-x86
Oh, here's the THP code. The subject just changed.
On 7/7/21 11:35 AM, Brijesh Singh wrote:
When SEV-SNP is enabled globally, a write from the host goes through the RMP check. When the host writes to pages, hardware checks the following conditions at the end of page walk:
1. Assigned bit in the RMP table is zero (i.e page is shared).
2. If the page table entry that gives the sPA indicates that the target page size is a large page, then all RMP entries for the 4KB constituting pages of the target must have the assigned bit 0.
3. Immutable bit in the RMP table is not zero.
The hardware will raise page fault if one of the above conditions is not met. Try resolving the fault instead of taking fault again and again. If the host attempts to write to the guest private memory then send the SIGBUG signal to kill the process. If the page level between the host and
"SIGBUG"?
RMP entry does not match, then split the address to keep the RMP and host page levels in sync.
---
 arch/x86/mm/fault.c | 69 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/mm.h | 6 +++-
 mm/memory.c | 13 +++++++++
 3 files changed, 87 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 195149eae9b6..cdf48019c1a7 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1281,6 +1281,58 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
 }
 NOKPROBE_SYMBOL(do_kern_addr_fault);
+#define RMP_FAULT_RETRY 0
+#define RMP_FAULT_KILL 1
+#define RMP_FAULT_PAGE_SPLIT 2
+
+static inline size_t pages_per_hpage(int level)
+{
+	return page_level_size(level) / PAGE_SIZE;
+}
+
+static int handle_user_rmp_page_fault(unsigned long hw_error_code, unsigned long address)
+{
+	unsigned long pfn, mask;
+	int rmp_level, level;
+	struct rmpentry *e;
+	pte_t *pte;
+
+	if (unlikely(!cpu_feature_enabled(X86_FEATURE_SEV_SNP)))
+		return RMP_FAULT_KILL;
Shouldn't this be a WARN_ON_ONCE()? How can we get RMP faults without SEV-SNP?
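A minimal user-space sketch of the pattern being asked for: warn once on a state that should be impossible, then still bail out. The `WARN_ON_ONCE()` here is a simplified stand-in for the kernel macro (one shared flag rather than one per call site), and `rmp_fault_entry_check()` and `snp_enabled` are hypothetical names standing in for the check on `cpu_feature_enabled(X86_FEATURE_SEV_SNP)`. None of this is the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define RMP_FAULT_RETRY 0
#define RMP_FAULT_KILL  1

static int warn_count;	/* number of warnings actually printed */

/*
 * Simplified stand-in for the kernel's WARN_ON_ONCE(): report the first
 * time the condition is true and always return the condition, so it can
 * sit inside an if (). The real macro keeps one flag per call site.
 */
static bool warn_on_once(bool cond, const char *what)
{
	static bool warned;

	assert(what != NULL);
	if (cond && !warned) {
		warned = true;
		warn_count++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}
#define WARN_ON_ONCE(cond) warn_on_once(!!(cond), #cond)

/* Hypothetical entry check: an RMP fault without SEV-SNP is a bug. */
static int rmp_fault_entry_check(bool snp_enabled)
{
	if (WARN_ON_ONCE(!snp_enabled))
		return RMP_FAULT_KILL;
	return RMP_FAULT_RETRY;
}
```

Calling `rmp_fault_entry_check(false)` repeatedly keeps returning `RMP_FAULT_KILL`, but only the first call prints a warning.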
+	/* Get the native page level */
+	pte = lookup_address_in_mm(current->mm, address, &level);
+	if (unlikely(!pte))
+		return RMP_FAULT_KILL;
What would this mean? There was an RMP fault on a non-present page? How could that happen? What if there was a race between an unmapping event and the RMP fault delivery?
+	pfn = pte_pfn(*pte);
+	if (level > PG_LEVEL_4K) {
+		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
+		pfn |= (address >> PAGE_SHIFT) & mask;
+	}
This looks inherently racy. What happens if there are two parallel RMP faults on the same 2M page? One of them splits the page tables, the other gets a fault for an already-split page table. Is that handled here somehow?
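For reference, the mask arithmetic in the hunk above can be checked in user space. This is a sketch under stated assumptions: 4K base pages, 9 index bits per page-table level, and `PG_LEVEL_4K/2M/1G` numbered 1/2/3. `fault_pfn()` is a hypothetical wrapper around the two quoted lines, and `page_level_size()` is reimplemented locally rather than taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed x86-style constants for this sketch. */
#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define PG_LEVEL_4K	1
#define PG_LEVEL_2M	2
#define PG_LEVEL_1G	3

/* Local reimplementation: bytes mapped by one entry at this level. */
static unsigned long page_level_size(int level)
{
	assert(level >= PG_LEVEL_4K && level <= PG_LEVEL_1G);
	return 1UL << (PAGE_SHIFT + 9 * (level - 1));
}

static size_t pages_per_hpage(int level)
{
	return page_level_size(level) / PAGE_SIZE;
}

/* Hypothetical wrapper around the pfn computation quoted above. */
static unsigned long fault_pfn(unsigned long head_pfn, unsigned long address,
			       int level)
{
	unsigned long pfn = head_pfn, mask;

	if (level > PG_LEVEL_4K) {
		mask = pages_per_hpage(level) - pages_per_hpage(level - 1);
		pfn |= (address >> PAGE_SHIFT) & mask;
	}
	return pfn;
}
```

With these assumptions, a fault at page index 0x203 inside a 2M page whose head PFN is 0x1000 yields 0x1003. For a 1G mapping the mask evaluates to 0x3fe00, so the low nine bits of the page index are not folded in and the result is 2M-aligned.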
+	/* Get the page level from the RMP entry. */
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &rmp_level);
+	if (!e)
+		return RMP_FAULT_KILL;
The snp_lookup_page_in_rmptable() failure cases look WARN-worthy. Either you're doing a lookup for something not *IN* the RMP table, or you don't support SEV-SNP, in which case you shouldn't be in this code in the first place.
+	/*
+	 * Check if the RMP violation is due to the guest private page access.
+	 * We can not resolve this RMP fault, ask to kill the guest.
+	 */
+	if (rmpentry_assigned(e))
+		return RMP_FAULT_KILL;
No "We's", please. Speak in imperative voice.
+	/*
+	 * The backing page level is higher than the RMP page level, request
+	 * to split the page.
+	 */
+	if (level > rmp_level)
+		return RMP_FAULT_PAGE_SPLIT;
This can theoretically trigger on a hugetlbfs page. Right? I thought I asked about this before... more below...
+	return RMP_FAULT_RETRY;
+}
+
 /*
  * Handle faults in the user portion of the address space. Nothing in here
  * should check X86_PF_USER without a specific justification: for almost
@@ -1298,6 +1350,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	vm_fault_t fault;
+	int ret;
 	unsigned int flags = FAULT_FLAG_DEFAULT;
 	tsk = current;
@@ -1378,6 +1431,22 @@ void do_user_addr_fault(struct pt_regs *regs,
 	if (error_code & X86_PF_INSTR)
 		flags |= FAULT_FLAG_INSTRUCTION;
+	/*
+	 * If its an RMP violation, try resolving it.
+	 */
+	if (error_code & X86_PF_RMP) {
+		ret = handle_user_rmp_page_fault(error_code, address);
+		if (ret == RMP_FAULT_PAGE_SPLIT) {
+			flags |= FAULT_FLAG_PAGE_SPLIT;
+		} else if (ret == RMP_FAULT_KILL) {
+			fault |= VM_FAULT_SIGBUS;
+			do_sigbus(regs, error_code, address, fault);
+			return;
+		} else {
+			return;
+		}
+	}
Why not just have handle_user_rmp_page_fault() return a VM_FAULT_* code directly? I also suspect you can just set VM_FAULT_SIGBUS and let the do_sigbus() call later on in the function do its work.
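One way the "return a VM_FAULT_* code directly" idea could look, modeled in user space so it can be compiled and tested. This is only a sketch: `struct rmp_fault_info` and `classify_rmp_fault()` are hypothetical, the `VM_FAULT_*` values are local stand-ins, and using `VM_FAULT_NOPAGE` to mean "nothing to fix, let the access retry" is an assumption of this example rather than something the patch or the review specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int vm_fault_t;

/* Stand-in values for this sketch; the kernel defines the real ones. */
#define VM_FAULT_SIGBUS		0x0002u
#define VM_FAULT_NOPAGE		0x0100u
#define FAULT_FLAG_PAGE_SPLIT	(1u << 10)	/* value from the quoted patch */

/* Hypothetical snapshot of what the handler looks up. */
struct rmp_fault_info {
	bool snp_enabled;	/* SEV-SNP feature present */
	bool pte_present;	/* page-table lookup succeeded */
	bool rmp_entry_found;	/* RMP table lookup succeeded */
	bool assigned;		/* RMP entry marks the page guest-private */
	int level;		/* host page-table level */
	int rmp_level;		/* level recorded in the RMP entry */
};

/*
 * Return a VM_FAULT_* style code instead of a private RMP_FAULT_* enum.
 * 0 means "continue into the normal fault path"; the split request is
 * passed along through *flags.
 */
static vm_fault_t classify_rmp_fault(const struct rmp_fault_info *info,
				     unsigned int *flags)
{
	assert(info != NULL && flags != NULL);

	if (!info->snp_enabled || !info->pte_present || !info->rmp_entry_found)
		return VM_FAULT_SIGBUS;
	if (info->assigned)
		return VM_FAULT_SIGBUS;	/* host touched guest-private memory */
	if (info->level > info->rmp_level) {
		*flags |= FAULT_FLAG_PAGE_SPLIT;
		return 0;
	}
	return VM_FAULT_NOPAGE;		/* assumption: nothing to do, retry */
}
```

A caller shaped like this would return early on `VM_FAULT_NOPAGE`, fall through to the normal fault path on 0, and leave `VM_FAULT_SIGBUS` to whatever error handling the function already has.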
 	 * Faults in the vsyscall page might need emulation. The
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 322ec61d0da7..211dfe5d3b1d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -450,6 +450,8 @@ extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_PAGE_SPLIT: The fault was due page size mismatch, split the
+ * region to smaller page size and retry.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -481,6 +483,7 @@ enum fault_flag {
 	FAULT_FLAG_REMOTE = 1 << 7,
 	FAULT_FLAG_INSTRUCTION = 1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE = 1 << 9,
+	FAULT_FLAG_PAGE_SPLIT = 1 << 10,
 };
 /*
@@ -520,7 +523,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags)
 	{ FAULT_FLAG_USER, "USER" }, \
 	{ FAULT_FLAG_REMOTE, "REMOTE" }, \
 	{ FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \
-	{ FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }
+	{ FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \
+	{ FAULT_FLAG_PAGE_SPLIT, "PAGESPLIT" }
 /*
  * vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/mm/memory.c b/mm/memory.c
index 730daa00952b..aef261d94e33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4407,6 +4407,15 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	return 0;
 }
+static int handle_split_page_fault(struct vm_fault *vmf)
+{
+	if (!IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT))
+		return VM_FAULT_SIGBUS;
+
+	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
+	return 0;
+}
What will this do when you hand it a hugetlbfs page?