Inter-revision diff: patch 16

Comparing v5 (message) to v8 (message)

--- v5
+++ v8
@@ -1,146 +1,350 @@
-From: Yu-cheng Yu <yu-cheng.yu@intel.com>
-
-The CPU performs "shadow stack accesses" when it expects to encounter
-shadow stack mappings. These accesses can be implicit (via CALL/RET
-instructions) or explicit (instructions like WRSS).
-
-Shadow stack accesses to shadow-stack mappings can result in faults in
-normal, valid operation just like regular accesses to regular mappings.
-Shadow stacks need some of the same features like delayed allocation, swap
-and copy-on-write. The kernel needs to use faults to implement those
-features.
-
-The architecture has concepts of both shadow stack reads and shadow stack
-writes. Any shadow stack access to non-shadow stack memory will generate
-a fault with the shadow stack error code bit set.
-
-This means that, unlike normal write protection, the fault handler needs
-to create a type of memory that can be written to (with instructions that
-generate shadow stack writes), even to fulfill a read access. So in the
-case of COW memory, the COW needs to take place even with a shadow stack
-read. Otherwise the page will be left (shadow stack) writable in
-userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
-for shadow stack accesses, even if the access was a shadow stack read.
-
-For the purpose of making this clearer, consider the following example.
-If a process has a shadow stack, and forks, the shadow stack PTEs will
-become read-only due to COW. If the CPU in one process performs a shadow
-stack read access to the shadow stack, for example executing a RET and
-causing the CPU to read the shadow stack copy of the return address, then
-in order for the fault to be resolved the PTE will need to be set with
-shadow stack permissions. But then the memory would be changeable from
-userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger
-COW, otherwise the shared page would be changeable from both processes.
-
-Shadow stack accesses can also result in errors, such as when a shadow
-stack overflows, or if a shadow stack access occurs to a non-shadow-stack
-mapping. Also, generate the errors for invalid shadow stack accesses.
-
+The recently introduced _PAGE_SAVED_DIRTY should be used instead of the
+HW Dirty bit whenever a PTE is Write=0, in order to not inadvertently
+create shadow stack PTEs. Update pte_mk*() helpers to do this, and apply
+the same changes to pmd and pud.
+
+For pte_modify() this is a bit trickier. It takes a "raw" pgprot_t which
+was not necessarily created with any of the existing PTE bit helpers.
+That means that it can return a pte_t with Write=0,Dirty=1, a shadow
+stack PTE, when it did not intend to create one.
+
+Modify it to also move _PAGE_DIRTY to _PAGE_SAVED_DIRTY. To avoid
+creating Write=0,Dirty=1 PTEs, pte_modify() needs to avoid:
+1. Marking Write=0 PTEs Dirty=1
+2. Marking Dirty=1 PTEs Write=0
+
+The first case cannot happen as the existing behavior of pte_modify() is to
+filter out any Dirty bit passed in newprot. Handle the second case by
+shifting _PAGE_DIRTY=1 to _PAGE_SAVED_DIRTY=1 if the PTE was write
+protected by the pte_modify() call. Apply the same changes to
+pmd_modify().
+
+Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
+Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
+Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
+Reviewed-by: Kees Cook <keescook@chromium.org>
+Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
 Tested-by: Pengfei Xu <pengfei.xu@intel.com>
 Tested-by: John Allen <john.allen@amd.com>
-Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
-Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
-Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
+Tested-by: Kees Cook <keescook@chromium.org>
 ---
-
-v5:
- - Add description of COW example (Boris)
- - Replace "permissioned" (Boris)
- - Remove capitalization of shadow stack (Boris)
+v6:
+ - Rename _PAGE_COW to _PAGE_SAVED_DIRTY (David Hildenbrand)
+ - Open code _PAGE_SAVED_DIRTY part in pte_modify() (Boris)
+ - Change the logic so the open coded part is not too ugly
+ - Merge pte_modify() patch with this one because of the above
 
 v4:
- - Further improve comment talking about FAULT_FLAG_WRITE (Peterz)
-
-v3:
- - Improve comment talking about using FAULT_FLAG_WRITE (Peterz)
-
-v2:
- - Update commit log with verbiage/feedback from Dave Hansen
- - Clarify reasoning for FAULT_FLAG_WRITE for all shadow stack accesses
- - Update comments with some verbiage from Dave Hansen
-
- arch/x86/include/asm/trap_pf.h |  2 ++
- arch/x86/mm/fault.c            | 38 ++++++++++++++++++++++++++++++++++
- 2 files changed, 40 insertions(+)
-
-diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
-index 10b1de500ab1..afa524325e55 100644
---- a/arch/x86/include/asm/trap_pf.h
-+++ b/arch/x86/include/asm/trap_pf.h
-@@ -11,6 +11,7 @@
-  *   bit 3 ==				1: use of reserved bit detected
-  *   bit 4 ==				1: fault was an instruction fetch
-  *   bit 5 ==				1: protection keys block access
-+ *   bit 6 ==				1: shadow stack access fault
-  *   bit 15 ==				1: SGX MMU page-fault
+ - Break apart patch for better bisectability
+---
+ arch/x86/include/asm/pgtable.h | 168 ++++++++++++++++++++++++++++-----
+ 1 file changed, 145 insertions(+), 23 deletions(-)
+
+diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
+index 349fcab0405a..05dfdbdf96b4 100644
+--- a/arch/x86/include/asm/pgtable.h
++++ b/arch/x86/include/asm/pgtable.h
+@@ -124,9 +124,17 @@ extern pmdval_t early_pmd_flags;
+  * The following only work if pte_present() is true.
+  * Undefined behaviour if not..
   */
- enum x86_pf_error_code {
-@@ -20,6 +21,7 @@ enum x86_pf_error_code {
- 	X86_PF_RSVD	=		1 << 3,
- 	X86_PF_INSTR	=		1 << 4,
- 	X86_PF_PK	=		1 << 5,
-+	X86_PF_SHSTK	=		1 << 6,
- 	X86_PF_SGX	=		1 << 15,
- };
- 
-diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
-index 7b0d4ab894c8..070b50c87415 100644
---- a/arch/x86/mm/fault.c
-+++ b/arch/x86/mm/fault.c
-@@ -1138,8 +1138,22 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
- 				       (error_code & X86_PF_INSTR), foreign))
- 		return 1;
- 
-+	/*
-+	 * Shadow stack accesses (PF_SHSTK=1) are only permitted to
-+	 * shadow stack VMAs. All other accesses result in an error.
-+	 */
-+	if (error_code & X86_PF_SHSTK) {
-+		if (unlikely(!(vma->vm_flags & VM_SHADOW_STACK)))
-+			return 1;
-+		if (unlikely(!(vma->vm_flags & VM_WRITE)))
-+			return 1;
-+		return 0;
-+	}
-+
- 	if (error_code & X86_PF_WRITE) {
- 		/* write, present and write, not present: */
-+		if (unlikely(vma->vm_flags & VM_SHADOW_STACK))
-+			return 1;
- 		if (unlikely(!(vma->vm_flags & VM_WRITE)))
- 			return 1;
- 		return 0;
-@@ -1331,6 +1345,30 @@ void do_user_addr_fault(struct pt_regs *regs,
- 
- 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
- 
-+	/*
-+	 * When a page becomes COW it changes from a shadow stack permission
-+	 * page (Write=0,Dirty=1) to (Write=0,Dirty=0,CoW=1), which is simply
-+	 * read-only to the CPU. When shadow stack is enabled, a RET would
-+	 * normally pop the shadow stack by reading it with a "shadow stack
-+	 * read" access. However, in the COW case the shadow stack memory does
-+	 * not have shadow stack permissions, it is read-only. So it will
-+	 * generate a fault.
-+	 *
-+	 * For conventionally writable pages, a read can be serviced with a
-+	 * read only PTE, and COW would not have to happen. But for shadow
-+	 * stack, there isn't the concept of read-only shadow stack memory.
-+	 * If it is shadow stack permission, it can be modified via CALL and
-+	 * RET instructions. So COW needs to happen before any memory can be
-+	 * mapped with shadow stack permissions.
-+	 *
-+	 * Shadow stack accesses (read or write) need to be serviced with
-+	 * shadow stack permission memory, so in the case of a shadow stack
-+	 * read access, treat it as a WRITE fault so both COW will happen and
-+	 * the write fault path will tickle maybe_mkwrite() and map the memory
-+	 * shadow stack.
-+	 */
-+	if (error_code & X86_PF_SHSTK)
-+		flags |= FAULT_FLAG_WRITE;
- 	if (error_code & X86_PF_WRITE)
- 		flags |= FAULT_FLAG_WRITE;
- 	if (error_code & X86_PF_INSTR)
+-static inline int pte_dirty(pte_t pte)
++static inline bool pte_dirty(pte_t pte)
+ {
+-	return pte_flags(pte) & _PAGE_DIRTY;
++	return pte_flags(pte) & _PAGE_DIRTY_BITS;
++}
++
++static inline bool pte_shstk(pte_t pte)
++{
++	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
++		return false;
++
++	return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
+ }
+ 
+ static inline int pte_young(pte_t pte)
+@@ -134,9 +142,18 @@ static inline int pte_young(pte_t pte)
+ 	return pte_flags(pte) & _PAGE_ACCESSED;
+ }
+ 
+-static inline int pmd_dirty(pmd_t pmd)
++static inline bool pmd_dirty(pmd_t pmd)
+ {
+-	return pmd_flags(pmd) & _PAGE_DIRTY;
++	return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
++}
++
++static inline bool pmd_shstk(pmd_t pmd)
++{
++	if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
++		return false;
++
++	return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) ==
++	       (_PAGE_DIRTY | _PAGE_PSE);
+ }
+ 
+ #define pmd_young pmd_young
+@@ -145,9 +162,9 @@ static inline int pmd_young(pmd_t pmd)
+ 	return pmd_flags(pmd) & _PAGE_ACCESSED;
+ }
+ 
+-static inline int pud_dirty(pud_t pud)
++static inline bool pud_dirty(pud_t pud)
+ {
+-	return pud_flags(pud) & _PAGE_DIRTY;
++	return pud_flags(pud) & _PAGE_DIRTY_BITS;
+ }
+ 
+ static inline int pud_young(pud_t pud)
+@@ -157,13 +174,21 @@ static inline int pud_young(pud_t pud)
+ 
+ static inline int pte_write(pte_t pte)
+ {
+-	return pte_flags(pte) & _PAGE_RW;
++	/*
++	 * Shadow stack pages are logically writable, but do not have
++	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
++	 */
++	return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
+ }
+ 
+ #define pmd_write pmd_write
+ static inline int pmd_write(pmd_t pmd)
+ {
+-	return pmd_flags(pmd) & _PAGE_RW;
++	/*
++	 * Shadow stack pages are logically writable, but do not have
++	 * _PAGE_RW.  Check for them separately from _PAGE_RW itself.
++	 */
++	return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
+ }
+ 
+ #define pud_write pud_write
+@@ -342,7 +367,16 @@ static inline pte_t pte_clear_saveddirty(pte_t pte)
+ 
+ static inline pte_t pte_wrprotect(pte_t pte)
+ {
+-	return pte_clear_flags(pte, _PAGE_RW);
++	pte = pte_clear_flags(pte, _PAGE_RW);
++
++	/*
++	 * Blindly clearing _PAGE_RW might accidentally create
++	 * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
++	 * dirty value to the software bit.
++	 */
++	if (pte_dirty(pte))
++		pte = pte_mksaveddirty(pte);
++	return pte;
+ }
+ 
+ #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+@@ -380,7 +414,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
+ 
+ static inline pte_t pte_mkclean(pte_t pte)
+ {
+-	return pte_clear_flags(pte, _PAGE_DIRTY);
++	return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
+ }
+ 
+ static inline pte_t pte_mkold(pte_t pte)
+@@ -395,7 +429,19 @@ static inline pte_t pte_mkexec(pte_t pte)
+ 
+ static inline pte_t pte_mkdirty(pte_t pte)
+ {
+-	return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
++	pteval_t dirty = _PAGE_DIRTY;
++
++	/* Avoid creating Dirty=1,Write=0 PTEs */
++	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pte_write(pte))
++		dirty = _PAGE_SAVED_DIRTY;
++
++	return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
++}
++
++static inline pte_t pte_mkwrite_shstk(pte_t pte)
++{
++	/* pte_clear_saveddirty() also sets Dirty=1 */
++	return pte_clear_saveddirty(pte);
+ }
+ 
+ static inline pte_t pte_mkyoung(pte_t pte)
+@@ -412,7 +458,12 @@ struct vm_area_struct;
+ 
+ static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+ {
+-	return pte_mkwrite_kernel(pte);
++	pte = pte_mkwrite_kernel(pte);
++
++	if (pte_dirty(pte))
++		pte = pte_clear_saveddirty(pte);
++
++	return pte;
+ }
+ 
+ static inline pte_t pte_mkhuge(pte_t pte)
+@@ -481,7 +532,15 @@ static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
+ 
+ static inline pmd_t pmd_wrprotect(pmd_t pmd)
+ {
+-	return pmd_clear_flags(pmd, _PAGE_RW);
++	pmd = pmd_clear_flags(pmd, _PAGE_RW);
++	/*
++	 * Blindly clearing _PAGE_RW might accidentally create
++	 * a shadow stack PMD (RW=0, Dirty=1). Move the hardware
++	 * dirty value to the software bit.
++	 */
++	if (pmd_dirty(pmd))
++		pmd = pmd_mksaveddirty(pmd);
++	return pmd;
+ }
+ 
+ #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+@@ -508,12 +567,23 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
+ 
+ static inline pmd_t pmd_mkclean(pmd_t pmd)
+ {
+-	return pmd_clear_flags(pmd, _PAGE_DIRTY);
++	return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
+ }
+ 
+ static inline pmd_t pmd_mkdirty(pmd_t pmd)
+ {
+-	return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
++	pmdval_t dirty = _PAGE_DIRTY;
++
++	/* Avoid creating (HW)Dirty=1, Write=0 PMDs */
++	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pmd_write(pmd))
++		dirty = _PAGE_SAVED_DIRTY;
++
++	return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
++}
++
++static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
++{
++	return pmd_clear_saveddirty(pmd);
+ }
+ 
+ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+@@ -533,7 +603,12 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
+ 
+ static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+ {
+-	return pmd_set_flags(pmd, _PAGE_RW);
++	pmd = pmd_set_flags(pmd, _PAGE_RW);
++
++	if (pmd_dirty(pmd))
++		pmd = pmd_clear_saveddirty(pmd);
++
++	return pmd;
+ }
+ 
+ static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
+@@ -577,17 +652,32 @@ static inline pud_t pud_mkold(pud_t pud)
+ 
+ static inline pud_t pud_mkclean(pud_t pud)
+ {
+-	return pud_clear_flags(pud, _PAGE_DIRTY);
++	return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
+ }
+ 
+ static inline pud_t pud_wrprotect(pud_t pud)
+ {
+-	return pud_clear_flags(pud, _PAGE_RW);
++	pud = pud_clear_flags(pud, _PAGE_RW);
++
++	/*
++	 * Blindly clearing _PAGE_RW might accidentally create
++	 * a shadow stack PUD (RW=0, Dirty=1). Move the hardware
++	 * dirty value to the software bit.
++	 */
++	if (pud_dirty(pud))
++		pud = pud_mksaveddirty(pud);
++	return pud;
+ }
+ 
+ static inline pud_t pud_mkdirty(pud_t pud)
+ {
+-	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
++	pudval_t dirty = _PAGE_DIRTY;
++
++	/* Avoid creating (HW)Dirty=1, Write=0 PUDs */
++	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pud_write(pud))
++		dirty = _PAGE_SAVED_DIRTY;
++
++	return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
+ }
+ 
+ static inline pud_t pud_mkdevmap(pud_t pud)
+@@ -607,7 +697,11 @@ static inline pud_t pud_mkyoung(pud_t pud)
+ 
+ static inline pud_t pud_mkwrite(pud_t pud)
+ {
+-	return pud_set_flags(pud, _PAGE_RW);
++	pud = pud_set_flags(pud, _PAGE_RW);
++
++	if (pud_dirty(pud))
++		pud = pud_clear_saveddirty(pud);
++	return pud;
+ }
+ 
+ #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+@@ -724,6 +818,8 @@ static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
+ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
+ {
+ 	pteval_t val = pte_val(pte), oldval = val;
++	bool wr_protected;
++	pte_t pte_result;
+ 
+ 	/*
+ 	 * Chop off the NX bit (if present), and add the NX portion of
+@@ -732,17 +828,43 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
+ 	val &= _PAGE_CHG_MASK;
+ 	val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
+ 	val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
+-	return __pte(val);
++
++	pte_result = __pte(val);
++
++	/*
++	 * Do the saveddirty fixup if the PTE was just write protected and
++	 * it's dirty.
++	 */
++	wr_protected = (oldval & _PAGE_RW) && !(val & _PAGE_RW);
++	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && wr_protected &&
++	    (val & _PAGE_DIRTY))
++		pte_result = pte_mksaveddirty(pte_result);
++
++	return pte_result;
+ }
+ 
+ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
+ {
+ 	pmdval_t val = pmd_val(pmd), oldval = val;
++	bool wr_protected;
++	pmd_t pmd_result;
+ 
+-	val &= _HPAGE_CHG_MASK;
++	val &= (_HPAGE_CHG_MASK & ~_PAGE_DIRTY);
+ 	val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+ 	val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
+-	return __pmd(val);
++
++	pmd_result = __pmd(val);
++
++	/*
++	 * Do the saveddirty fixup if the PMD was just write protected and
++	 * it's dirty.
++	 */
++	wr_protected = (oldval & _PAGE_RW) && !(val & _PAGE_RW);
++	if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && wr_protected &&
++	    (val & _PAGE_DIRTY))
++		pmd_result = pmd_mksaveddirty(pmd_result);
++
++	return pmd_result;
+ }
+ 
+ /*
 -- 
 2.17.1
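
The rule the commit message describes (never leave a PTE in the
Write=0,Dirty=1 state unless a shadow stack PTE is intended) can be
illustrated with a small user-space model. This is only a sketch, not the
kernel code: the bit positions, the pteval type and every function name
below are hypothetical stand-ins, the X86_FEATURE_USER_SHSTK check is
assumed to be true, _PAGE_SOFT_DIRTY is left out, and modify_rw() reduces
pte_modify() to a single "new RW value" argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions for this sketch; not the kernel's definitions. */
#define PG_RW          (UINT64_C(1) << 1)
#define PG_DIRTY       (UINT64_C(1) << 6)
#define PG_SAVED_DIRTY (UINT64_C(1) << 58)
#define PG_DIRTY_BITS  (PG_DIRTY | PG_SAVED_DIRTY)

typedef uint64_t pteval;

/* Dirty in either the hardware bit or the software (saved) bit. */
static bool is_dirty(pteval v) { return (v & PG_DIRTY_BITS) != 0; }

/* Write=0,Dirty=1 is what the CPU treats as shadow stack memory. */
static bool is_shstk(pteval v) { return (v & (PG_RW | PG_DIRTY)) == PG_DIRTY; }

/* Shadow stack entries count as logically writable. */
static bool is_writable(pteval v) { return (v & PG_RW) || is_shstk(v); }

/* Assumed behavior: move hardware Dirty into the software bit. */
static pteval mk_saved_dirty(pteval v)
{
	if (v & PG_DIRTY)
		v = (v & ~PG_DIRTY) | PG_SAVED_DIRTY;
	return v;
}

/* Assumed behavior: move the software bit back into hardware Dirty. */
static pteval clear_saved_dirty(pteval v)
{
	if (v & PG_SAVED_DIRTY)
		v = (v & ~PG_SAVED_DIRTY) | PG_DIRTY;
	return v;
}

/* Clearing RW on a dirty entry must not leave Write=0,Dirty=1 behind. */
static pteval wrprotect(pteval v)
{
	v &= ~PG_RW;
	if (is_dirty(v))
		v = mk_saved_dirty(v);
	return v;
}

/* Dirtying a non-writable entry uses the software bit instead. */
static pteval mkdirty(pteval v)
{
	return v | (is_writable(v) ? PG_DIRTY : PG_SAVED_DIRTY);
}

/* Once RW is set, hardware Dirty is safe again. */
static pteval mkwrite(pteval v)
{
	v |= PG_RW;
	if (is_dirty(v))
		v = clear_saved_dirty(v);
	return v;
}

/* Simplified stand-in for pte_modify(): only the RW bit changes. */
static pteval modify_rw(pteval old, bool new_rw)
{
	pteval v = (old & ~PG_RW) | (new_rw ? PG_RW : 0);
	bool wr_protected = (old & PG_RW) && !(v & PG_RW);

	if (wr_protected && (v & PG_DIRTY))
		v = mk_saved_dirty(v);
	return v;
}

/* Walk through the transitions discussed in the commit message. */
int sketch_selfcheck(void)
{
	pteval v = wrprotect(PG_RW | PG_DIRTY);

	assert(is_dirty(v) && !is_shstk(v));
	assert(mkwrite(v) == (PG_RW | PG_DIRTY));
	assert(!is_shstk(mkdirty(0)));
	assert(!is_shstk(modify_rw(PG_RW | PG_DIRTY, false)));
	return 0;
}
```

Calling sketch_selfcheck() from a main() exercises the transitions. In this
model an entry that already is Write=0,Dirty=1 stays that way through
modify_rw(), because only a transition from RW=1 to RW=0 triggers the
fixup, as in the second case listed in the commit message.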
 