--- v5
+++ v8
@@ -1,146 +1,350 @@
-From: Yu-cheng Yu <yu-cheng.yu@intel.com>
-
-The CPU performs "shadow stack accesses" when it expects to encounter
-shadow stack mappings. These accesses can be implicit (via CALL/RET
-instructions) or explicit (instructions like WRSS).
-
-Shadow stack accesses to shadow-stack mappings can result in faults in
-normal, valid operation just like regular accesses to regular mappings.
-Shadow stacks need some of the same features like delayed allocation, swap
-and copy-on-write. The kernel needs to use faults to implement those
-features.
-
-The architecture has concepts of both shadow stack reads and shadow stack
-writes. Any shadow stack access to non-shadow stack memory will generate
-a fault with the shadow stack error code bit set.
-
-This means that, unlike normal write protection, the fault handler needs
-to create a type of memory that can be written to (with instructions that
-generate shadow stack writes), even to fulfill a read access. So in the
-case of COW memory, the COW needs to take place even with a shadow stack
-read. Otherwise the page will be left (shadow stack) writable in
-userspace. So to trigger the appropriate behavior, set FAULT_FLAG_WRITE
-for shadow stack accesses, even if the access was a shadow stack read.
-
-For the purpose of making this clearer, consider the following example.
-If a process has a shadow stack, and forks, the shadow stack PTEs will
-become read-only due to COW. If the CPU in one process performs a shadow
-stack read access to the shadow stack, for example executing a RET and
-causing the CPU to read the shadow stack copy of the return address, then
-in order for the fault to be resolved the PTE will need to be set with
-shadow stack permissions. But then the memory would be changeable from
-userspace (from CALL, RET, WRSS, etc). So this scenario needs to trigger
-COW, otherwise the shared page would be changeable from both processes.
-
-Shadow stack accesses can also result in errors, such as when a shadow
-stack overflows, or if a shadow stack access occurs to a non-shadow-stack
-mapping. Also, generate the errors for invalid shadow stack accesses.
-
+The recently introduced _PAGE_SAVED_DIRTY should be used instead of the
+HW Dirty bit whenever a PTE is Write=0, in order to not inadvertently
+create shadow stack PTEs. Update pte_mk*() helpers to do this, and apply
+the same changes to pmd and pud.
+
+For pte_modify() this is a bit trickier. It takes a "raw" pgprot_t which
+was not necessarily created with any of the existing PTE bit helpers.
+That means that it can return a pte_t with Write=0,Dirty=1, a shadow
+stack PTE, when it did not intend to create one.
+
+Modify pte_modify() to also move _PAGE_DIRTY to _PAGE_SAVED_DIRTY. To
+avoid creating Write=0,Dirty=1 PTEs, pte_modify() needs to avoid:
+1. Marking Write=0 PTEs Dirty=1
+2. Marking Dirty=1 PTEs Write=0
+
+The first case cannot happen because pte_modify() already filters out any
+Dirty bit passed in newprot (newprot is masked with ~_PAGE_CHG_MASK).
+Handle the second case by shifting _PAGE_DIRTY=1 to _PAGE_SAVED_DIRTY=1
+if the pte_modify() call write-protected the PTE. Apply the same changes
+to pmd_modify().
+
+Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
+Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
+Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
+Reviewed-by: Kees Cook <keescook@chromium.org>
+Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
-Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
-Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
-Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
+Tested-by: Kees Cook <keescook@chromium.org>
---
-
-v5:
- - Add description of COW example (Boris)
- - Replace "permissioned" (Boris)
- - Remove capitalization of shadow stack (Boris)
+v6:
+ - Rename _PAGE_COW to _PAGE_SAVED_DIRTY (David Hildenbrand)
+ - Open code _PAGE_SAVED_DIRTY part in pte_modify() (Boris)
+ - Change the logic so the open coded part is not too ugly
+ - Merge pte_modify() patch with this one because of the above
v4:
- - Further improve comment talking about FAULT_FLAG_WRITE (Peterz)
-
-v3:
- - Improve comment talking about using FAULT_FLAG_WRITE (Peterz)
-
-v2:
- - Update commit log with verbiage/feedback from Dave Hansen
- - Clarify reasoning for FAULT_FLAG_WRITE for all shadow stack accesses
- - Update comments with some verbiage from Dave Hansen
-
- arch/x86/include/asm/trap_pf.h | 2 ++
- arch/x86/mm/fault.c | 38 ++++++++++++++++++++++++++++++++++
- 2 files changed, 40 insertions(+)
-
-diff --git a/arch/x86/include/asm/trap_pf.h b/arch/x86/include/asm/trap_pf.h
-index 10b1de500ab1..afa524325e55 100644
---- a/arch/x86/include/asm/trap_pf.h
-+++ b/arch/x86/include/asm/trap_pf.h
-@@ -11,6 +11,7 @@
- * bit 3 == 1: use of reserved bit detected
- * bit 4 == 1: fault was an instruction fetch
- * bit 5 == 1: protection keys block access
-+ * bit 6 == 1: shadow stack access fault
- * bit 15 == 1: SGX MMU page-fault
+ - Break apart patch for better bisectability
+---
+ arch/x86/include/asm/pgtable.h | 168 ++++++++++++++++++++++++++++-----
+ 1 file changed, 145 insertions(+), 23 deletions(-)
+
+diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
+index 349fcab0405a..05dfdbdf96b4 100644
+--- a/arch/x86/include/asm/pgtable.h
++++ b/arch/x86/include/asm/pgtable.h
+@@ -124,9 +124,17 @@ extern pmdval_t early_pmd_flags;
+ * The following only work if pte_present() is true.
+ * Undefined behaviour if not..
*/
- enum x86_pf_error_code {
-@@ -20,6 +21,7 @@ enum x86_pf_error_code {
- X86_PF_RSVD = 1 << 3,
- X86_PF_INSTR = 1 << 4,
- X86_PF_PK = 1 << 5,
-+ X86_PF_SHSTK = 1 << 6,
- X86_PF_SGX = 1 << 15,
- };
-
-diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
-index 7b0d4ab894c8..070b50c87415 100644
---- a/arch/x86/mm/fault.c
-+++ b/arch/x86/mm/fault.c
-@@ -1138,8 +1138,22 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
- (error_code & X86_PF_INSTR), foreign))
- return 1;
-
-+ /*
-+ * Shadow stack accesses (PF_SHSTK=1) are only permitted to
-+ * shadow stack VMAs. All other accesses result in an error.
-+ */
-+ if (error_code & X86_PF_SHSTK) {
-+ if (unlikely(!(vma->vm_flags & VM_SHADOW_STACK)))
-+ return 1;
-+ if (unlikely(!(vma->vm_flags & VM_WRITE)))
-+ return 1;
-+ return 0;
-+ }
-+
- if (error_code & X86_PF_WRITE) {
- /* write, present and write, not present: */
-+ if (unlikely(vma->vm_flags & VM_SHADOW_STACK))
-+ return 1;
- if (unlikely(!(vma->vm_flags & VM_WRITE)))
- return 1;
- return 0;
-@@ -1331,6 +1345,30 @@ void do_user_addr_fault(struct pt_regs *regs,
-
- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
-+ /*
-+ * When a page becomes COW it changes from a shadow stack permission
-+ * page (Write=0,Dirty=1) to (Write=0,Dirty=0,CoW=1), which is simply
-+ * read-only to the CPU. When shadow stack is enabled, a RET would
-+ * normally pop the shadow stack by reading it with a "shadow stack
-+ * read" access. However, in the COW case the shadow stack memory does
-+ * not have shadow stack permissions, it is read-only. So it will
-+ * generate a fault.
-+ *
-+ * For conventionally writable pages, a read can be serviced with a
-+ * read only PTE, and COW would not have to happen. But for shadow
-+ * stack, there isn't the concept of read-only shadow stack memory.
-+ * If it is shadow stack permission, it can be modified via CALL and
-+ * RET instructions. So COW needs to happen before any memory can be
-+ * mapped with shadow stack permissions.
-+ *
-+ * Shadow stack accesses (read or write) need to be serviced with
-+ * shadow stack permission memory, so in the case of a shadow stack
-+ * read access, treat it as a WRITE fault so both COW will happen and
-+ * the write fault path will tickle maybe_mkwrite() and map the memory
-+ * shadow stack.
-+ */
-+ if (error_code & X86_PF_SHSTK)
-+ flags |= FAULT_FLAG_WRITE;
- if (error_code & X86_PF_WRITE)
- flags |= FAULT_FLAG_WRITE;
- if (error_code & X86_PF_INSTR)
+-static inline int pte_dirty(pte_t pte)
++static inline bool pte_dirty(pte_t pte)
+ {
+- return pte_flags(pte) & _PAGE_DIRTY;
++ return pte_flags(pte) & _PAGE_DIRTY_BITS;
++}
++
++static inline bool pte_shstk(pte_t pte)
++{
++ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
++ return false;
++
++ return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
+ }
+
+ static inline int pte_young(pte_t pte)
+@@ -134,9 +142,18 @@ static inline int pte_young(pte_t pte)
+ return pte_flags(pte) & _PAGE_ACCESSED;
+ }
+
+-static inline int pmd_dirty(pmd_t pmd)
++static inline bool pmd_dirty(pmd_t pmd)
+ {
+- return pmd_flags(pmd) & _PAGE_DIRTY;
++ return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
++}
++
++static inline bool pmd_shstk(pmd_t pmd)
++{
++ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
++ return false;
++
++ return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) ==
++ (_PAGE_DIRTY | _PAGE_PSE);
+ }
+
+ #define pmd_young pmd_young
+@@ -145,9 +162,9 @@ static inline int pmd_young(pmd_t pmd)
+ return pmd_flags(pmd) & _PAGE_ACCESSED;
+ }
+
+-static inline int pud_dirty(pud_t pud)
++static inline bool pud_dirty(pud_t pud)
+ {
+- return pud_flags(pud) & _PAGE_DIRTY;
++ return pud_flags(pud) & _PAGE_DIRTY_BITS;
+ }
+
+ static inline int pud_young(pud_t pud)
+@@ -157,13 +174,21 @@ static inline int pud_young(pud_t pud)
+
+ static inline int pte_write(pte_t pte)
+ {
+- return pte_flags(pte) & _PAGE_RW;
++ /*
++ * Shadow stack pages are logically writable, but do not have
++ * _PAGE_RW. Check for them separately from _PAGE_RW itself.
++ */
++ return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
+ }
+
+ #define pmd_write pmd_write
+ static inline int pmd_write(pmd_t pmd)
+ {
+- return pmd_flags(pmd) & _PAGE_RW;
++ /*
++ * Shadow stack pages are logically writable, but do not have
++ * _PAGE_RW. Check for them separately from _PAGE_RW itself.
++ */
++ return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
+ }
+
+ #define pud_write pud_write
+@@ -342,7 +367,16 @@ static inline pte_t pte_clear_saveddirty(pte_t pte)
+
+ static inline pte_t pte_wrprotect(pte_t pte)
+ {
+- return pte_clear_flags(pte, _PAGE_RW);
++ pte = pte_clear_flags(pte, _PAGE_RW);
++
++ /*
++ * Blindly clearing _PAGE_RW might accidentally create
++ * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
++ * dirty value to the software bit.
++ */
++ if (pte_dirty(pte))
++ pte = pte_mksaveddirty(pte);
++ return pte;
+ }
+
+ #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+@@ -380,7 +414,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
+
+ static inline pte_t pte_mkclean(pte_t pte)
+ {
+- return pte_clear_flags(pte, _PAGE_DIRTY);
++ return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
+ }
+
+ static inline pte_t pte_mkold(pte_t pte)
+@@ -395,7 +429,19 @@ static inline pte_t pte_mkexec(pte_t pte)
+
+ static inline pte_t pte_mkdirty(pte_t pte)
+ {
+- return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
++ pteval_t dirty = _PAGE_DIRTY;
++
++ /* Avoid creating Dirty=1,Write=0 PTEs */
++ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pte_write(pte))
++ dirty = _PAGE_SAVED_DIRTY;
++
++ return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
++}
++
++static inline pte_t pte_mkwrite_shstk(pte_t pte)
++{
++ /* pte_clear_saveddirty() also sets Dirty=1 */
++ return pte_clear_saveddirty(pte);
+ }
+
+ static inline pte_t pte_mkyoung(pte_t pte)
+@@ -412,7 +458,12 @@ struct vm_area_struct;
+
+ static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+ {
+- return pte_mkwrite_kernel(pte);
++ pte = pte_mkwrite_kernel(pte);
++
++ if (pte_dirty(pte))
++ pte = pte_clear_saveddirty(pte);
++
++ return pte;
+ }
+
+ static inline pte_t pte_mkhuge(pte_t pte)
+@@ -481,7 +532,15 @@ static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
+
+ static inline pmd_t pmd_wrprotect(pmd_t pmd)
+ {
+- return pmd_clear_flags(pmd, _PAGE_RW);
++ pmd = pmd_clear_flags(pmd, _PAGE_RW);
++ /*
++ * Blindly clearing _PAGE_RW might accidentally create
++ * a shadow stack PMD (RW=0, Dirty=1). Move the hardware
++ * dirty value to the software bit.
++ */
++ if (pmd_dirty(pmd))
++ pmd = pmd_mksaveddirty(pmd);
++ return pmd;
+ }
+
+ #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+@@ -508,12 +567,23 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
+
+ static inline pmd_t pmd_mkclean(pmd_t pmd)
+ {
+- return pmd_clear_flags(pmd, _PAGE_DIRTY);
++ return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
+ }
+
+ static inline pmd_t pmd_mkdirty(pmd_t pmd)
+ {
+- return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
++ pmdval_t dirty = _PAGE_DIRTY;
++
++ /* Avoid creating (HW)Dirty=1, Write=0 PMDs */
++ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pmd_write(pmd))
++ dirty = _PAGE_SAVED_DIRTY;
++
++ return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
++}
++
++static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
++{
++ return pmd_clear_saveddirty(pmd);
+ }
+
+ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+@@ -533,7 +603,12 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
+
+ static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+ {
+- return pmd_set_flags(pmd, _PAGE_RW);
++ pmd = pmd_set_flags(pmd, _PAGE_RW);
++
++ if (pmd_dirty(pmd))
++ pmd = pmd_clear_saveddirty(pmd);
++
++ return pmd;
+ }
+
+ static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
+@@ -577,17 +652,32 @@ static inline pud_t pud_mkold(pud_t pud)
+
+ static inline pud_t pud_mkclean(pud_t pud)
+ {
+- return pud_clear_flags(pud, _PAGE_DIRTY);
++ return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
+ }
+
+ static inline pud_t pud_wrprotect(pud_t pud)
+ {
+- return pud_clear_flags(pud, _PAGE_RW);
++ pud = pud_clear_flags(pud, _PAGE_RW);
++
++ /*
++ * Blindly clearing _PAGE_RW might accidentally create
++ * a shadow stack PUD (RW=0, Dirty=1). Move the hardware
++ * dirty value to the software bit.
++ */
++ if (pud_dirty(pud))
++ pud = pud_mksaveddirty(pud);
++ return pud;
+ }
+
+ static inline pud_t pud_mkdirty(pud_t pud)
+ {
+- return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
++ pudval_t dirty = _PAGE_DIRTY;
++
++ /* Avoid creating (HW)Dirty=1, Write=0 PUDs */
++ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pud_write(pud))
++ dirty = _PAGE_SAVED_DIRTY;
++
++ return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
+ }
+
+ static inline pud_t pud_mkdevmap(pud_t pud)
+@@ -607,7 +697,11 @@ static inline pud_t pud_mkyoung(pud_t pud)
+
+ static inline pud_t pud_mkwrite(pud_t pud)
+ {
+- return pud_set_flags(pud, _PAGE_RW);
++ pud = pud_set_flags(pud, _PAGE_RW);
++
++ if (pud_dirty(pud))
++ pud = pud_clear_saveddirty(pud);
++ return pud;
+ }
+
+ #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+@@ -724,6 +818,8 @@ static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
+ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
+ {
+ pteval_t val = pte_val(pte), oldval = val;
++ bool wr_protected;
++ pte_t pte_result;
+
+ /*
+ * Chop off the NX bit (if present), and add the NX portion of
+@@ -732,17 +828,43 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
+ val &= _PAGE_CHG_MASK;
+ val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
+ val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
+- return __pte(val);
++
++ pte_result = __pte(val);
++
++ /*
++ * Do the saveddirty fixup if the PTE was just write protected and
++ * it's dirty.
++ */
++ wr_protected = (oldval & _PAGE_RW) && !(val & _PAGE_RW);
++ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && wr_protected &&
++ (val & _PAGE_DIRTY))
++ pte_result = pte_mksaveddirty(pte_result);
++
++ return pte_result;
+ }
+
+ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
+ {
+ pmdval_t val = pmd_val(pmd), oldval = val;
++ bool wr_protected;
++ pmd_t pmd_result;
+
+- val &= _HPAGE_CHG_MASK;
++ val &= (_HPAGE_CHG_MASK & ~_PAGE_DIRTY);
+ val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+ val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
+- return __pmd(val);
++
++ pmd_result = __pmd(val);
++
++ /*
++ * Do the saveddirty fixup if the PMD was just write protected and
++ * it's dirty.
++ */
++ wr_protected = (oldval & _PAGE_RW) && !(val & _PAGE_RW);
++ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && wr_protected &&
++ (val & _PAGE_DIRTY))
++ pmd_result = pmd_mksaveddirty(pmd_result);
++
++ return pmd_result;
+ }
+
+ /*
--
2.17.1