--- v9
+++ v8
@@ -1,176 +1,350 @@
-The x86 Control-flow Enforcement Technology (CET) feature includes a new
-type of memory called shadow stack. This shadow stack memory has some
-unusual properties, which requires some core mm changes to function
-properly.
-
-The architecture of shadow stack constrains the ability of userspace to
-move the shadow stack pointer (SSP) in order to prevent corrupting or
-switching to other shadow stacks. The RSTORSSP instruction can move the
-SSP to different shadow stacks, but it requires a specially placed token
-in order to do this. However, the architecture does not prevent
-incrementing the stack pointer to wander onto an adjacent shadow stack. To
-prevent this in software, enforce guard pages at the beginning of shadow
-stack VMAs, such that there will always be a gap between adjacent shadow
-stacks.
-
-Make the gap big enough so that no userspace SSP changing operations
-(besides RSTORSSP), can move the SSP from one stack to the next. The
-SSP can be incremented or decremented by CALL, RET and INCSSP. CALL and
-RET can move the SSP by a maximum of 8 bytes, at which point the shadow
-stack would be accessed.
-
-The INCSSP instruction can also increment the shadow stack pointer. It
-is the shadow stack analog of an instruction like:
-
- addq $0x80, %rsp
-
-However, there is one important difference between an ADD on %rsp and
-INCSSP. In addition to modifying SSP, INCSSP also reads from the memory
-of the first and last elements that were "popped". It can be thought of
-as acting like this:
-
-READ_ONCE(ssp); // read+discard top element on stack
-ssp += nr_to_pop * 8; // move the shadow stack
-READ_ONCE(ssp-8); // read+discard last popped stack element
-
-The maximum distance INCSSP can move the SSP is 2040 bytes, before it
-would read the memory. Therefore, a single page gap will be enough to
-prevent any operation from shifting the SSP to an adjacent stack, since
-it would have to land in the gap at least once, causing a fault.
-
-This could be accomplished by using VM_GROWSDOWN, but this has a
-downside. The behavior would allow shadow stacks to grow, which is
-unneeded and adds a strange difference to how most regular stacks work.
-
-In the maple tree code, there is some logic for retrying the unmapped
-area search if a guard gap is violated. This retry should happen for
-shadow stack guard gap violations as well. This logic currently only
-checks for VM_GROWSDOWN for start gaps. Since shadow stacks also have
-a start gap as well, create an new define VM_STARTGAP_FLAGS to hold
-all the VM flag bits that have start gaps, and make mmap use it.
+The recently introduced _PAGE_SAVED_DIRTY should be used instead of the
+HW Dirty bit whenever a PTE is Write=0, in order to not inadvertently
+create shadow stack PTEs. Update pte_mk*() helpers to do this, and apply
+the same changes to pmd and pud.
+
+For pte_modify() this is a bit trickier. It takes a "raw" pgprot_t which
+was not necessarily created with any of the existing PTE bit helpers.
+That means that it can return a pte_t with Write=0,Dirty=1, a shadow
+stack PTE, when it did not intend to create one.
+
+Modify it to also move _PAGE_DIRTY to _PAGE_SAVED_DIRTY. To avoid
+creating Write=0,Dirty=1 PTEs, pte_modify() needs to avoid:
+1. Marking Write=0 PTEs Dirty=1
+2. Marking Dirty=1 PTEs Write=0
+
+The first case cannot happen as the existing behavior of pte_modify() is to
+filter out any Dirty bit passed in newprot. Handle the second case by
+shifting _PAGE_DIRTY=1 to _PAGE_SAVED_DIRTY=1 if the PTE was write
+protected by the pte_modify() call. Apply the same changes to
+pmd_modify().
Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
-Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
Tested-by: Pengfei Xu <pengfei.xu@intel.com>
Tested-by: John Allen <john.allen@amd.com>
Tested-by: Kees Cook <keescook@chromium.org>
---
-v9:
- - Add logic needed to still have guard gaps with maple tree.
+v6:
+ - Rename _PAGE_COW to _PAGE_SAVED_DIRTY (David Hildenbrand)
+ - Open code _PAGE_SAVED_DIRTY part in pte_modify() (Boris)
+ - Change the logic so the open coded part is not too ugly
+ - Merge pte_modify() patch with this one because of the above
+
+v4:
+ - Break part patch for better bisectability
---
- include/linux/mm.h | 54 ++++++++++++++++++++++++++++++++++++++++------
- mm/mmap.c | 4 ++--
- 2 files changed, 50 insertions(+), 8 deletions(-)
-
-diff --git a/include/linux/mm.h b/include/linux/mm.h
-index fb17cbd531ac..535c58d3b2e4 100644
---- a/include/linux/mm.h
-+++ b/include/linux/mm.h
-@@ -342,7 +342,36 @@ extern unsigned int kobjsize(const void *objp);
- #endif /* CONFIG_ARCH_HAS_PKEYS */
-
- #ifdef CONFIG_X86_USER_SHADOW_STACK
--# define VM_SHADOW_STACK VM_HIGH_ARCH_5 /* Should not be set with VM_SHARED */
-+/*
-+ * This flag should not be set with VM_SHARED because of lack of support
-+ * core mm. It will also get a guard page. This helps userspace protect
-+ * itself from attacks. The reasoning is as follows:
-+ *
-+ * The shadow stack pointer(SSP) is moved by CALL, RET, and INCSSPQ. The
-+ * INCSSP instruction can increment the shadow stack pointer. It is the
-+ * shadow stack analog of an instruction like:
-+ *
-+ * addq $0x80, %rsp
-+ *
-+ * However, there is one important difference between an ADD on %rsp
-+ * and INCSSP. In addition to modifying SSP, INCSSP also reads from the
-+ * memory of the first and last elements that were "popped". It can be
-+ * thought of as acting like this:
-+ *
-+ * READ_ONCE(ssp); // read+discard top element on stack
-+ * ssp += nr_to_pop * 8; // move the shadow stack
-+ * READ_ONCE(ssp-8); // read+discard last popped stack element
-+ *
-+ * The maximum distance INCSSP can move the SSP is 2040 bytes, before
-+ * it would read the memory. Therefore a single page gap will be enough
-+ * to prevent any operation from shifting the SSP to an adjacent stack,
-+ * since it would have to land in the gap at least once, causing a
-+ * fault.
-+ *
-+ * Prevent using INCSSP to move the SSP between shadow stacks by
-+ * having a PAGE_SIZE guard gap.
-+ */
-+# define VM_SHADOW_STACK VM_HIGH_ARCH_5
- #else
- # define VM_SHADOW_STACK VM_NONE
- #endif
-@@ -405,6 +434,8 @@ extern unsigned int kobjsize(const void *objp);
- #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
- #endif
-
-+#define VM_STARTGAP_FLAGS (VM_GROWSDOWN | VM_SHADOW_STACK)
-+
- #ifdef CONFIG_STACK_GROWSUP
- #define VM_STACK VM_GROWSUP
- #else
-@@ -3235,15 +3266,26 @@ struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
- return mtree_load(&mm->mm_mt, addr);
- }
-
-+static inline unsigned long stack_guard_start_gap(struct vm_area_struct *vma)
+ arch/x86/include/asm/pgtable.h | 168 ++++++++++++++++++++++++++++-----
+ 1 file changed, 145 insertions(+), 23 deletions(-)
+
+diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
+index 349fcab0405a..05dfdbdf96b4 100644
+--- a/arch/x86/include/asm/pgtable.h
++++ b/arch/x86/include/asm/pgtable.h
+@@ -124,9 +124,17 @@ extern pmdval_t early_pmd_flags;
+ * The following only work if pte_present() is true.
+ * Undefined behaviour if not..
+ */
+-static inline int pte_dirty(pte_t pte)
++static inline bool pte_dirty(pte_t pte)
+ {
+- return pte_flags(pte) & _PAGE_DIRTY;
++ return pte_flags(pte) & _PAGE_DIRTY_BITS;
++}
++
++static inline bool pte_shstk(pte_t pte)
+{
-+ if (vma->vm_flags & VM_GROWSDOWN)
-+ return stack_guard_gap;
-+
-+ /* See reasoning around the VM_SHADOW_STACK definition */
-+ if (vma->vm_flags & VM_SHADOW_STACK)
-+ return PAGE_SIZE;
-+
-+ return 0;
++ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
++ return false;
++
++ return (pte_flags(pte) & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_DIRTY;
+ }
+
+ static inline int pte_young(pte_t pte)
+@@ -134,9 +142,18 @@ static inline int pte_young(pte_t pte)
+ return pte_flags(pte) & _PAGE_ACCESSED;
+ }
+
+-static inline int pmd_dirty(pmd_t pmd)
++static inline bool pmd_dirty(pmd_t pmd)
+ {
+- return pmd_flags(pmd) & _PAGE_DIRTY;
++ return pmd_flags(pmd) & _PAGE_DIRTY_BITS;
+}
+
- static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
- {
-+ unsigned long gap = stack_guard_start_gap(vma);
- unsigned long vm_start = vma->vm_start;
-
-- if (vma->vm_flags & VM_GROWSDOWN) {
-- vm_start -= stack_guard_gap;
-- if (vm_start > vma->vm_start)
-- vm_start = 0;
-- }
-+ vm_start -= gap;
-+ if (vm_start > vma->vm_start)
-+ vm_start = 0;
- return vm_start;
- }
-
-diff --git a/mm/mmap.c b/mm/mmap.c
-index afdf5f78432b..d4793600a8d4 100644
---- a/mm/mmap.c
-+++ b/mm/mmap.c
-@@ -1570,7 +1570,7 @@ static unsigned long unmapped_area(struct vm_unmapped_area_info *info)
- gap = mas.index;
- gap += (info->align_offset - gap) & info->align_mask;
- tmp = mas_next(&mas, ULONG_MAX);
-- if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possible */
-+ if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
- if (vm_start_gap(tmp) < gap + length - 1) {
- low_limit = tmp->vm_end;
- mas_reset(&mas);
-@@ -1622,7 +1622,7 @@ static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
- gap -= (gap - info->align_offset) & info->align_mask;
- gap_end = mas.last;
- tmp = mas_next(&mas, ULONG_MAX);
-- if (tmp && (tmp->vm_flags & VM_GROWSDOWN)) { /* Avoid prev check if possible */
-+ if (tmp && (tmp->vm_flags & VM_STARTGAP_FLAGS)) { /* Avoid prev check if possible */
- if (vm_start_gap(tmp) <= gap_end) {
- high_limit = vm_start_gap(tmp);
- mas_reset(&mas);
++static inline bool pmd_shstk(pmd_t pmd)
++{
++ if (!cpu_feature_enabled(X86_FEATURE_USER_SHSTK))
++ return false;
++
++ return (pmd_flags(pmd) & (_PAGE_RW | _PAGE_DIRTY | _PAGE_PSE)) ==
++ (_PAGE_DIRTY | _PAGE_PSE);
+ }
+
+ #define pmd_young pmd_young
+@@ -145,9 +162,9 @@ static inline int pmd_young(pmd_t pmd)
+ return pmd_flags(pmd) & _PAGE_ACCESSED;
+ }
+
+-static inline int pud_dirty(pud_t pud)
++static inline bool pud_dirty(pud_t pud)
+ {
+- return pud_flags(pud) & _PAGE_DIRTY;
++ return pud_flags(pud) & _PAGE_DIRTY_BITS;
+ }
+
+ static inline int pud_young(pud_t pud)
+@@ -157,13 +174,21 @@ static inline int pud_young(pud_t pud)
+
+ static inline int pte_write(pte_t pte)
+ {
+- return pte_flags(pte) & _PAGE_RW;
++ /*
++ * Shadow stack pages are logically writable, but do not have
++ * _PAGE_RW. Check for them separately from _PAGE_RW itself.
++ */
++ return (pte_flags(pte) & _PAGE_RW) || pte_shstk(pte);
+ }
+
+ #define pmd_write pmd_write
+ static inline int pmd_write(pmd_t pmd)
+ {
+- return pmd_flags(pmd) & _PAGE_RW;
++ /*
++ * Shadow stack pages are logically writable, but do not have
++ * _PAGE_RW. Check for them separately from _PAGE_RW itself.
++ */
++ return (pmd_flags(pmd) & _PAGE_RW) || pmd_shstk(pmd);
+ }
+
+ #define pud_write pud_write
+@@ -342,7 +367,16 @@ static inline pte_t pte_clear_saveddirty(pte_t pte)
+
+ static inline pte_t pte_wrprotect(pte_t pte)
+ {
+- return pte_clear_flags(pte, _PAGE_RW);
++ pte = pte_clear_flags(pte, _PAGE_RW);
++
++ /*
++ * Blindly clearing _PAGE_RW might accidentally create
++ * a shadow stack PTE (Write=0,Dirty=1). Move the hardware
++ * dirty value to the software bit.
++ */
++ if (pte_dirty(pte))
++ pte = pte_mksaveddirty(pte);
++ return pte;
+ }
+
+ #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+@@ -380,7 +414,7 @@ static inline pte_t pte_clear_uffd_wp(pte_t pte)
+
+ static inline pte_t pte_mkclean(pte_t pte)
+ {
+- return pte_clear_flags(pte, _PAGE_DIRTY);
++ return pte_clear_flags(pte, _PAGE_DIRTY_BITS);
+ }
+
+ static inline pte_t pte_mkold(pte_t pte)
+@@ -395,7 +429,19 @@ static inline pte_t pte_mkexec(pte_t pte)
+
+ static inline pte_t pte_mkdirty(pte_t pte)
+ {
+- return pte_set_flags(pte, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
++ pteval_t dirty = _PAGE_DIRTY;
++
++ /* Avoid creating Dirty=1,Write=0 PTEs */
++ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pte_write(pte))
++ dirty = _PAGE_SAVED_DIRTY;
++
++ return pte_set_flags(pte, dirty | _PAGE_SOFT_DIRTY);
++}
++
++static inline pte_t pte_mkwrite_shstk(pte_t pte)
++{
++ /* pte_clear_saveddirty() also sets Dirty=1 */
++ return pte_clear_saveddirty(pte);
+ }
+
+ static inline pte_t pte_mkyoung(pte_t pte)
+@@ -412,7 +458,12 @@ struct vm_area_struct;
+
+ static inline pte_t pte_mkwrite(pte_t pte, struct vm_area_struct *vma)
+ {
+- return pte_mkwrite_kernel(pte);
++ pte = pte_mkwrite_kernel(pte);
++
++ if (pte_dirty(pte))
++ pte = pte_clear_saveddirty(pte);
++
++ return pte;
+ }
+
+ static inline pte_t pte_mkhuge(pte_t pte)
+@@ -481,7 +532,15 @@ static inline pmd_t pmd_clear_saveddirty(pmd_t pmd)
+
+ static inline pmd_t pmd_wrprotect(pmd_t pmd)
+ {
+- return pmd_clear_flags(pmd, _PAGE_RW);
++ pmd = pmd_clear_flags(pmd, _PAGE_RW);
++ /*
++ * Blindly clearing _PAGE_RW might accidentally create
++ * a shadow stack PMD (RW=0, Dirty=1). Move the hardware
++ * dirty value to the software bit.
++ */
++ if (pmd_dirty(pmd))
++ pmd = pmd_mksaveddirty(pmd);
++ return pmd;
+ }
+
+ #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
+@@ -508,12 +567,23 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
+
+ static inline pmd_t pmd_mkclean(pmd_t pmd)
+ {
+- return pmd_clear_flags(pmd, _PAGE_DIRTY);
++ return pmd_clear_flags(pmd, _PAGE_DIRTY_BITS);
+ }
+
+ static inline pmd_t pmd_mkdirty(pmd_t pmd)
+ {
+- return pmd_set_flags(pmd, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
++ pmdval_t dirty = _PAGE_DIRTY;
++
++ /* Avoid creating (HW)Dirty=1, Write=0 PMDs */
++ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pmd_write(pmd))
++ dirty = _PAGE_SAVED_DIRTY;
++
++ return pmd_set_flags(pmd, dirty | _PAGE_SOFT_DIRTY);
++}
++
++static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
++{
++ return pmd_clear_saveddirty(pmd);
+ }
+
+ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+@@ -533,7 +603,12 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
+
+ static inline pmd_t pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
+ {
+- return pmd_set_flags(pmd, _PAGE_RW);
++ pmd = pmd_set_flags(pmd, _PAGE_RW);
++
++ if (pmd_dirty(pmd))
++ pmd = pmd_clear_saveddirty(pmd);
++
++ return pmd;
+ }
+
+ static inline pud_t pud_set_flags(pud_t pud, pudval_t set)
+@@ -577,17 +652,32 @@ static inline pud_t pud_mkold(pud_t pud)
+
+ static inline pud_t pud_mkclean(pud_t pud)
+ {
+- return pud_clear_flags(pud, _PAGE_DIRTY);
++ return pud_clear_flags(pud, _PAGE_DIRTY_BITS);
+ }
+
+ static inline pud_t pud_wrprotect(pud_t pud)
+ {
+- return pud_clear_flags(pud, _PAGE_RW);
++ pud = pud_clear_flags(pud, _PAGE_RW);
++
++ /*
++ * Blindly clearing _PAGE_RW might accidentally create
++ * a shadow stack PUD (RW=0, Dirty=1). Move the hardware
++ * dirty value to the software bit.
++ */
++ if (pud_dirty(pud))
++ pud = pud_mksaveddirty(pud);
++ return pud;
+ }
+
+ static inline pud_t pud_mkdirty(pud_t pud)
+ {
+- return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
++ pudval_t dirty = _PAGE_DIRTY;
++
++ /* Avoid creating (HW)Dirty=1, Write=0 PUDs */
++ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && !pud_write(pud))
++ dirty = _PAGE_SAVED_DIRTY;
++
++ return pud_set_flags(pud, dirty | _PAGE_SOFT_DIRTY);
+ }
+
+ static inline pud_t pud_mkdevmap(pud_t pud)
+@@ -607,7 +697,11 @@ static inline pud_t pud_mkyoung(pud_t pud)
+
+ static inline pud_t pud_mkwrite(pud_t pud)
+ {
+- return pud_set_flags(pud, _PAGE_RW);
++ pud = pud_set_flags(pud, _PAGE_RW);
++
++ if (pud_dirty(pud))
++ pud = pud_clear_saveddirty(pud);
++ return pud;
+ }
+
+ #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+@@ -724,6 +818,8 @@ static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
+ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
+ {
+ pteval_t val = pte_val(pte), oldval = val;
++ bool wr_protected;
++ pte_t pte_result;
+
+ /*
+ * Chop off the NX bit (if present), and add the NX portion of
+@@ -732,17 +828,43 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
+ val &= _PAGE_CHG_MASK;
+ val |= check_pgprot(newprot) & ~_PAGE_CHG_MASK;
+ val = flip_protnone_guard(oldval, val, PTE_PFN_MASK);
+- return __pte(val);
++
++ pte_result = __pte(val);
++
++ /*
++ * Do the saveddirty fixup if the PTE was just write protected and
++ * it's dirty.
++ */
++ wr_protected = (oldval & _PAGE_RW) && !(val & _PAGE_RW);
++ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && wr_protected &&
++ (val & _PAGE_DIRTY))
++ pte_result = pte_mksaveddirty(pte_result);
++
++ return pte_result;
+ }
+
+ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
+ {
+ pmdval_t val = pmd_val(pmd), oldval = val;
++ bool wr_protected;
++ pmd_t pmd_result;
+
+- val &= _HPAGE_CHG_MASK;
++ val &= (_HPAGE_CHG_MASK & ~_PAGE_DIRTY);
+ val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
+ val = flip_protnone_guard(oldval, val, PHYSICAL_PMD_PAGE_MASK);
+- return __pmd(val);
++
++ pmd_result = __pmd(val);
++
++ /*
++ * Do the saveddirty fixup if the PMD was just write protected and
++ * it's dirty.
++ */
++ wr_protected = (oldval & _PAGE_RW) && !(val & _PAGE_RW);
++ if (cpu_feature_enabled(X86_FEATURE_USER_SHSTK) && wr_protected &&
++ (val & _PAGE_DIRTY))
++ pmd_result = pmd_mksaveddirty(pmd_result);
++
++ return pmd_result;
+ }
+
+ /*
--
-2.34.1
-
+2.17.1
+