--- v5
+++ v8
@@ -1,113 +1,71 @@
-The x86 Control-flow Enforcement Technology (CET) feature includes a new
-type of memory called shadow stack. This shadow stack memory has some
-unusual properties, which require some core mm changes to function
-properly.
+The x86 Control-flow Enforcement Technology (CET) feature includes a
+new type of memory called shadow stack. This shadow stack memory has
+some unusual properties, which requires some core mm changes to
+function properly.
-One of the properties is that the shadow stack pointer (SSP), which is a
-CPU register that points to the shadow stack like the stack pointer points
-to the stack, can't be pointing outside of the 32 bit address space when
-the CPU is executing in 32 bit mode. It is desirable to prevent executing
-in 32 bit mode when shadow stack is enabled because the kernel can't easily
-support 32 bit signals.
+In userspace, shadow stack memory is writable only in very specific,
+controlled ways. However, since userspace can, even in the limited
+ways, modify shadow stack contents, the kernel treats it as writable
+memory. As a result, without additional work there would remain many
+ways for userspace to trigger the kernel to write arbitrary data to
+shadow stacks via get_user_pages(, FOLL_WRITE) based operations. To
+help userspace protect their shadow stacks, make this a little less
+exposed by blocking writable get_user_pages() operations for shadow
+stack VMAs.
-On x86 it is possible to transition to 32 bit mode without any special
-interaction with the kernel, by doing a "far call" to a 32 bit segment.
-So the shadow stack implementation can use this address space behavior
-as a feature, by enforcing that shadow stack memory is always crated
-outside of the 32 bit address space. This way userspace will trigger a
-general protection fault which will in turn trigger a segfault if it
-tries to transition to 32 bit mode with shadow stack enabled.
-
-This provides a clean error generating border for the user if they try
-attempt to do 32 bit mode shadow stack, rather than leave the kernel in a
-half working state for userspace to be surprised by.
-
-So to allow future shadow stack enabling patches to map shadow stacks
-out of the 32 bit address space, introduce MAP_ABOVE4G. The behavior
-is pretty much like MAP_32BIT, except that it has the opposite address
-range. The are a few differences though.
-
-If both MAP_32BIT and MAP_ABOVE4G are provided, the kernel will use the
-MAP_ABOVE4G behavior. Like MAP_32BIT, MAP_ABOVE4G is ignored in a 32 bit
-syscall.
-
-Since the default search behavior is top down, the normal kaslr base can
-be used for MAP_ABOVE4G. This is unlike MAP_32BIT which has to add it's
-own randomization in the bottom up case.
-
-For MAP_32BIT, only the bottom up search path is used. For MAP_ABOVE4G
-both are potentially valid, so both are used. In the bottomup search
-path, the default behavior is already consistent with MAP_ABOVE4G since
-mmap base should be above 4GB.
-
-Without MAP_ABOVE4G, the shadow stack will already normally be above 4GB.
-So without introducing MAP_ABOVE4G, trying to transition to 32 bit mode
-with shadow stack enabled would usually segfault anyway. This is already
-pretty decent guard rails. But the addition of MAP_ABOVE4G is some small
-complexity spent to make it make it more complete.
+Still allow FOLL_FORCE to write through shadow stack protections, as it
+does for read-only protections. This is required for debugging use
+cases.
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
+Reviewed-by: Kees Cook <keescook@chromium.org>
+Acked-by: Mike Rapoport (IBM) <rppt@kernel.org>
+Acked-by: David Hildenbrand <david@redhat.com>
+Tested-by: Pengfei Xu <pengfei.xu@intel.com>
+Tested-by: John Allen <john.allen@amd.com>
+Tested-by: Kees Cook <keescook@chromium.org>
---
+v8:
+ - Update commit log verbiage (Boris, AndyL)
-v5:
- - New patch
+v3:
+ - Add comment in __pte_access_permitted() (Dave)
+ - Remove unneeded shadow stack specific check in
+ __pte_access_permitted() (Jann)
+---
+ arch/x86/include/asm/pgtable.h | 5 +++++
+ mm/gup.c | 2 +-
+ 2 files changed, 6 insertions(+), 1 deletion(-)
- arch/x86/include/uapi/asm/mman.h | 1 +
- arch/x86/kernel/sys_x86_64.c | 6 +++++-
- include/linux/mman.h | 4 ++++
- 3 files changed, 10 insertions(+), 1 deletion(-)
-
-diff --git a/arch/x86/include/uapi/asm/mman.h b/arch/x86/include/uapi/asm/mman.h
-index 775dbd3aff73..5a0256e73f1e 100644
---- a/arch/x86/include/uapi/asm/mman.h
-+++ b/arch/x86/include/uapi/asm/mman.h
-@@ -3,6 +3,7 @@
- #define _ASM_X86_MMAN_H
+diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
+index d81e7ec27507..2e3d8cca1195 100644
+--- a/arch/x86/include/asm/pgtable.h
++++ b/arch/x86/include/asm/pgtable.h
+@@ -1638,6 +1638,11 @@ static inline bool __pte_access_permitted(unsigned long pteval, bool write)
+ {
+ unsigned long need_pte_bits = _PAGE_PRESENT|_PAGE_USER;
- #define MAP_32BIT 0x40 /* only give out 32bit addresses */
-+#define MAP_ABOVE4G 0x80 /* only map above 4GB */
++ /*
++ * Write=0,Dirty=1 PTEs are shadow stack, which the kernel
++ * shouldn't generally allow access to, but since they
++ * are already Write=0, the below logic covers both cases.
++ */
+ if (write)
+ need_pte_bits |= _PAGE_RW;
- #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
- #define arch_calc_vm_prot_bits(prot, key) ( \
-diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
-index 8cc653ffdccd..06378b5682c1 100644
---- a/arch/x86/kernel/sys_x86_64.c
-+++ b/arch/x86/kernel/sys_x86_64.c
-@@ -193,7 +193,11 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
+diff --git a/mm/gup.c b/mm/gup.c
+index eab18ba045db..e7c7bcc0e268 100644
+--- a/mm/gup.c
++++ b/mm/gup.c
+@@ -978,7 +978,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
+ return -EFAULT;
- info.flags = VM_UNMAPPED_AREA_TOPDOWN;
- info.length = len;
-- info.low_limit = PAGE_SIZE;
-+ if (!in_32bit_syscall() && (flags & MAP_ABOVE4G))
-+ info.low_limit = 0x100000000;
-+ else
-+ info.low_limit = PAGE_SIZE;
-+
- info.high_limit = get_mmap_base(0);
-
- /*
-diff --git a/include/linux/mman.h b/include/linux/mman.h
-index 58b3abd457a3..32156daa985a 100644
---- a/include/linux/mman.h
-+++ b/include/linux/mman.h
-@@ -15,6 +15,9 @@
- #ifndef MAP_32BIT
- #define MAP_32BIT 0
- #endif
-+#ifndef MAP_ABOVE4G
-+#define MAP_ABOVE4G 0
-+#endif
- #ifndef MAP_HUGE_2MB
- #define MAP_HUGE_2MB 0
- #endif
-@@ -50,6 +53,7 @@
- | MAP_STACK \
- | MAP_HUGETLB \
- | MAP_32BIT \
-+ | MAP_ABOVE4G \
- | MAP_HUGE_2MB \
- | MAP_HUGE_1GB)
-
+ if (write) {
+- if (!(vm_flags & VM_WRITE)) {
++ if (!(vm_flags & VM_WRITE) || (vm_flags & VM_SHADOW_STACK)) {
+ if (!(gup_flags & FOLL_FORCE))
+ return -EFAULT;
+ /* hugetlb does not support FOLL_FORCE|FOLL_WRITE. */
--
2.17.1