Re: [PATCH v10 8/9] powerpc/code-patching: Use temporary mm for Radix MMU
From: Nathan Chancellor <nathan@kernel.org>
Date: 2022-12-15 20:18:45
Hi Benjamin,
On Wed, Nov 09, 2022 at 03:51:11PM +1100, Benjamin Gray wrote:
From: "Christopher M. Riedl" <redacted>
x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. Another benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.
Mappings in the temporary mm can be set in the userspace portion of the
address-space.
Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use - this may include breakpoints set by perf.
Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
space. The patching address is randomized between PAGE_SIZE and
DEFAULT_MAP_WINDOW-PAGE_SIZE.
Bits of entropy with 64K page size on BOOK3S_64:
bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
bits of entropy = log2(128TB / 64K)
bits of entropy = 31
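As a quick check of the arithmetic above, here is a small Python sketch. It is not kernel code: the constant names only mirror the ones in the commit message, and `entropy_bits` is a hypothetical helper introduced for illustration.

```python
# Illustrative only: reproduces the entropy arithmetic from the commit message.
PAGE_SIZE = 64 * 1024                    # 64K pages
DEFAULT_MAP_WINDOW_USER64 = 128 << 40    # 128TB


def entropy_bits(window: int, page_size: int) -> int:
    """Return floor(log2(window / page_size)) using exact integer math."""
    slots = window // page_size          # number of page-aligned slots
    return slots.bit_length() - 1        # floor(log2(slots)) for slots >= 1


print(entropy_bits(DEFAULT_MAP_WINDOW_USER64, PAGE_SIZE))
```

With the values quoted above this prints 31, matching the commit message. Calling the same helper with a 4K page size, `entropy_bits(128 << 40, 4096)`, returns 35.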
The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
operates - by default the space above DEFAULT_MAP_WINDOW is not
available. Currently the Hash MMU does not use a temporary mm so
technically this upper limit isn't necessary; however, a larger
randomization range does not further "harden" this overall approach and
future work may introduce patching with a temporary mm on Hash as well.
Randomization occurs only once during initialization for each CPU as it
comes online.
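To make the address selection described above concrete, here is a minimal userspace sketch in Python of picking a page-aligned address between PAGE_SIZE and DEFAULT_MAP_WINDOW-PAGE_SIZE. It is not the code from the patch: `pick_patching_addr` is a hypothetical name, `secrets.randbelow` stands in for the kernel's random source, and treating the upper bound as exclusive is an assumption made for this example.

```python
import secrets

PAGE_SIZE = 64 * 1024            # 64K pages, as in the entropy example
DEFAULT_MAP_WINDOW = 128 << 40   # 128TB


def pick_patching_addr(rand=secrets.randbelow) -> int:
    """Pick a page-aligned address in [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE).

    Assumption: the upper bound is exclusive. `rand(n)` must return an
    integer in [0, n); it is a parameter so the choice can be tested.
    """
    # Skip the first page and the last page of the window.
    slots = DEFAULT_MAP_WINDOW // PAGE_SIZE - 2
    return PAGE_SIZE + rand(slots) * PAGE_SIZE


addr = pick_patching_addr()
print(hex(addr))
```

Choosing a slot index first and then multiplying by the page size keeps the result page-aligned by construction, so no masking step is needed afterwards.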
The patching page is mapped with PAGE_KERNEL to set EAA[0] for the PTE
which ignores the AMR (so no need to unlock/lock KUAP) according to
PowerISA v3.0b Figure 35 on Radix.
Based on x86 implementation:
commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")
and:
commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")
From: Benjamin Gray <redacted>
Synchronisation is done according to ISA 3.1B Book 3 Chapter 13
"Synchronization Requirements for Context Alterations". Switching the mm
is a change to the PID, which requires a CSI before and after the change,
and a hwsync between the last instruction that performs address
translation for an associated storage access and the change itself.
Instruction fetch is an associated storage access, but the instruction
address mappings are not being changed, so it should not matter which
context they use. We must still perform a hwsync to guard arbitrary
prior code that may have accessed a userspace address.
TLB invalidation is local and VA specific. Local because only this core
used the patching mm, and VA specific because we only care that the
writable mapping is purged. Leaving the other mappings intact is more
efficient, especially when performing many code patches in a row (e.g.,
as ftrace would).
Signed-off-by: Christopher M. Riedl <redacted>
Signed-off-by: Benjamin Gray <redacted>
Apologies if this has already been reported or fixed; I did a quick
search of lore and found nothing. I just bisected a crash on boot in
QEMU to this commit, which is in next-20221215 as c28c15b6d28a
("powerpc/code-patching: Use temporary mm for Radix MMU"). The initrd is
available at [1]; just run 'zstd -d' on it before use:
$ qemu-system-ppc64 \
-device ipmi-bmc-sim,id=bmc0 \
-device isa-ipmi-bt,bmc=bmc0,irq=10 \
-machine powernv \
-kernel arch/powerpc/boot/zImage.epapr \
-display none \
-initrd rootfs.cpio \
-m 2G \
-nodefaults \
-no-reboot \
-serial mon:stdio
...
[ 0.000000] dt-cpu-ftrs: setup for ISA 3000
[ 0.000000] dt-cpu-ftrs: final cpu/mmu features = 0x0003c06b8f5fb187 0x3c007041
[ 0.000000] Activating Kernel Userspace Access Prevention
[ 0.000000] Activating Kernel Userspace Execution Prevention
[ 0.000000] radix-mmu: Mapped 0x0000000000000000-0x0000000002760000 with 64.0 KiB pages (exec)
[ 0.000000] radix-mmu: Mapped 0x0000000002760000-0x0000000080000000 with 64.0 KiB pages
[ 0.000000] radix-mmu: Initializing Radix MMU
[ 0.000000] Linux version 6.1.0-rc2+ (nathan@dev-arch.thelio-3990X) (powerpc64-linux-gcc (GCC) 10.4.0, GNU ld (GNU Binutils) 2.39) #1 SMP Thu Dec 15 12:26:19 MST 2022
[ 0.000000] Found initrd at 0xc000000028000000:0xc0000000288c7400
[ 0.000000] Hardware name: IBM PowerNV (emulated by qemu) POWER9 0x4e1200 opal:v7.0 PowerNV
...
[ 0.208320] ------------[ cut here ]------------
[ 0.210605] kernel BUG at arch/powerpc/mm/pgtable.c:333!
[ 0.212314] Oops: Exception in kernel mode, sig: 5 [#1]
[ 0.213324] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA PowerNV
[ 0.214793] Modules linked in:
[ 0.215781] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.1.0-rc2+ #1
[ 0.216643] Hardware name: IBM PowerNV (emulated by qemu) POWER9 0x4e1200 opal:v7.0 PowerNV
[ 0.217958] NIP: c000000000089730 LR: c000000000089720 CTR: 0000000000000000
[ 0.218949] REGS: c000000003587740 TRAP: 0700 Not tainted (6.1.0-rc2+)
[ 0.219891] MSR: 9000000002029033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 24002844 XER: 00000000
[ 0.221323] CFAR: c00000000008a140 IRQMASK: 1
[ 0.221323] GPR00: c000000000089720 c0000000035879e0 c000000001d97000 c00c00000000d4c0
[ 0.221323] GPR04: 0000000000000e08 0000000000000015 0000000003520009 0000000003530005
[ 0.221323] GPR08: 0000000005005303 0000000000000001 c00c000000000000 0000000000000009
[ 0.221323] GPR12: c0000000000898e0 c000000002cc0000 c000000000012678 0000000000000000
[ 0.221323] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 0.221323] GPR20: 0000000000000000 0000000000000000 0000000000000000 c0000000029e3b60
[ 0.221323] GPR24: 00004b4678380000 c00000000350d700 00004b4678380000 c0000000035300c0
[ 0.221323] GPR28: 0000000000000000 c000000002a3aef0 0000000060000000 0000000000000000
[ 0.230113] NIP [c000000000089730] assert_pte_locked+0x180/0x1a0
[ 0.232181] LR [c000000000089720] assert_pte_locked+0x170/0x1a0
[ 0.233380] Call Trace:
[ 0.234106] [c0000000035879e0] [0000000060000000] 0x60000000 (unreliable)
[ 0.235821] [c000000003587a00] [c0000000000a7c58] patch_instruction+0x618/0x6d0
[ 0.237128] [c000000003587a80] [c00000000005a53c] arch_prepare_kprobe+0xfc/0x2d0
[ 0.238420] [c000000003587b00] [c0000000002b4690] register_kprobe+0x520/0x7c0
[ 0.239763] [c000000003587b70] [c000000002011d3c] arch_init_kprobes+0x28/0x3c
[ 0.241842] [c000000003587b90] [c000000002035848] init_kprobes+0x108/0x184
[ 0.244858] [c000000003587c00] [c000000000012090] do_one_initcall+0x60/0x2e0
[ 0.248262] [c000000003587cd0] [c000000002004f40] kernel_init_freeable+0x1f0/0x3e0
[ 0.251865] [c000000003587da0] [c0000000000126a4] kernel_init+0x34/0x1d0
[ 0.254577] [c000000003587e10] [c00000000000cf5c] ret_from_kernel_thread+0x5c/0x64
[ 0.258051] Code: 7c0802a6 706900a0 7d290074 7929d182 f8010010 f821ffe1 0b090000 480009dd 60000000 81230028 7d290034 5529d97e <0b090000> 38210020 e8010010 7c0803a6
[ 0.264482] ---[ end trace 0000000000000000 ]---
[ 0.266857]
[ 1.269398] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000005
[ 1.274246] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000005 ]---
This was initially noticed in our CI [2] but it is not related to clang,
as I did the bisect with GCC 10.4.0 from [3]. Attached is the .config
but it is just Fedora's rawhide configuration [4] if you have to grab it
again.
If there is any further information I can provide or patches I can test,
I am more than happy to do so.
Cheers,
Nathan
[1]: https://github.com/ClangBuiltLinux/boot-utils/blob/64b7d421f4d60b45e09fa81f0fe3d4ad96c99d6c/images/ppc64le/rootfs.cpio.zst
[2]: https://github.com/ClangBuiltLinux/continuous-integration2/actions/runs/3703987970/jobs/6277411497
[3]: https://mirrors.edge.kernel.org/pub/tools/crosstool/
[4]: https://src.fedoraproject.org/rpms/kernel/raw/rawhide/f/kernel-ppc64le-fedora.config
# bad: [459c73db4069c27c1d4a0e20d055b837396364b8] Add linux-next specific files for 20221215
# good: [6f1f5caed5bfadd1cc8bdb0563eb8874dc3573ca] Merge tag 'for-linus-6.2-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
git bisect start '459c73db4069c27c1d4a0e20d055b837396364b8' '6f1f5caed5bfadd1cc8bdb0563eb8874dc3573ca'
# bad: [6cc557c9b10bbf1f95abb2a871a4c9a3e3705500] Merge branch 'timers/drivers/next' of git://git.linaro.org/people/daniel.lezcano/linux.git
git bisect bad 6cc557c9b10bbf1f95abb2a871a4c9a3e3705500
# bad: [e31516b742ca321c68ff69f63ecbcc5f3458a9d0] Merge branch 'dev' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs.git
git bisect bad e31516b742ca321c68ff69f63ecbcc5f3458a9d0
# good: [532890942f39cc3008e62d48674fb26b19500770] Merge branch 'for-next/core' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
git bisect good 532890942f39cc3008e62d48674fb26b19500770
# good: [fc9dbec4fb187b43d79613f8ad7a42164bd7f748] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git
git bisect good fc9dbec4fb187b43d79613f8ad7a42164bd7f748
# bad: [dad765add65d564591ad6bd26d3299d672cd20d4] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git
git bisect bad dad765add65d564591ad6bd26d3299d672cd20d4
# bad: [505ea33089dcfc3ee3201b0fcb94751165805413] powerpc/64: Add big-endian ELFv2 flavour to crypto VMX asm generation
git bisect bad 505ea33089dcfc3ee3201b0fcb94751165805413
# good: [5825603f67bc5ff445a1847302884154f0afa627] powerpc/microwatt: Add litesd
git bisect good 5825603f67bc5ff445a1847302884154f0afa627
# good: [a9ffb8ee7b65a468474d6a2be7e9cca4b8f8ea5f] powerpc: Use "grep -E" instead of "egrep"
git bisect good a9ffb8ee7b65a468474d6a2be7e9cca4b8f8ea5f
# good: [d5090716be6791ada9ee142163a4934c1c147aaa] powerpc/book3e: remove #include <generated/utsrelease.h>
git bisect good d5090716be6791ada9ee142163a4934c1c147aaa
# good: [0f0a0a6091e678b1a75078ecd6b02176f3228dbb] cxl: Use radix__flush_all_mm instead of generic flush_all_mm
git bisect good 0f0a0a6091e678b1a75078ecd6b02176f3228dbb
# bad: [c28c15b6d28a776538482101522cbcd9f906b15c] powerpc/code-patching: Use temporary mm for Radix MMU
git bisect bad c28c15b6d28a776538482101522cbcd9f906b15c
# good: [274d842fa1efd9449e62222c8896e0be11621f1f] powerpc/tlb: Add local flush for page given mm_struct and psize
git bisect good 274d842fa1efd9449e62222c8896e0be11621f1f
# first bad commit: [c28c15b6d28a776538482101522cbcd9f906b15c] powerpc/code-patching: Use temporary mm for Radix MMU