Re: [RFC PATCH v3 00/15] pkeys-based page table hardening

From: Maxwell Bland <hidden>
Date: 2025-03-06 16:24:44
Also in: linux-mm, lkml

On Mon, Feb 03, 2025 at 10:18:24AM +0000, Kevin Brodsky wrote:

This is a proposal to leverage protection keys (pkeys) to harden
critical kernel data, by making it mostly read-only. The series includes
a simple framework called "kpkeys" to manipulate pkeys for in-kernel use,
as well as a page table hardening feature based on that framework
(kpkeys_hardened_pgtables). Both are implemented on arm64 as a proof of
concept, but they are designed to be compatible with any architecture
implementing pkeys.

Hi Kevin,

This is awesome. When working on some of these problems, I also thought
of leveraging the POE feature, but was not sure of a good way to make
it work.

Page tables were chosen as they are a popular (and critical) target for
attacks, but there are of course many others - this is only a starting
point (see section "Further use-cases"). It has become more and more
common for accesses to such target data to be mediated by a hypervisor
in vendor kernels; the hope is that kpkeys can provide much of that
protection in a simpler manner. No benchmarking has been performed at
this stage, but the runtime overhead should also be lower (though likely
not negligible).

Some notes here, having implemented similar page table protections,
albeit using stage-2 page table permissions and a fault handler.

https://lore.kernel.org/all/2wf4kmoqqmod6njviaq33lbxbx6gvdqbksljxykgltwnxo6ruy@7ueumwmxxh72/

I wanted to know your thoughts on associating specific policies with
page table updates in cases where an adversary is able to corrupt
other state that feeds parameters into the page table infrastructure
code, e.g.

arch/arm64/net/bpf_jit_comp.c
2417:		if (set_memory_rw(PAGE_MASK & ((uintptr_t)&plt->target),

Is this something we would assume is handled via the security_* hook
infrastructure, a shadow stack CFI method, changing the kernel code
to reverify the data non-modularly, or some combination of the above?
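
For illustration, the "reverify the data" option could look like a
wrapper that checks the requested range against a policy record before
any permission change. This is a userspace model only, not kernel code:
struct rw_policy, checked_set_memory_rw() and model_set_memory_rw() are
hypothetical names (the last is a stand-in for set_memory_rw()), and the
policy record is assumed to live in memory the adversary cannot write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PAGE_MASK (~(MODEL_PAGE_SIZE - 1))

/* Hypothetical policy: the only region this caller may make writable. */
struct rw_policy {
	uintptr_t start;  /* page-aligned start of the region */
	size_t npages;    /* length of the region in pages */
};

/* Stand-in for set_memory_rw(); it only counts calls in this model. */
int model_rw_calls;

static int model_set_memory_rw(uintptr_t addr, size_t numpages)
{
	(void)addr;
	(void)numpages;
	model_rw_calls++;
	return 0;
}

/* Re-check the (possibly corrupted) arguments before changing permissions. */
int checked_set_memory_rw(const struct rw_policy *pol, uintptr_t addr,
			  size_t numpages)
{
	uintptr_t page = addr & MODEL_PAGE_MASK;
	size_t first;

	if (numpages == 0 || page < pol->start)
		return -1;
	first = (page - pol->start) / MODEL_PAGE_SIZE;
	if (first >= pol->npages || numpages > pol->npages - first)
		return -1;
	return model_set_memory_rw(page, numpages);
}
```

A request that falls outside the recorded region is rejected before the
permission change is attempted. Where the policy comes from, and who is
allowed to register it, is exactly the open question above.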

- Pages in the linear mapping are assigned a pkey using set_memory_pkey().
  This is sufficient for this series, but of course higher-level
  interfaces can be introduced later to ask allocators to return pages
  marked with a given pkey. It should also be possible to extend this to
  vmalloc() if needed.

One of the interesting points here, acknowledged below, is that this
relies on having guarantees around the PC value/CFI of the function.
Since this is the baseline assumption, it feels natural for the
locking/unlocking to be integrated into the existing CFI
instrumentation, since (from experience) point-patching each mutable
data structure allocation/deallocation is difficult.

# Further use-cases

It should be possible to harden various targets using kpkeys, including:

- struct cred (enforcing a "mostly read-only" state once committed)

- fixmap (occasionally used even after early boot, e.g.
  set_swapper_pgd() in arch/arm64/mm/mmu.c)

- SELinux state (e.g. struct selinux_state::initialized)

... and many others.

Be wary that it is not just struct cred but also pointers to struct
cred. We quickly run into a bookkeeping problem: for updates to f_op,
for example, we need to track which f_op values are OK to have in a
given file, e.g., in Android's 6.1.x:

drivers/gpu/drm/i810/i810_dma.c
138:    file_priv->filp->f_op = &i810_buffer_fops;
144:    file_priv->filp->f_op = old_fops;

drivers/staging/android/ashmem.c
436:            vmfile->f_op = &vmfile_fops;

Overwriting f_op is a "classic" bypass of protection systems like
this one.

I think this problem may be totally solvable if POE were integrated into
something like CFI, since we could then guarantee that only the code
that sets f_op to "vmfile_fops" can unlock/relock the file's page.

Maybe another approach would work better, though?
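
The f_op bookkeeping described above can be pictured as an allowlist
check in front of the store. The sketch below is a userspace model with
hypothetical names throughout (struct model_file, model_set_f_op(), the
two ops tables); it does not show the CFI or pkey side, only the
"which f_op is OK" decision, and it assumes the allowlist itself is
read-only to the adversary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for struct file_operations and struct file. */
struct model_fops { const char *name; };
struct model_file { const struct model_fops *f_op; };

const struct model_fops model_vmfile_fops = { "vmfile" };
const struct model_fops model_rogue_fops = { "rogue" };

/* Ops tables this setter may install (assumed immutable). */
static const struct model_fops *const model_allowed_fops[] = {
	&model_vmfile_fops,
};

static bool model_fops_allowed(const struct model_fops *ops)
{
	size_t n = sizeof(model_allowed_fops) / sizeof(model_allowed_fops[0]);

	for (size_t i = 0; i < n; i++)
		if (model_allowed_fops[i] == ops)
			return true;
	return false;
}

/* Only pointers on the allowlist are ever written into the file. */
int model_set_f_op(struct model_file *file, const struct model_fops *ops)
{
	if (!model_fops_allowed(ops))
		return -1;
	/* A kpkeys-style design would unlock/relock around this store. */
	file->f_op = ops;
	return 0;
}
```

With one allowlist per call site, this is per-site bookkeeping; tying
the check to CFI is what would let it scale beyond hand-written setters.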

# Open questions

A few aspects in this RFC that are debatable and/or worth discussing:

- There is currently no restriction on how kpkeys levels map to pkeys
  permissions. A typical approach is to allocate one pkey per level and
  make it writable at that level only. As the number of levels
  increases, we may however run out of pkeys, especially on arm64 (just
  8 pkeys with POE). Depending on the use-cases, it may be acceptable to
  use the same pkey for the data associated to multiple levels.

Honestly, I associate each protected virtual page in stage-2 with a
unique tag (manually, right now, but Kees Cook has some magic that
does the same via alloc_tag.h), and this works really well to track
specific resources and resource modification semantics "over" a generic
protection ring.

I think, though, that the code you provided could be used to bootstrap
such a system by using the overlay to protect a similar page tag lookup
table, which then can provide the fine-grained protection semantics.

I.e. use this baseline to isolate a secure monitor system.
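
A rough userspace model of such a page tag lookup table, with
hypothetical names and a toy 16-page address space: each page carries a
tag, and a write is allowed only when the writer presents the matching
tag. The table is assumed to sit in memory guarded by the overlay, which
is not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_PAGES 16

/* Hypothetical tags; a real system might derive them from allocation sites. */
enum model_page_tag {
	MODEL_TAG_NONE = 0,
	MODEL_TAG_PGTABLE,
	MODEL_TAG_CRED,
};

/* One tag per page; assumed to be protected by the overlay. */
static uint8_t model_page_tags[MODEL_NR_PAGES];

/* Tag a page; refuse to retag a page owned by a different resource. */
int model_tag_page(size_t pfn, enum model_page_tag tag)
{
	if (pfn >= MODEL_NR_PAGES)
		return -1;
	if (model_page_tags[pfn] != MODEL_TAG_NONE &&
	    model_page_tags[pfn] != (uint8_t)tag)
		return -1;
	model_page_tags[pfn] = (uint8_t)tag;
	return 0;
}

/* Untagged pages are unrestricted; tagged pages need the matching tag. */
bool model_write_allowed(size_t pfn, enum model_page_tag writer)
{
	if (pfn >= MODEL_NR_PAGES)
		return false;
	if (model_page_tags[pfn] == MODEL_TAG_NONE)
		return true;
	return model_page_tags[pfn] == (uint8_t)writer;
}
```

The point of the split is that the number of tags is not bounded by the
number of hardware pkeys: one pkey protects the table, and the table
carries the fine-grained semantics.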

Hopefully that makes sense! (-:

- kpkeys_set_level() and kpkeys_restore_pkey_reg() are not symmetric:
  the former takes a kpkeys level and returns a pkey register value, to
  be consumed by the latter. It would be more intuitive to manipulate
  kpkeys levels only. However this assumes that there is a 1:1 mapping
  between kpkeys levels and pkey register values, while in principle
  the mapping is 1:n (certain pkeys may be used outside the kpkeys
  framework).
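
To make the asymmetry concrete, here is a userspace model, not the
series' actual code. The model_* names, the one-bit-per-pkey register
layout and the "returns the previous register value" behaviour are all
assumptions made for illustration. It shows why restoring a register
value rather than a level preserves bits owned by users outside the
framework.

```c
#include <assert.h>
#include <stdint.h>

enum { MODEL_LVL_DEFAULT = 0, MODEL_LVL_PGTABLES = 1 };

/* Hypothetical layout: one "writable" bit per pkey. */
#define MODEL_PKEY_PGTABLES_W (UINT64_C(1) << 1) /* managed by the model */
#define MODEL_PKEY_OTHER_W    (UINT64_C(1) << 7) /* owned by someone else */
#define MODEL_KPKEYS_MASK     MODEL_PKEY_PGTABLES_W

/* Stand-in for the per-CPU pkey register. */
uint64_t model_pkey_reg;

/* Takes a level, returns the previous register value (assumed behaviour). */
uint64_t model_kpkeys_set_level(int level)
{
	uint64_t prev = model_pkey_reg;
	uint64_t bits = (level == MODEL_LVL_PGTABLES) ? MODEL_PKEY_PGTABLES_W : 0;

	/* Only touch the bits the framework owns. */
	model_pkey_reg = (prev & ~MODEL_KPKEYS_MASK) | bits;
	return prev;
}

/* Consumes a register value, not a level. */
void model_kpkeys_restore_pkey_reg(uint64_t prev)
{
	model_pkey_reg = prev;
}

/* A page-table store that fails unless the page-table pkey is writable. */
int model_pgtable_write(uint64_t *slot, uint64_t val)
{
	if (!(model_pkey_reg & MODEL_PKEY_PGTABLES_W))
		return -1;
	*slot = val;
	return 0;
}

/* Guarded setter: raise the level, write, put the register back. */
int model_set_pte(uint64_t *slot, uint64_t val)
{
	uint64_t prev = model_kpkeys_set_level(MODEL_LVL_PGTABLES);
	int ret = model_pgtable_write(slot, val);

	model_kpkeys_restore_pkey_reg(prev);
	return ret;
}
```

In this model a direct model_pgtable_write() fails at the default level,
while model_set_pte() succeeds and leaves MODEL_PKEY_OTHER_W exactly as
it found it.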

Another assumption I'm not confident in is the adversary's
inability to manipulate system control registers. This holds in the
context of a Heki-like system (or any well-made HVCI), but is it
entirely true of a pure EL1 implementation?

- An architecture that supports kpkeys is expected to select
  CONFIG_ARCH_HAS_KPKEYS and always enable them if available - there is
  no CONFIG_KPKEYS to control this behaviour. Since this creates no
  significant overhead (at least on arm64), it seemed better to keep it
  simple. Each hardening feature does have its own option and arch
  opt-in if needed (CONFIG_KPKEYS_HARDENED_PGTABLES,
  CONFIG_ARCH_HAS_KPKEYS_HARDENED_PGTABLES).

There are so many pieces of data/data structures, though! In this model,
you'd have a separate switch for thousands of types of data! But I do
think protecting PTs is the first step to a more complicated security
monitor, since it allows you to have integrity for specific physical
pages (or IPAs).


Any comment or feedback will be highly appreciated, be it on the
high-level approach or implementation choices!

Last note, I'd not totallllyyy trust the compiler to inline the
functions... I've met cases where memory protection functions
I expected to be inlined were not. I think __always_inline *may* work
here, vs standard "static inline", but am not confident/sure.
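
For reference, the kernel's forced-inline annotation is __always_inline,
which expands to the GCC/Clang always_inline attribute; plain "inline"
is only a hint. A minimal userspace sketch follows, using a local macro
(model_always_inline, a made-up name) so as not to clash with libc
headers, and a toy register swap standing in for the real helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Same attribute the kernel macro uses; the name here is hypothetical. */
#define model_always_inline inline __attribute__((__always_inline__))

/* Toy register swap: store a new value, hand back the old one. */
static model_always_inline uint64_t model_swap_reg(uint64_t *reg, uint64_t val)
{
	uint64_t prev = *reg;

	*reg = val;
	return prev;
}

/* Unlock, (do the protected write here), then restore the old value. */
uint64_t model_with_unlocked(uint64_t *reg)
{
	uint64_t prev = model_swap_reg(reg, 1);

	model_swap_reg(reg, prev);
	return prev;
}
```

With the attribute, GCC and Clang inline the helper even at -O0 and
diagnose the cases where they cannot. Checking the disassembly is still
the only way to be sure no out-of-line copy of a register-writing
helper ends up in the image.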

Hopefully some of the above is useful. Thanks!

Maxwell Bland
