Thread: 27 messages, 8 authors, 2025-11-27

Re: [DISCUSSION] kstack offset randomization: bugs and performance

From: Mark Rutland <mark.rutland@arm.com>
Date: 2025-11-18 11:25:09
Also in: lkml

On Tue, Nov 18, 2025 at 10:28:29AM +0000, Ryan Roberts wrote:
On 17/11/2025 20:27, Kees Cook wrote:
On Mon, Nov 17, 2025 at 11:31:22AM +0000, Ryan Roberts wrote:
On 17/11/2025 11:30, Ryan Roberts wrote:
The original rationale for a separate choose_random_kstack_offset() at the end
of the syscall is described as:

 * This position in the syscall flow is done to
 * frustrate attacks from userspace attempting to learn the next offset:
 * - Maximize the timing uncertainty visible from userspace: if the
 *   offset is chosen at syscall entry, userspace has much more control
 *   over the timing between choosing offsets. "How long will we be in
 *   kernel mode?" tends to be more difficult to predict than "how long
 *   will we be in user mode?"
 * - Reduce the lifetime of the new offset sitting in memory during
 *   kernel mode execution. Exposure of "thread-local" memory content
 *   (e.g. current, percpu, etc) tends to be easier than arbitrary
 *   location memory exposure.

I'm not totally convinced by the first argument; for arches that use the tsc,
sampling the tsc at syscall entry would mean that userspace can figure out the
random value that will be used for syscall N by sampling the tsc and adding a
bit just before calling syscall N. Sampling the tsc at syscall exit would mean
that userspace can figure out the random value that will be used for syscall N
by sampling the tsc and subtracting a bit just after syscall N-1 returns. I
don't really see any difference in protection?

If you're trying to force the kernel-sampled tsc to be a specific value, then for
the sample-on-exit case, userspace can just make a syscall with an invalid id as
its syscall N-1 and in that case the duration between entry and exit is tiny
and fixed so it's still pretty simple to force the value.
FWIW, I agree. I don't think we're gaining much based on the placement
of choose_random_kstack_offset() at the start/end of the entry/exit
sequences.

As an aside, it looks like x86 calls choose_random_kstack_offset() for
*any* return to userspace, including non-syscall returns (e.g. from
IRQ), in arch_exit_to_user_mode_prepare(). There's some additional
randomness/perturbation that'll cause, but logically it's not necessary
to do that for *all* returns to userspace.
So what do you think of this approach?

#define add_random_kstack_offset(rand) do {				\
	if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,	\
				&randomize_kstack_offset)) {		\
		u32 offset = raw_cpu_read(kstack_offset);		\
		u8 *ptr;						\
									\
		offset = ror32(offset, 5) ^ (rand);			\
		raw_cpu_write(kstack_offset, offset);			\
		ptr = __kstack_alloca(KSTACK_OFFSET_MAX(offset));	\
		/* Keep allocation even after "ptr" loses scope. */	\
		asm volatile("" :: "r"(ptr) : "memory");		\
	}								\
} while (0)

This ignores "Maximize the timing uncertainty" (but that's ok because the
current version doesn't really do that either), but strengthens "Reduce the
lifetime of the new offset sitting in memory".
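
To make the mixing step in the macro above concrete, here is a minimal userspace sketch of the rotate-and-XOR update and the masking that turns the state word into an alloca size. It is not kernel code: ror32() is reimplemented locally, the function names mix_kstack_offset() and kstack_alloca_size() are hypothetical, and the mask value is an assumption about what KSTACK_OFFSET_MAX() currently expands to.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's ror32(): rotate a 32-bit word right. */
static uint32_t ror32(uint32_t word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

/*
 * Assumption: KSTACK_OFFSET_MAX() keeps bits 2..9 of the state word,
 * i.e. a 4-byte-aligned offset of at most 1020 bytes.
 */
#define KSTACK_OFFSET_MASK 0x3FCu

/*
 * Hypothetical helper modelling the proposed update: rotate the old
 * state by 5 bits, then fold in the fresh entropy passed as 'rand'.
 */
static uint32_t mix_kstack_offset(uint32_t state, uint32_t rand)
{
	return ror32(state, 5) ^ rand;
}

/* Hypothetical helper: the size that would be handed to the alloca. */
static uint32_t kstack_alloca_size(uint32_t state)
{
	return state & KSTACK_OFFSET_MASK;
}
```

In this model the state is updated and consumed in the same step at syscall entry, so the freshly mixed value is used immediately rather than sitting in memory for the duration of the syscall.
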
Is this assuming that 'rand' can be generated in a non-preemptible
context? If so (and this is non-preemptible), that's fine.

I'm not sure whether that was the intent, or this was ignoring the
rescheduling problem.

If we do this per-task, then that concern disappears, and this can all
be preemptible.
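
A minimal userspace sketch of the difference, assuming the rescheduling concern is that a task can be migrated between the raw_cpu_read() and the raw_cpu_write() of the shared per-CPU word. Everything here is a hypothetical model rather than kernel code: percpu_offset[], struct task, percpu_update() and task_update() are illustrative names, and the migration is simulated by passing different CPU indices for the read and the write.

```c
#include <stdint.h>

#define NR_CPUS 2

/* Userspace stand-in for the kernel's ror32(). */
static uint32_t ror32(uint32_t word, unsigned int shift)
{
	return (word >> (shift & 31)) | (word << ((-shift) & 31));
}

/* Hypothetical model of the per-CPU state word. */
static uint32_t percpu_offset[NR_CPUS];

/* Hypothetical model of a task that carries its own state word. */
struct task {
	uint32_t kstack_offset;
};

/*
 * Per-CPU update in a preemptible context: the read and the write may
 * land on different CPUs if the task is migrated in between.
 */
static void percpu_update(int read_cpu, int write_cpu, uint32_t rand)
{
	uint32_t offset = percpu_offset[read_cpu];

	offset = ror32(offset, 5) ^ rand;
	percpu_offset[write_cpu] = offset;
}

/*
 * Per-task update: the state travels with the task, so it does not
 * matter which CPU each step runs on.
 */
static void task_update(struct task *t, uint32_t rand)
{
	t->kstack_offset = ror32(t->kstack_offset, 5) ^ rand;
}
```

With migration, percpu_update(0, 1, rand) leaves CPU 0's word unchanged and overwrites CPU 1's word with a value derived from CPU 0's state. task_update() has no equivalent failure mode in this model.
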

Mark.