Re: [PATCH v3 2/3] prandom: Convert prandom_u32_state() to __always_inline
From: Ryan Roberts <ryan.roberts@arm.com>
Date: 2026-01-05 10:36:39
Also in:
linux-arm-kernel, linux-hardening, linux-riscv, linux-s390, lkml, loongarch
On 03/01/2026 08:00, Christophe Leroy (CS GROUP) wrote:
> On 02/01/2026 at 15:09, Ryan Roberts wrote:
>> On 02/01/2026 13:39, Jason A. Donenfeld wrote:
>>> Hi Ryan,
>>>
>>> On Fri, Jan 2, 2026 at 2:12 PM Ryan Roberts wrote:
>>>> context. Given the function is just a handful of operations and doesn't
>>>
>>> How many? What's this looking like in terms of assembly?
>>
>> 25 instructions on arm64:
>
> 31 instructions on powerpc:
>
> 00000000 <prandom_u32_state>:
>    0:   7c 69 1b 78     mr      r9,r3
>    4:   80 63 00 00     lwz     r3,0(r3)
>    8:   80 89 00 08     lwz     r4,8(r9)
>    c:   81 69 00 04     lwz     r11,4(r9)
>   10:   80 a9 00 0c     lwz     r5,12(r9)
>   14:   54 67 30 32     slwi    r7,r3,6
>   18:   7c e7 1a 78     xor     r7,r7,r3
>   1c:   55 66 10 3a     slwi    r6,r11,2
>   20:   54 88 68 24     slwi    r8,r4,13
>   24:   54 63 90 18     rlwinm  r3,r3,18,0,12
>   28:   7d 6b 32 78     xor     r11,r11,r6
>   2c:   7d 08 22 78     xor     r8,r8,r4
>   30:   54 aa 18 38     slwi    r10,r5,3
>   34:   54 e7 9b 7e     srwi    r7,r7,13
>   38:   7c e7 1a 78     xor     r7,r7,r3
>   3c:   51 66 2e fe     rlwimi  r6,r11,5,27,31
>   40:   54 84 38 28     rlwinm  r4,r4,7,0,20
>   44:   7d 4a 2a 78     xor     r10,r10,r5
>   48:   55 08 5d 7e     srwi    r8,r8,21
>   4c:   7d 08 22 78     xor     r8,r8,r4
>   50:   7c e3 32 78     xor     r3,r7,r6
>   54:   54 a5 68 16     rlwinm  r5,r5,13,0,11
>   58:   55 4a a3 3e     srwi    r10,r10,12
>   5c:   7d 4a 2a 78     xor     r10,r10,r5
>   60:   7c 63 42 78     xor     r3,r3,r8
>   64:   90 e9 00 00     stw     r7,0(r9)
>   68:   90 c9 00 04     stw     r6,4(r9)
>   6c:   91 09 00 08     stw     r8,8(r9)
>   70:   91 49 00 0c     stw     r10,12(r9)
>   74:   7c 63 52 78     xor     r3,r3,r10
>   78:   4e 80 00 20     blr
>
> Among those, 8 instructions are for reading/writing the state in stack.
> They of course disappear when inlining.
>
>>> It'd also be nice to have some brief analysis of other call sites to
>>> have confirmation this isn't blowing up other users.
>>
>> I compiled defconfig before and after this patch on arm64 and compared
>> the text sizes:
>>
>> $ ./scripts/bloat-o-meter -t vmlinux.before vmlinux.after
>> add/remove: 3/4 grow/shrink: 4/1 up/down: 836/-128 (708)
>> Function                                     old     new   delta
>> prandom_seed_full_state                      364     932    +568
>> pick_next_task_fair                         1940    2036     +96
>> bpf_user_rnd_u32                             104     196     +92
>> prandom_bytes_state                          204     260     +56
>> e843419@0f2b_00012d69_e34                      -       8      +8
>> e843419@0db7_00010ec3_23ec                     -       8      +8
>> e843419@02cb_00003767_25c                      -       8      +8
>> bpf_prog_select_runtime                      448     444      -4
>> e843419@0aa3_0000cfd1_1580                     8       -      -8
>> e843419@0aa2_0000cfba_147c                     8       -      -8
>> e843419@075f_00008d8c_184                      8       -      -8
>> prandom_u32_state                            100       -    -100
>> Total: Before=19078072, After=19078780, chg +0.00%
>>
>> So 708 bytes more after inlining. The main cost is
>> prandom_seed_full_state(), which calls prandom_u32_state() 10 times (via
>> prandom_warmup()). I expect we could turn that into a loop to reduce ~450
>> bytes overall.
>
> With the following change, the increase of prandom_seed_full_state()
> remains reasonable, and performance-wise it is a lot better as it avoids
> the read/write of the state via the stack:
>
> diff --git a/lib/random32.c b/lib/random32.c
> index 24e7acd9343f6..28a5b109c9018 100644
> --- a/lib/random32.c
> +++ b/lib/random32.c
> @@ -94,17 +94,11 @@ EXPORT_SYMBOL(prandom_bytes_state);
>  
>  static void prandom_warmup(struct rnd_state *state)
>  {
> +	int i;
> +
>  	/* Calling RNG ten times to satisfy recurrence condition */
> -	prandom_u32_state(state);
> -	prandom_u32_state(state);
> -	prandom_u32_state(state);
> -	prandom_u32_state(state);
> -	prandom_u32_state(state);
> -	prandom_u32_state(state);
> -	prandom_u32_state(state);
> -	prandom_u32_state(state);
> -	prandom_u32_state(state);
> -	prandom_u32_state(state);
> +	for (i = 0; i < 10; i++)
> +		prandom_u32_state(state);
>  }
>  
>  void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state)
>
> The loop is:
>
>  248:   38 e0 00 0a     li      r7,10
>  24c:   7c e9 03 a6     mtctr   r7
>  250:   55 05 30 32     slwi    r5,r8,6
>  254:   55 46 68 24     slwi    r6,r10,13
>  258:   55 27 18 38     slwi    r7,r9,3
>  25c:   7c a5 42 78     xor     r5,r5,r8
>  260:   7c c6 52 78     xor     r6,r6,r10
>  264:   7c e7 4a 78     xor     r7,r7,r9
>  268:   54 8b 10 3a     slwi    r11,r4,2
>  26c:   7d 60 22 78     xor     r0,r11,r4
>  270:   54 a5 9b 7e     srwi    r5,r5,13
>  274:   55 08 90 18     rlwinm  r8,r8,18,0,12
>  278:   54 c6 5d 7e     srwi    r6,r6,21
>  27c:   55 4a 38 28     rlwinm  r10,r10,7,0,20
>  280:   54 e7 a3 3e     srwi    r7,r7,12
>  284:   55 29 68 16     rlwinm  r9,r9,13,0,11
>  288:   7d 64 5b 78     mr      r4,r11
>  28c:   7c a8 42 78     xor     r8,r5,r8
>  290:   7c ca 52 78     xor     r10,r6,r10
>  294:   7c e9 4a 78     xor     r9,r7,r9
>  298:   50 04 2e fe     rlwimi  r4,r0,5,27,31
>  29c:   42 00 ff b4     bdnz    250 <prandom_seed_full_state+0x7c>
>
> Which replaces the 10 calls to prandom_u32_state():
>
>   fc:   91 3f 00 0c     stw     r9,12(r31)
>  100:   7f e3 fb 78     mr      r3,r31
>  104:   48 00 00 01     bl      104 <prandom_seed_full_state+0x88>
>                         104: R_PPC_REL24        prandom_u32_state
>  108:   7f e3 fb 78     mr      r3,r31
>  10c:   48 00 00 01     bl      10c <prandom_seed_full_state+0x90>
>                         10c: R_PPC_REL24        prandom_u32_state
>  110:   7f e3 fb 78     mr      r3,r31
>  114:   48 00 00 01     bl      114 <prandom_seed_full_state+0x98>
>                         114: R_PPC_REL24        prandom_u32_state
>  118:   7f e3 fb 78     mr      r3,r31
>  11c:   48 00 00 01     bl      11c <prandom_seed_full_state+0xa0>
>                         11c: R_PPC_REL24        prandom_u32_state
>  120:   7f e3 fb 78     mr      r3,r31
>  124:   48 00 00 01     bl      124 <prandom_seed_full_state+0xa8>
>                         124: R_PPC_REL24        prandom_u32_state
>  128:   7f e3 fb 78     mr      r3,r31
>  12c:   48 00 00 01     bl      12c <prandom_seed_full_state+0xb0>
>                         12c: R_PPC_REL24        prandom_u32_state
>  130:   7f e3 fb 78     mr      r3,r31
>  134:   48 00 00 01     bl      134 <prandom_seed_full_state+0xb8>
>                         134: R_PPC_REL24        prandom_u32_state
>  138:   7f e3 fb 78     mr      r3,r31
>  13c:   48 00 00 01     bl      13c <prandom_seed_full_state+0xc0>
>                         13c: R_PPC_REL24        prandom_u32_state
>  140:   7f e3 fb 78     mr      r3,r31
>  144:   48 00 00 01     bl      144 <prandom_seed_full_state+0xc8>
>                         144: R_PPC_REL24        prandom_u32_state
>  148:   80 01 00 24     lwz     r0,36(r1)
>  14c:   7f e3 fb 78     mr      r3,r31
>  150:   83 e1 00 1c     lwz     r31,28(r1)
>  154:   7c 08 03 a6     mtlr    r0
>  158:   38 21 00 20     addi    r1,r1,32
>  15c:   48 00 00 00     b       15c <prandom_seed_full_state+0xe0>
>                         15c: R_PPC_REL24        prandom_u32_state
>
> So approximately the same size in number of instructions, with better
> performance.
>
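As an illustration only (this is not part of the thread): a minimal userspace sketch of the generator step and of the two warm-up variants being compared. The shift counts and masks are read off the powerpc listing quoted in this message; they should match the step in lib/random32.c, but treat that as an assumption. All names ending in `_sketch` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's struct rnd_state. */
struct rnd_state_sketch {
	uint32_t s1, s2, s3, s4;
};

/* One shift/mask/xor step on a single state word, as seen in the listing. */
#define TAUS_STEP_SKETCH(s, a, b, c, d) \
	((((s) & (c)) << (d)) ^ ((((s) << (a)) ^ (s)) >> (b)))

/* Advance all four state words and return their xor. */
static inline uint32_t prandom_step_sketch(struct rnd_state_sketch *st)
{
	st->s1 = TAUS_STEP_SKETCH(st->s1,  6U, 13U, 0xfffffffeU, 18U);
	st->s2 = TAUS_STEP_SKETCH(st->s2,  2U, 27U, 0xfffffff8U,  2U);
	st->s3 = TAUS_STEP_SKETCH(st->s3, 13U, 21U, 0xfffffff0U,  7U);
	st->s4 = TAUS_STEP_SKETCH(st->s4,  3U, 12U, 0xffffff80U, 13U);
	return st->s1 ^ st->s2 ^ st->s3 ^ st->s4;
}

/* Warm-up written as a loop, mirroring the proposed diff. */
void warmup_loop_sketch(struct rnd_state_sketch *st)
{
	for (int i = 0; i < 10; i++)
		prandom_step_sketch(st);
}

/* Warm-up written as ten explicit calls, mirroring the existing code. */
void warmup_unrolled_sketch(struct rnd_state_sketch *st)
{
	prandom_step_sketch(st);
	prandom_step_sketch(st);
	prandom_step_sketch(st);
	prandom_step_sketch(st);
	prandom_step_sketch(st);
	prandom_step_sketch(st);
	prandom_step_sketch(st);
	prandom_step_sketch(st);
	prandom_step_sketch(st);
	prandom_step_sketch(st);
}
```

Both warm-up variants leave the state identical; the difference discussed in the thread is purely code size and whether the state stays in registers across iterations.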
>> I'm not really sure if 708 is good or bad...
>
> That's in the noise compared to the overall size of vmlinux, but if we
> change it to a loop we also reduce pressure on the cache.
Thanks for the analysis; I'm going to follow David's suggestion and refactor
this into both an __always_inline and an out-of-line version. That way the
existing callsites can continue to use the out-of-line version and we will
only use the inline version for the kstack offset randomization.

Thanks,
Ryan
> Christophe
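As an illustration only (this is not part of the thread, and not the eventual patch): one way the "inline plus out-of-line" split could look in a userspace sketch. The function names, the struct, and the attribute macros are hypothetical stand-ins; the attributes are the GCC/Clang spellings, and the generator constants are read off the powerpc listing quoted in this message.

```c
#include <stdint.h>

/* GCC/Clang attribute spellings; hypothetical macro names for this sketch. */
#define ALWAYS_INLINE_DEMO static inline __attribute__((always_inline))
#define NOINLINE_DEMO      __attribute__((noinline))

/* Hypothetical userspace stand-in for the kernel's struct rnd_state. */
struct rnd_state_demo {
	uint32_t s1, s2, s3, s4;
};

#define TAUS_STEP_DEMO(s, a, b, c, d) \
	((((s) & (c)) << (d)) ^ ((((s) << (a)) ^ (s)) >> (b)))

/*
 * Inline variant: intended for a hot path where the state can stay in
 * registers and the call overhead matters.
 */
ALWAYS_INLINE_DEMO uint32_t prng_step_inline_demo(struct rnd_state_demo *st)
{
	st->s1 = TAUS_STEP_DEMO(st->s1,  6U, 13U, 0xfffffffeU, 18U);
	st->s2 = TAUS_STEP_DEMO(st->s2,  2U, 27U, 0xfffffff8U,  2U);
	st->s3 = TAUS_STEP_DEMO(st->s3, 13U, 21U, 0xfffffff0U,  7U);
	st->s4 = TAUS_STEP_DEMO(st->s4,  3U, 12U, 0xffffff80U, 13U);
	return st->s1 ^ st->s2 ^ st->s3 ^ st->s4;
}

/*
 * Out-of-line variant: a single shared copy that existing callers keep
 * calling, so their text size does not grow.
 */
NOINLINE_DEMO uint32_t prng_step_demo(struct rnd_state_demo *st)
{
	return prng_step_inline_demo(st);
}
```

With this shape, only callers that explicitly opt in to `prng_step_inline_demo()` pay the code-size cost; everything else shares the one out-of-line body.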