Re: [PATCHv3 04/12] uprobes/x86: Move optimized uprobe from nop5 to nop10

From: Jiri Olsa <hidden>
Date: 2026-05-22 21:19:13
Also in: bpf

On Thu, May 21, 2026 at 03:35:48PM +0200, Peter Zijlstra wrote:
On Thu, May 21, 2026 at 02:44:03PM +0200, Jiri Olsa wrote:
Andrii reported an issue with optimized uprobes [1] that can clobber
redzone area with call instruction storing return address on stack
where user code may keep temporary data without adjusting rsp.

Fixing this by moving the optimized uprobes on top of 10-bytes nop
instruction, so we can squeeze another instruction to escape the
redzone area before doing the call, like:

  lea -0x80(%rsp), %rsp
  call tramp

Note the lea instruction is used to adjust the rsp register without
changing the flags.
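
As a rough illustration of why -0x80 is the right adjustment: the x86-64
SysV ABI red zone is the 128 bytes below rsp, and the disp8 byte 0x80
sign-extends to -128. The helper below is a hypothetical user-space sketch
(its name is not from the patch) that recognizes the 5-byte LEA encoding
and returns the displacement.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: recognize
 * "lea disp8(%rsp), %rsp" (48 8d 64 24 <disp8>) and return the
 * sign-extended displacement through 'disp'.
 * Returns 1 on a match, 0 otherwise.
 */
static int decode_lea_rsp_disp8(const uint8_t insn[5], int *disp)
{
	if (insn[0] != 0x48 ||	/* REX.W: 64-bit operand size */
	    insn[1] != 0x8d ||	/* LEA opcode */
	    insn[2] != 0x64 ||	/* ModRM: mod=01 (disp8), reg=rsp, rm=SIB */
	    insn[3] != 0x24)	/* SIB: no index, base=rsp */
		return 0;

	/* Sign-extend the 8-bit displacement without relying on casts. */
	*disp = insn[4] < 0x80 ? insn[4] : (int)insn[4] - 256;
	return 1;
}
```

For the bytes 48 8d 64 24 80 this yields a displacement of -128, so the
return address pushed by the following call lands below the red zone.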

We use nop10 and the following transformation to the optimized
instructions above and back, as suggested by Peterz [2].

Optimize path (int3_update_optimize):

  1) Initial state after set_swbp() installed the uprobe:
      cc 2e 0f 1f 84 00 00 00 00 00

     From offset 0 this is INT3 followed by the tail of the original
     10-byte NOP.

  2) Trap the call slot before rewriting the NOP tail:
      cc 2e 0f 1f 84 [cc] 00 00 00 00

     From offset 0 this traps on the uprobe INT3.  A thread reaching
     offset 5 traps on the temporary INT3 instead of seeing a partially
     patched call.

  3) Rewrite the LEA tail and call displacement, keeping both INT3 bytes:
      cc [8d 64 24 80] cc [d0 d1 d2 d3]

     From offset 0 and offset 5 this still traps.  The bytes between
     them are not executable entry points while both traps are in place.

  4) Restore the call opcode at offset 5:
      cc 8d 64 24 80 [e8] d0 d1 d2 d3

     From offset 0 this still traps.  From offset 5 the instruction is
     the final CALL to the uprobe trampoline.

  5) Publish the first LEA byte:
      [48] 8d 64 24 80 e8 d0 d1 d2 d3

     From offset 0 this is:
        lea -0x80(%rsp), %rsp
        call <uprobe-trampoline>
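
The optimize steps above can be modeled on a plain byte buffer. This is
only a user-space sketch under stated assumptions: model_optimize() and
the macro names are hypothetical, the displacement bytes are the d0..d3
placeholders from the listing, and the cross-CPU synchronization the
kernel needs between steps is not modeled at all.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define OPT_INSN_SIZE	10	/* nop10, or lea + call */
#define OPT_CALL_OFFSET	5	/* offset of the CALL opcode */
#define OPT_INT3	0xcc
#define OPT_CALL	0xe8

/* lea -0x80(%rsp), %rsp */
static const uint8_t opt_lea[5] = { 0x48, 0x8d, 0x64, 0x24, 0x80 };

/*
 * Hypothetical model of the optimize path.  'text' must hold the state
 * left by set_swbp(): cc 2e 0f 1f 84 00 00 00 00 00.
 * A real implementation has to synchronize all CPUs between steps.
 */
static void model_optimize(uint8_t text[OPT_INSN_SIZE], const uint8_t disp[4])
{
	/* 1) The uprobe INT3 is already installed. */
	assert(text[0] == OPT_INT3);

	/* 2) Trap the call slot before touching the NOP tail. */
	text[OPT_CALL_OFFSET] = OPT_INT3;

	/* 3) Write the LEA tail and the call displacement. */
	memcpy(&text[1], &opt_lea[1], 4);
	memcpy(&text[OPT_CALL_OFFSET + 1], disp, 4);
	assert(text[0] == OPT_INT3 && text[OPT_CALL_OFFSET] == OPT_INT3);

	/* 4) Restore the call opcode at offset 5. */
	text[OPT_CALL_OFFSET] = OPT_CALL;

	/* 5) Publish the first LEA byte. */
	text[0] = opt_lea[0];
}
```

Starting from cc 2e 0f 1f 84 00 00 00 00 00 with displacement d0 d1 d2 d3,
the buffer ends up as 48 8d 64 24 80 e8 d0 d1 d2 d3, matching step 5.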

Unoptimize path (int3_update_unoptimize):

  1) Initial optimized state:
      48 8d 64 24 80 e8 d0 d1 d2 d3
     Same as 5) above.

  2) Trap new entries before restoring the NOP bytes:
      [cc] 8d 64 24 80 e8 d0 d1 d2 d3

     From offset 0 this traps. A thread that had already executed the
     LEA can still reach the intact CALL at offset 5.

  3) Restore bytes 1..4 of the original NOP while keeping byte 0 trapped
     and byte 5 as CALL.
      cc [2e 0f 1f 84] e8 d0 d1 d2 d3

     From offset 0 this still traps. Offset 5 is still the CALL for any
     thread that was already past the first LEA byte.

  4) Publish the first byte of the original NOP:
      [66] 2e 0f 1f 84 e8 d0 d1 d2 d3

     From offset 0 this is the restored 10-byte NOP; the CALL opcode and
     displacement are now only NOP operands.  Offset 5 still decodes as
     CALL for a thread that was already there.
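
The unoptimize steps can be modeled the same way. Again this is a
hypothetical user-space sketch (model_unoptimize() is not a function from
the patch) and it ignores the synchronization between steps. It mainly
shows that bytes 5..9 are never written on this path.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UNOPT_INSN_SIZE		10	/* nop10, or lea + call */
#define UNOPT_CALL_OFFSET	5	/* offset of the CALL opcode */
#define UNOPT_INT3		0xcc
#define UNOPT_CALL		0xe8

/* First five bytes of the nop10: cs nopw 0x0(%rax,%rax,1) */
static const uint8_t unopt_nop_head[5] = { 0x66, 0x2e, 0x0f, 0x1f, 0x84 };

/*
 * Hypothetical model of the unoptimize path.  'text' must hold the
 * optimized state: 48 8d 64 24 80 e8 <disp32>.
 * A real implementation has to synchronize all CPUs between steps.
 */
static void model_unoptimize(uint8_t text[UNOPT_INSN_SIZE])
{
	/* 1) Start from the optimized lea/call pair. */
	assert(text[UNOPT_CALL_OFFSET] == UNOPT_CALL);

	/* 2) Trap new entries at offset 0. */
	text[0] = UNOPT_INT3;

	/* 3) Restore NOP bytes 1..4; byte 0 stays trapped, byte 5 stays CALL. */
	memcpy(&text[1], &unopt_nop_head[1], 4);
	assert(text[0] == UNOPT_INT3 && text[UNOPT_CALL_OFFSET] == UNOPT_CALL);

	/* 4) Publish the first NOP byte; bytes 5..9 are left untouched. */
	text[0] = unopt_nop_head[0];
}
```

Starting from 48 8d 64 24 80 e8 d0 d1 d2 d3, the buffer ends up as
66 2e 0f 1f 84 e8 d0 d1 d2 d3, with the CALL opcode and displacement kept
as NOP operands.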

Note that, as explained in [2], we need to use the following nop10:
       PF1   PF2   ESC   NOPL  MOD   SIB   DISP32
NOP10: 0x66, 0x2e, 0x0f, 0x1f, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 -- cs nopw 0x00000000(%rax,%rax,1)

which means we need to allow the 0x2e prefix, which maps to the
INAT_PFX_CS attribute, in the is_prefix_bad function.

The optimized uprobe performance stays the same:

        uprobe-nop     :    3.129 ± 0.013M/s
        uprobe-push    :    3.045 ± 0.006M/s
        uprobe-ret     :    1.095 ± 0.004M/s
  -->   uprobe-nop10   :    7.170 ± 0.020M/s
        uretprobe-nop  :    2.143 ± 0.021M/s
        uretprobe-push :    2.090 ± 0.000M/s
        uretprobe-ret  :    0.942 ± 0.000M/s
  -->   uretprobe-nop10:    3.381 ± 0.003M/s
        usdt-nop       :    3.245 ± 0.004M/s
  -->   usdt-nop10     :    7.256 ± 0.023M/s
@@ -893,48 +918,134 @@ static int verify_insn(struct page *page, unsigned long vaddr, uprobe_opcode_t *
 }
 
 /*
+ * Modify the optimized instruction by using INT3 breakpoints on SMP.
  * We completely avoid using stop_machine() here, and achieve the
  * synchronization using INT3 breakpoints and SMP cross-calls.
  * (borrowed comment from smp_text_poke_batch_finish)
  *
+ * The way it is done for optimization (int3_update_optimize):
+ *   1) Start with the uprobe INT3 trap already installed
+ *   2) Add an INT3 trap to the call slot
+ *   3) Update everything but the first byte and the call opcode
+ *   4) Replace the call slot INT3 by the call opcode
+ *   5) Replace the first INT3 by the first byte of the LEA instruction
+ *
+ * The way it is done for unoptimization (int3_update_unoptimize):
+ *   1) Start with the optimized uprobe lea/call instructions
+ *   2) Add an INT3 trap to the address that will be patched
+ *   3) Restore the NOP bytes before the call opcode
+ *   4) Replace the first INT3 by the first byte of the NOP instruction
+ *
+ * Note that unoptimization deliberately keeps the call opcode and displacement
+ * in bytes 5..9. Those bytes become operands of the restored 10-byte NOP.
  */
One important thing to note is that (as earlier noted by Andrii) the
CALL address is never changed. A new optimization pass will not change
the CALL instruction again.

If you noted this anywhere, I failed to find it. This is crucially
important for the correctness of the scheme and should not be omitted.

That is, please add something like:

  "Since there is only a single uprobe-trampoline, the CALL instruction
  will not be changed across unoptimization/optimization cycles.
  Therefore, any task that is preempted at the CALL instruction is
  guaranteed to observe that CALL and not anything else."
nope I did not mention it, will add

thanks,
jirka