Thread: 22 messages, 5 authors, 2025-07-23

Re: [RFC] New codectl(2) system call for sframe registration

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Date: 2025-07-23 14:32:43
Also in: bpf, lkml

On 2025-07-23 04:16, Indu Bhagat wrote:
On 7/22/25 11:49 AM, Mathieu Desnoyers wrote:
On 2025-07-22 14:21, Indu Bhagat wrote:
On 7/21/25 8:20 AM, Mathieu Desnoyers wrote:
Hi!

I've written up an RFC for a new system call to handle sframe registration
for shared libraries. There has been interest in covering sframe in the
short term and JIT use-cases in the long term, so I'm covering both here
in this RFC to provide the full context. Implementation-wise, we could
start by only covering the sframe use-case.

I've called it "codectl(2)" for now, but I'm of course open to feedback.

For ELF, I'm including the optional pathname, build id, and debug link
information, which are really useful to translate from instruction pointers
to executable/library name, symbol, offset, source file, line number.
This is what we are using in LTTng-UST and Babeltrace debug-info filter
plugin [1], and I think this would be relevant for kernel tracers as well
so they can make the resulting stack traces meaningful to users.

sys_codectl(2)
=================

* arg0: unsigned int @option:

/* Additional labels can be added to enum code_opt, for extensibility. */

enum code_opt {
     CODE_REGISTER_ELF,
     CODE_REGISTER_JIT,
     CODE_UNREGISTER,
};

* arg1: void * @info

/* if (@option == CODE_REGISTER_ELF) */

/*
  * text_start, text_end, sframe_start, sframe_end allow unwinding of the
  * call stack.
  *
  * elf_start, elf_end, pathname, and either build_id or debug_link allow
  * mapping instruction pointers to file, symbol, offset, and source file
  * location.
  */
struct code_elf_info {
     __u64 elf_start;
     __u64 elf_end;
What are elf_start and elf_end intended for?
The intent is to know at which address the first loadable segment of
the shared object is mapped (elf_start), and the size of the shared
object mapping, which is the sum of the size of its PT_LOAD segments.

This allows tooling to easily look up which addresses belong to that
shared object, for any loaded segment, whether it's code or data.
     __u64 text_start;
     __u64 text_end;
     __u64 sframe_start;
     __u64 sframe_end;
     __u64 pathname;              /* char *, NULL if unavailable. */

     __u64 build_id;              /* char *, NULL if unavailable. */
     __u64 debug_link_pathname;   /* char *, NULL if unavailable. */
     __u32 build_id_len;
     __u32 debug_link_crc;
};
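
A minimal sketch, assuming glibc on Linux, of how a loader or runtime might derive the elf_start and elf_end values discussed in the reply above by walking the PT_LOAD segments of the object containing a given address. The names elf_range, elf_range_cb and elf_range_of are hypothetical, and treating elf_end as "one past the highest PT_LOAD address" is an assumption; the RFC text does not pin that down.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper state: range of the object containing `probe`. */
struct elf_range {
    uintptr_t probe;   /* in: any address inside the object of interest */
    uint64_t start;    /* out: lowest mapped PT_LOAD address */
    uint64_t end;      /* out: one past the highest PT_LOAD address (assumed) */
    int found;
};

/* dl_iterate_phdr() callback: called once per loaded object. */
static int elf_range_cb(struct dl_phdr_info *info, size_t size, void *data)
{
    struct elf_range *r = data;
    uint64_t lo = UINT64_MAX, hi = 0;

    (void)size;
    for (int i = 0; i < info->dlpi_phnum; i++) {
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        uint64_t seg_lo, seg_hi;

        if (ph->p_type != PT_LOAD)
            continue;
        seg_lo = info->dlpi_addr + ph->p_vaddr;
        seg_hi = seg_lo + ph->p_memsz;
        if (seg_lo < lo)
            lo = seg_lo;
        if (seg_hi > hi)
            hi = seg_hi;
    }
    if (lo <= r->probe && r->probe < hi) {
        r->start = lo;
        r->end = hi;
        r->found = 1;
        return 1;      /* non-zero stops the iteration */
    }
    return 0;
}

/* Returns 0 and fills start/end if addr belongs to a loaded object, else -1. */
static int elf_range_of(const void *addr, uint64_t *start, uint64_t *end)
{
    struct elf_range r = { .probe = (uintptr_t)addr };

    assert(start && end);
    dl_iterate_phdr(elf_range_cb, &r);
    if (!r.found)
        return -1;
    *start = r.start;
    *end = r.end;
    return 0;
}
```

The resulting pair would be what userspace stores into code_elf_info.elf_start and code_elf_info.elf_end before registering the object.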


/* if (@option == CODE_REGISTER_JIT) */

/*
  * Registration of sorted JIT unwind table: The reserved memory area is
  * of size reserved_len. Userspace increases used_len as new code is
  * populated between text_start and text_end. This area is populated in
  * increasing address order, and its ABI requires that no fre overlap.
  * This fits the common use-case where JITs populate code into
  * a given memory area by increasing address order. The sorted unwind
  * tables can be chained with a singly-linked list as they become full.
  * Consecutive chained tables are also in sorted text address order.
  *
  * Note: if there is an eventual use-case for unsorted JIT unwind tables,
  * this would be introduced as a new "code option".
  */

struct code_jit_info {
     __u64 text_start;      /* addr >= text_start */
     __u64 text_end;        /* addr < text_end */
     __u64 unwind_head;     /* struct code_jit_unwind_table * */
};
I see the discussion has evolved here with the general sentiment that
the JIT part needs to be kept in mind for a rough sketch but cannot
be designed at this time. But two comments (if we keep JIT part in
the discussion):
   - I think we need to keep __u64 unwind_head not a pointer to a
defined structure (struct code_jit_unwind_table * above), but some
opaque type like we have for SFrame case.
What is the reason for making this an opaque type for sframe?
So that the system call only does the work of registering the memory of
specific size as stack trace data for addr range (text_start, text_end).
IIUC, in the current proposal, the format of the stack trace
information is exposed in the arg. So when the format evolves, this will
mean additional management via some flags?
There are various ways to handle extensions here. The simplest
would be to add a new label to enum code_opt and register the extended
JIT ABI as a new option. But this would likely involve a lot of
duplication if the goal is just to add one more field to struct
code_jit_unwind_fre.

I suspect that the unwind table and the linked list of unwind tables are
something we won't want to change for this ABI. What I see could be
a relevant extension point is struct code_jit_unwind_fre, but given
that it will be placed into an array, making it extensible requires
some care: we'd need to keep track of its stride. We could do it like
this:

struct code_jit_unwind_table {
     __u64 reserved_len;
     __u64 used_len; /*
                      * Incremented by userspace (store-release), read by
                      * the kernel (load-acquire).
                      */
     __u64 next;     /* Chain with next struct code_jit_unwind_table. */
     __u32 fre_stride; /* Stride of fre array (includes padding). */
     __u32 fre_size;   /* Offset at end of last used field. */
     char fre[];
};

So extending struct code_jit_unwind_fre could be done by adding fields
at the end, thus potentially increasing its size or turning padding into
used fields. fre_size would keep track of the "used" fields.

I'm open to extending by size (with fre_size) or using flags, whatever
fits best.
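
To make the stride idea concrete, here is a rough sketch of how a reader could walk such a table. It is only an illustration under several assumptions that the proposal does not settle: used_len and reserved_len count bytes of the fre area, an entry covers [ip_off, ip_off + size), and fields beyond fre_size read as zero. fre_lookup and FRE_KNOWN_SIZE are hypothetical names, and the structs use <stdint.h> types so the sketch builds outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the extensible table layout sketched above. */
struct code_jit_unwind_table {
    uint64_t reserved_len;
    uint64_t used_len;     /* assumption: bytes of fre[] in use */
    uint64_t next;
    uint32_t fre_stride;   /* distance between consecutive entries */
    uint32_t fre_size;     /* bytes of each entry that carry fields */
    char fre[];
};

/* Entry layout as known to this (hypothetical) reader. */
struct code_jit_unwind_fre {
    uint32_t size;
    uint32_t ip_off;
    int32_t cfa_off;
    int32_t ra_off;
    int32_t fp_off;
    uint8_t info;
};

/* Offset at the end of the last field this reader understands. */
#define FRE_KNOWN_SIZE \
    (offsetof(struct code_jit_unwind_fre, info) + sizeof(uint8_t))

/* Returns 0 and fills *out if an entry covers ip_off, else -1. */
static int fre_lookup(struct code_jit_unwind_table *t, uint32_t ip_off,
                      struct code_jit_unwind_fre *out)
{
    /* Pairs with the producer's store-release on used_len. */
    uint64_t used = __atomic_load_n(&t->used_len, __ATOMIC_ACQUIRE);
    size_t n_copy = t->fre_size < FRE_KNOWN_SIZE ? t->fre_size
                                                  : FRE_KNOWN_SIZE;

    assert(out);
    if (t->fre_stride == 0 || t->fre_size > t->fre_stride ||
        used > t->reserved_len)
        return -1;

    for (uint64_t off = 0; off + t->fre_stride <= used;
         off += t->fre_stride) {
        struct code_jit_unwind_fre e;

        /* Fields the producer does not provide read as zero. */
        memset(&e, 0, sizeof(e));
        memcpy(&e, t->fre + off, n_copy);
        if (ip_off >= e.ip_off && ip_off - e.ip_off < e.size) {
            *out = e;
            return 0;
        }
    }
    return -1;
}
```

Because entries are sorted by address, a real reader would likely use a binary search instead of the linear walk shown here; the stride handling would stay the same.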

Thanks,

Mathieu
   - The reserved_len should ideally be a part of code_jit_info, so
the length can be known without parsing the contents.
I've placed reserved_len within the unwind table because I planned to
have the jit information for a given range of text be a linked list of
tables. Therefore, if one table fills up, then another table can be
chained at the tail. Having the reserved_len part of each table makes
things easier to combine into a linked list.

Thanks for your feedback!

Mathieu
struct code_jit_unwind_fre {
     /*
      * Contains info similar to sframe, allowing unwind for a given
      * code address range.
      */
     __u32 size;
     __u32 ip_off;  /* offset from text_start */
     __s32 cfa_off;
     __s32 ra_off;
     __s32 fp_off;
     __u8 info;
};

struct code_jit_unwind_table {
     __u64 reserved_len;
     __u64 used_len; /*
                      * Incremented by userspace (store-release),
                      * read by the kernel (load-acquire).
                      */
     __u64 next;     /* Chain with next struct code_jit_unwind_table. */
     struct code_jit_unwind_fre fre[];
};
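
As a sketch of the userspace (producer) side of the store-release protocol noted in the comment above, the following appends one entry to a table and publishes it by bumping used_len. It assumes that reserved_len and used_len count bytes of the fre area, that an entry covers [ip_off, ip_off + size), and that the next pointer is also published with a store-release; none of this is fixed by the RFC. jit_table_append and jit_table_chain are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct code_jit_unwind_fre {
    uint32_t size;
    uint32_t ip_off;   /* offset from text_start */
    int32_t cfa_off;
    int32_t ra_off;
    int32_t fp_off;
    uint8_t info;
};

struct code_jit_unwind_table {
    uint64_t reserved_len;  /* assumption: bytes available in fre[] */
    uint64_t used_len;      /* assumption: bytes of fre[] published */
    uint64_t next;
    struct code_jit_unwind_fre fre[];
};

/*
 * Append one entry. Returns 0 on success, -1 if the table is full or the
 * entry would break the sorted, non-overlapping order required by the ABI.
 */
static int jit_table_append(struct code_jit_unwind_table *t,
                            const struct code_jit_unwind_fre *fre)
{
    /* Only the producer writes used_len, so a relaxed load suffices. */
    uint64_t used = __atomic_load_n(&t->used_len, __ATOMIC_RELAXED);
    uint64_t idx = used / sizeof(*fre);

    assert(fre);
    if (used + sizeof(*fre) > t->reserved_len)
        return -1;  /* full: caller chains a new table */
    if (idx > 0) {
        const struct code_jit_unwind_fre *last = &t->fre[idx - 1];

        if ((uint64_t)last->ip_off + last->size > fre->ip_off)
            return -1;  /* out of order or overlapping */
    }
    /* Write the entry first, then publish it with a store-release. */
    t->fre[idx] = *fre;
    __atomic_store_n(&t->used_len, used + sizeof(*fre), __ATOMIC_RELEASE);
    return 0;
}

/* Chain a fresh table at the tail once the current one is full. */
static void jit_table_chain(struct code_jit_unwind_table *full,
                            struct code_jit_unwind_table *fresh)
{
    __atomic_store_n(&full->next, (uint64_t)(uintptr_t)fresh,
                     __ATOMIC_RELEASE);
}
```

A reader that load-acquires used_len is then guaranteed to see fully written entries below that offset, which is the point of the store-release/load-acquire pairing.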

/* if (@option == CODE_UNREGISTER) */

void *info

* arg2: size_t info_size

/*
  * Size of @info structure, allowing extensibility. See
  * copy_struct_from_user().
  */
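
For readers unfamiliar with the size-based extensibility that copy_struct_from_user() provides, here is a small userspace model of its documented semantics. It is not the kernel implementation: copy_struct_model is a hypothetical name, and plain memory copies stand in for the user-to-kernel copy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace model of copy_struct_from_user() semantics:
 *  - usize == ksize: plain copy.
 *  - usize <  ksize: older userspace; missing trailing fields read as zero.
 *  - usize >  ksize: newer userspace; accepted only if every byte beyond
 *    ksize is zero, otherwise -E2BIG.
 */
static int copy_struct_model(void *dst, size_t ksize,
                             const void *src, size_t usize)
{
    size_t size = usize < ksize ? usize : ksize;
    const unsigned char *s = src;

    assert(dst && src);
    /* Reject non-zero fields that this "kernel" does not know about. */
    for (size_t i = ksize; i < usize; i++)
        if (s[i] != 0)
            return -E2BIG;

    /* Zero-fill first so fields userspace did not pass read as zero. */
    memset(dst, 0, ksize);
    memcpy(dst, src, size);
    return 0;
}
```

With this scheme, new fields can be appended to the info structures later: old binaries keep working against new kernels, and new binaries work against old kernels as long as they leave the new fields zeroed.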

* arg3: unsigned int flags (0)

/* Flags for extensibility. */

Your feedback is welcome,

Thanks,

Mathieu

[1] https://babeltrace.org/docs/v2.0/man7/babeltrace2-filter.lttng-utils.debug-info.7/

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com