Re: [PATCH v2 1/2] tracing: Add task_prctl_unknown tracepoint

From: Marco Elver <elver@google.com>
Date: 2024-11-07 15:58:16
Also in: lkml

On Thu, 7 Nov 2024 at 16:54, Mathieu Desnoyers
[off-list ref] wrote:

On 2024-11-07 10:46, Marco Elver wrote:

On Thu, 7 Nov 2024 at 16:45, Mathieu Desnoyers
[off-list ref] wrote:

On 2024-11-07 07:25, Marco Elver wrote:

prctl() is a complex syscall which multiplexes its functionality based
on a large set of PR_* options. Currently we count 64 such options. The
return value for unknown options is -EINVAL, which is indistinguishable
from known options that were passed invalid args and also return -EINVAL.

To understand if programs are attempting to use prctl() options not yet
available on the running kernel, provide the task_prctl_unknown
tracepoint.

Note, this tracepoint is in an unlikely cold path, and would therefore
be suitable for continuous monitoring (e.g. via perf_event_open).

While the above is likely the simplest use case, additionally this
tracepoint can help unlock some testing scenarios (where probing
sys_enter or sys_exit causes undesirable performance overheads):

    a. unprivileged triggering of a test module: test modules may register a
       probe to be called back on task_prctl_unknown, and pick a very large
       unknown prctl() option upon which they perform a test function for an
       unprivileged user;

    b. unprivileged triggering of an eBPF program function: similar
       to idea (a).

Example trace_pipe output:

    test-484     [000] .....   631.748104: task_prctl_unknown: comm=test option=1234 arg2=101 arg3=102 arg4=103 arg5=104

My concern is that we start adding tons of special-case
tracepoints to the implementation of system calls which
are redundant with the sys_enter/exit tracepoints.

Why favor this approach rather than hooking on sys_enter/exit?

It's __extremely__ expensive when deployed at scale. See note in
commit description above.

I suspect you base the overhead analysis on the x86-64 implementation
of the sys_enter/exit tracepoints and especially the overhead caused by
the SYSCALL_WORK_SYSCALL_TRACEPOINT thread flag, am I correct?

If that is causing too large an overhead, we should investigate if
those can be improved instead of adding tracepoints in the
implementation of system calls.

Doing that may be generally useful, but even if you improve it
somehow, there's always some additional bit of work needed on
sys_enter/exit as soon as a tracepoint is attached. Even if that's
just a few cycles, it's too much (for me at least).

Also: if you just hook sys_enter/exit, you don't know if the prctl was
handled or not by inspecting the return code (-EINVAL). I want the
kernel to tell me if it handled the prctl() or not, and I also think
it's very bad design to copy-paste the prctl() option checking of the
running kernel in a sys_enter/exit hook. This doesn't scale in terms
of either performance or maintainability.
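
To illustrate the -EINVAL ambiguity discussed above, here is a minimal
userspace sketch. It is not part of the patch. The helper names
prctl_errno() and show_einval_ambiguity() are hypothetical, option 1234
is taken from the example trace and assumed to be unassigned on the
running kernel, and PR_SET_DUMPABLE with an out-of-range arg2 stands in
for "known option, invalid args".

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/*
 * Hypothetical helper: call prctl() with the given option and arg2
 * (remaining args zero) and return 0 on success or errno on failure.
 */
int prctl_errno(int option, unsigned long arg2)
{
    errno = 0;
    if (prctl(option, arg2, 0UL, 0UL, 0UL) == -1)
        return errno;
    return 0;
}

/* Hypothetical helper: print the error for both cases side by side. */
void show_einval_ambiguity(void)
{
    /* Assumption: 1234 is not an assigned PR_* option on this kernel. */
    int unknown = prctl_errno(1234, 0);

    /* Known option; PR_SET_DUMPABLE only accepts 0 or 1 as arg2. */
    int bad_arg = prctl_errno(PR_SET_DUMPABLE, 12345);

    printf("unknown option:         %s\n", strerror(unknown));
    printf("known option, bad arg2: %s\n", strerror(bad_arg));
}
```

Calling show_einval_ambiguity() from main() should report the same
error for both calls, so userspace alone cannot tell which case it hit.
That is the gap a sys_enter/exit hook would have to close by duplicating
the kernel's option table.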
