Re: [PATCH] tracing/timerlat: Check tlat_var for NULL in timerlat_fd_release

From: Luis Claudio R. Goncalves <hidden>
Date: 2024-08-30 00:31:23
Also in: lkml

On Thu, Aug 29, 2024 at 08:40:25PM -0300, Luis Claudio R. Goncalves wrote:

On Tue, Aug 27, 2024 at 04:34:39PM +0200, Tomas Glozar wrote:

quoted

po 26. 8. 2024 v 19:27 odesílatel Steven Rostedt [off-list ref] napsal:

quoted

Yeah, I think I finally found the real issue. I don't think we need the ref
counting. The problem is the creating and killing of the threads via the
start and stop callbacks. That's not their purpose. The purpose of stop
and start callbacks is when tracing_on is set to off and back on again. I
think this is what is racing with the close.

I believe I discovered the exact mechanism of the bug. Since rtla is
killed with SIGTERM, it leaves the timerlat tracer active and its
timerlatu processes running. The rtla from the next iteration of the
while loop in the reproducer deactivates the timerlat tracer,
triggering the killing of the timerlatu processes. However, this
killing is asynchronous, and while the processes are gradually dying
one by one and closing their fds, the rtla from the next iteration is
already activating the timerlat tracer again. One of the timerlatu
processes does not die fast enough to release the fd before the
tlat_var is cleared in osnoise_workload_start, triggering the panic.

With this in mind I tried a simpler approach. I used a kernel without any
extra patches from Steven or ourselves, just used the code below. Not
saying it is a definitive solution, but as a proof of concept I have two
boxes running the loop used to reproduce the problem and so far not a
single problem. The patch:

Premature commemoration. One of the boxes rebooted a minute ago. Sounds
like I just diminished a bit the window for triggering the problem.
Tomorrow I will run tests with all the suggested patches and work of a
better idea, more comprehensive, to address what Tomas described above.

quoted hunk ↗ jump to hunk

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index 66a871553d4a1..331c69bc7b0d4 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c

@@ -2239,8 +2239,11 @@ static ssize_t osnoise_options_write(struct file *filp, const char __user *ubuf,
 	 */
 	mutex_lock(&trace_types_lock);
 	running = osnoise_has_registered_instances();
-	if (running)
+	if (running) {
 		stop_per_cpu_kthreads();
+		mutex_unlock(&trace_types_lock);
+		return -EBUSY;
+	}
 
 	mutex_lock(&interface_lock);
 	/*

@@ -2579,7 +2582,8 @@ static int timerlat_fd_release(struct inode *inode, struct file *file)
 	osn_var = per_cpu_ptr(&per_cpu_osnoise_var, cpu);
 	tlat_var = per_cpu_ptr(&per_cpu_timerlat_var, cpu);
 
-	hrtimer_cancel(&tlat_var->timer);
+	if (tlat_var->kthread)
+		hrtimer_cancel(&tlat_var->timer);
 	memset(tlat_var, 0, sizeof(*tlat_var));
 
 	osn_var->sampling = 0;
---

In short, do not start new instances if the old one still registered. And
only cancel the timer if the timerlat per cpu var has not been zeroed-out.

Again, not a final patch but a simpler approach to protect start/stop
races.

Luis

quoted

You can see the bug in action in this detailed trace I got today, with
the use of in-kernel trace_printk together with bpftrace and
bpf_trace_printk and captured using ftrace_dump_on_oops (running the
same reproducer on a 8 CPU KVM system):
[  103.227679]     rtla-2325      2..... 98242607us :
osnoise_workload_start: tlat_var_reset: cpu=0,
tlat_var=0000000034996efa
[  103.227844]     rtla-2325      2..... 98242607us :
osnoise_workload_start: tlat_var_reset: cpu=1,
tlat_var=0000000062d1f520
[  103.228009]     rtla-2325      2..... 98242607us :
osnoise_workload_start: tlat_var_reset: cpu=2,
tlat_var=0000000002e262d8
[  103.228173]     rtla-2325      2..... 98242607us :
osnoise_workload_start: tlat_var_reset: cpu=3,
tlat_var=00000000b9d3a280
[  103.228349]     rtla-2325      2..... 98242607us :
osnoise_workload_start: tlat_var_reset: cpu=4,
tlat_var=000000000fba7a6e
[  103.228510]     rtla-2325      2..... 98242608us :
osnoise_workload_start: tlat_var_reset: cpu=5,
tlat_var=000000008ed0b87f
[  103.228670]     rtla-2325      2..... 98242608us :
osnoise_workload_start: tlat_var_reset: cpu=6,
tlat_var=00000000a02fe267
[  103.228831]     rtla-2325      2..... 98242608us :
osnoise_workload_start: tlat_var_reset: cpu=7,
tlat_var=000000008e1d5095
[  103.228992]     rtla-2327      3...11 98256551us :
bpf_trace_printk: rtla pid 2325 forks into 2328
[  103.229156]     rtla-2327      3...11 98256873us :
bpf_trace_printk: rtla pid 2325 forks into 2329
[  103.229325]     rtla-2327      3...11 98257124us :
bpf_trace_printk: rtla pid 2325 forks into 2330
[  103.229489]     rtla-2327      3...11 98257353us :
bpf_trace_printk: rtla pid 2325 forks into 2331
[  103.229651]     rtla-2327      3...11 98257574us :
bpf_trace_printk: rtla pid 2325 forks into 2332
[  103.229810]     rtla-2327      3...11 98257801us :
bpf_trace_printk: rtla pid 2325 forks into 2333
[  103.229968]     rtla-2327      3...11 98258008us :
bpf_trace_printk: rtla pid 2325 forks into 2334
[  103.230124]     rtla-2327      3...11 98258216us :
bpf_trace_printk: rtla pid 2325 forks into 2335
[  103.230293]     bash-2094      0d..31 102987914us :
bpf_trace_printk: rtla pid 2325 received signal 2
[  103.230452]     bash-2094      0d..31 102987915us :
bpf_trace_printk: ...from bash pid 2094
[  103.230609]     bash-2094      0d..31 102990000us :
bpf_trace_printk: rtla pid 2325 received signal 15
[  103.230766]     bash-2094      0d..31 102990001us :
bpf_trace_printk: ...from bash pid 2094
[  103.230922]     rtla-2325      2...21 102990010us :
bpf_trace_printk: rtla pid 2325 exiting
[  103.231079]     rtla-2327      3...21 102990503us :
bpf_trace_printk: rtla pid 2327 exiting
[  103.231242]     rtla-2337      3d..31 103232021us :
bpf_trace_printk: timerlatu/0 pid 2328 received signal 9
[  103.231399]     rtla-2337      3d..31 103232022us :
bpf_trace_printk: ...from rtla pid 2337
[  103.231554]     rtla-2337      3d..31 103232024us :
bpf_trace_printk: timerlatu/1 pid 2329 received signal 9
[  103.231711]     rtla-2337      3d..31 103232025us :
bpf_trace_printk: ...from rtla pid 2337
[  103.231866]     rtla-2337      3d..31 103232029us :
bpf_trace_printk: timerlatu/2 pid 2330 received signal 9
[  103.232022]     rtla-2337      3d..31 103232030us :
bpf_trace_printk: ...from rtla pid 2337
[  103.232178]     rtla-2337      3dN.31 103232031us :
bpf_trace_printk: timerlatu/3 pid 2331 received signal 9
[  103.232347]     rtla-2337      3dN.31 103232031us :
bpf_trace_printk: ...from rtla pid 2337
[  103.232506]     rtla-2337      3dN.31 103232034us :
bpf_trace_printk: timerlatu/4 pid 2332 received signal 9
[  103.232685]     rtla-2337      3dN.31 103232034us :
bpf_trace_printk: ...from rtla pid 2337
[  103.232846]     rtla-2337      3dN.31 103232039us :
bpf_trace_printk: timerlatu/5 pid 2333 received signal 9
[  103.233010]     rtla-2337      3dN.31 103232039us :
bpf_trace_printk: ...from rtla pid 2337
[  103.233171]     rtla-2337      3dN.31 103232042us :
bpf_trace_printk: timerlatu/6 pid 2334 received signal 9
[  103.233357]     rtla-2337      3dN.31 103232042us :
bpf_trace_printk: ...from rtla pid 2337
[  103.233517]     rtla-2337      3dN.31 103232046us :
bpf_trace_printk: timerlatu/7 pid 2335 received signal 9
[  103.233678]     rtla-2337      3dN.31 103232046us :
bpf_trace_printk: ...from rtla pid 2337
[  103.233835] timerlat-2331      3...21 103232701us :
bpf_trace_printk: timerlatu/3 pid 2331 exiting
[  103.233995] timerlat-2329      1...21 103232704us :
bpf_trace_printk: timerlatu/1 pid 2329 exiting
[  103.234155] timerlat-2331      3....1 103232705us :
timerlat_fd_release: timerlat_fd_release: cpu=3,
tlat_var=000000000fba7a6e, osn_var->pid=2331
[  103.234357] timerlat-2329      1....1 103232706us :
timerlat_fd_release: timerlat_fd_release: cpu=1,
tlat_var=0000000002e262d8, osn_var->pid=2329
[  103.234508] timerlat-2334      6...21 103232821us :
bpf_trace_printk: timerlatu/6 pid 2334 exiting
[  103.234657] timerlat-2334      6....1 103232826us :
timerlat_fd_release: timerlat_fd_release: cpu=6,
tlat_var=000000008e1d5095, osn_var->pid=2334
[  103.234813] timerlat-2330      2...21 103232850us :
bpf_trace_printk: timerlatu/2 pid 2330 exiting
[  103.234966] timerlat-2330      2....1 103232853us :
timerlat_fd_release: timerlat_fd_release: cpu=2,
tlat_var=00000000b9d3a280, osn_var->pid=2330
[  103.235122] timerlat-2333      5...21 103232856us :
bpf_trace_printk: timerlatu/5 pid 2333 exiting
[  103.235291] timerlat-2333      5....1 103232858us :
timerlat_fd_release: timerlat_fd_release: cpu=5,
tlat_var=00000000a02fe267, osn_var->pid=2333
[  103.235452] timerlat-2332      4...21 103232861us :
bpf_trace_printk: timerlatu/4 pid 2332 exiting
[  103.235610] timerlat-2332      4....1 103232864us :
timerlat_fd_release: timerlat_fd_release: cpu=4,
tlat_var=000000008ed0b87f, osn_var->pid=2332
[  103.235769] timerlat-2335      7...21 103232867us :
bpf_trace_printk: timerlatu/7 pid 2335 exiting
[  103.235932] timerlat-2335      7....1 103232869us :
timerlat_fd_release: timerlat_fd_release: cpu=7,
tlat_var=000000003113e297, osn_var->pid=2335
[  103.236103]     rtla-2337      3..... 103234109us :
osnoise_workload_start: tlat_var_reset: cpu=0,
tlat_var=0000000034996efa
[  103.236316]     rtla-2337      3..... 103234109us :
osnoise_workload_start: tlat_var_reset: cpu=1,
tlat_var=0000000062d1f520
[  103.236473]     rtla-2337      3..... 103234109us :
osnoise_workload_start: tlat_var_reset: cpu=2,
tlat_var=0000000002e262d8
[  103.236630]     rtla-2337      3..... 103234110us :
osnoise_workload_start: tlat_var_reset: cpu=3,
tlat_var=00000000b9d3a280
[  103.236797]     rtla-2337      3..... 103234110us :
osnoise_workload_start: tlat_var_reset: cpu=4,
tlat_var=000000000fba7a6e
[  103.236972]     rtla-2337      3..... 103234110us :
osnoise_workload_start: tlat_var_reset: cpu=5,
tlat_var=000000008ed0b87f
[  103.237144]     rtla-2337      3..... 103234110us :
osnoise_workload_start: tlat_var_reset: cpu=6,
tlat_var=00000000a02fe267
[  103.237345]     rtla-2337      3..... 103234110us :
osnoise_workload_start: tlat_var_reset: cpu=7,
tlat_var=000000008e1d5095
[  103.237516] timerlat-2328      0...21 103234149us :
bpf_trace_printk: timerlatu/0 pid 2328 exiting
[  103.237690] timerlat-2328      0....1 103234152us :
timerlat_fd_release: timerlat_fd_release: cpu=0,
tlat_var=0000000062d1f520, osn_var->pid=0
[  103.237871] ---------------------------------
[  103.238053] CR2: 0000000000000010

On the kernel side, I'm using similar trace_printk debug prints as you
did; the bpftrace program used is this:
uretprobe:libc:fork
/strncmp(comm, "rtla", 5) == 0 && retval != 0/
{
        debugf("rtla pid %d forks into %d", pid, retval)
}
tracepoint:signal:signal_generate
/strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm,
"timerlatu", 9) == 0/ {
        debugf("%s pid %d received signal %d", args->comm, args->pid,
args->sig);
        debugf("...from %s pid %d", comm, pid);
}
tracepoint:sched:sched_process_exit
/strncmp(args->comm, "rtla", 5) == 0 || strncmp(args->comm,
"timerlatu", 9) == 0/ {
        debugf("%s pid %d exiting", args->comm, args->pid)
}

quoted

Anyway, the start and stop should probably just pause the threads and not
kill them an start them again. That is, the osnoise_workload_start() should
be called by the init callbacks and the osnoise_workload_stop should be
called by reset callback.

The start and stop callbacks should just pause and restart the the threads.

I'm not sure if that fits into the design of timerlat. At least for
timerlat with user workload, each activation of the timerlat tracer
implies the creation of a new user workload, not the re-activation of
a lingering one (that might be an entirely different program). A
timerlat tracer is tightly associated with its threads: if the threads
exit, the timer itself ceases to have meaning. Of course also rtla
should make sure there are no lingering timerlatu processes.

I think waiting on the threads to actually exit in stop_kthread() is
the proper solution:

/*
 * stop_kthread - stop a workload thread
 */
static void stop_kthread(unsigned int cpu){
    struct task_struct *kthread;

    kthread = per_cpu(per_cpu_osnoise_var, cpu).kthread;
    if (kthread) {
        if (test_bit(OSN_WORKLOAD, &osnoise_options)) {
            kthread_stop(kthread);
        } else {
            /*
             * This is a user thread waiting on the timerlat_fd. We need
             * to close all users, and the best way to guarantee this is
             * by killing the thread. NOTE: this is a purpose specific file.
             */
            kill_pid(kthread->thread_pid, SIGKILL, 1);
            /* ^^ here wait until kthread (actually the user workload) exits */
            put_task_struct(kthread);
        }
        per_cpu(per_cpu_osnoise_var, cpu).kthread = NULL;
    }

There might even be another bug, not sure if calling put_task_struct
on a task that is still exiting is sane.

Tomas

---end quoted text---

---end quoted text---

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help