Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on... | bpf

Re: BPF trampolines break because of hang in synchronize_rcu_tasks() on PREEMPT kernels

From: Toke Høiland-Jørgensen <hidden>
Date: 2021-03-25 21:14:18

"Paul E. McKenney" [off-list ref] writes:

On Wed, Mar 24, 2021 at 08:17:35PM +0100, Toke Høiland-Jørgensen wrote:

quoted

"Paul E. McKenney" [off-list ref] writes:

quoted

On Wed, Mar 24, 2021 at 12:33:47PM +0100, Toke Høiland-Jørgensen wrote:

quoted

"Paul E. McKenney" [off-list ref] writes:

quoted

On Tue, Mar 23, 2021 at 11:06:04PM +0100, Toke Høiland-Jørgensen wrote:

quoted

"Paul E. McKenney" [off-list ref] writes:

quoted

On Tue, Mar 23, 2021 at 10:04:50PM +0100, Toke Høiland-Jørgensen wrote:

quoted

Andrii Nakryiko [off-list ref] writes:

quoted

On Tue, Mar 23, 2021 at 12:52 PM Toke Høiland-Jørgensen [off-list ref] wrote:

quoted

"Paul E. McKenney" [off-list ref] writes:

quoted

On Tue, Mar 23, 2021 at 06:29:35PM +0100, Toke Høiland-Jørgensen wrote:

quoted

"Paul E. McKenney" [off-list ref] writes:

quoted

On Tue, Mar 23, 2021 at 01:26:36PM +0100, Toke Høiland-Jørgensen wrote:

quoted

Hi Paul

Magnus and I have been debugging an issue where close() on a bpf_link
file descriptor would hang indefinitely when the system was under load
on a kernel compiled with CONFIG_PREEMPT=y, and it seems to be related
to synchronize_rcu_tasks(), so I'm hoping you can help us with it.

The issue is triggered reliably by loading up a system with network
traffic (causing 100% softirq CPU load on one or more cores), and then
attaching an freplace bpf_link and closing it again. The close() will
hang until the network traffic load is lowered.

Digging further, it appears that the hang happens in
synchronize_rcu_tasks(), as seen by running a bpftrace script like:

bpftrace -e 'kprobe:synchronize_rcu_tasks { @start = nsecs; printf("enter\n"); } kretprobe:synchronize_rcu_tasks { printf("exit after %d ms\n", (nsecs - @start) / 1000000); }'
Attaching 2 probes...
enter
exit after 54 ms
enter
exit after 3249 ms

(the two enter/exit pairs are, respectively, from an unloaded system,
and from a loaded system where I stopped the network traffic after a
couple of seconds).

The call to synchronize_rcu_tasks() happens in bpf_trampoline_put():

https://elixir.bootlin.com/linux/latest/source/kernel/bpf/trampoline.c#L376

And because it does this while holding trampoline_mutex, even deferring
the put to a worker (as a previously applied-then-reverted patch did[0])
doesn't help: that'll fix the initial hang on close(), but any
subsequent use of BPF trampolines will then be blocked because of the
mutex.

Also, if I just keep the network traffic running I will eventually get a
kernel panic with:

kernel:[44348.426312] Kernel panic - not syncing: hung_task: blocked tasks

I've created a reproducer for the issue here:
https://github.com/xdp-project/bpf-examples/tree/master/bpf-link-hang

To compile simply do this (needs a recent llvm/clang for compiling the BPF program):

$ git clone --recurse-submodules https://github.com/xdp-project/bpf-examples
$ cd bpf-examples/bpf-link-hang
$ make
$ ./sudo bpf-link-hang

you'll need to load up the system to trigger the hang; I'm using pktgen
from a separate machine to do this.

My question is, of course, as ever, What Is To Be Done? Is it expected
that synchronize_rcu_tasks() can hang indefinitely on a PREEMPT system,
or can this be fixed? And if it is expected, how can the BPF code be
fixed so it doesn't deadlock because of this?

Hoping you can help us with this - many thanks in advance! :)

Let me start with the usual question...  Is the network traffic intense
enough that one of the CPUs might remain in a loop handling softirqs
indefinitely?

Yup, I'm pegging all CPUs in softirq:

$ mpstat -P ALL 1
[...]
18:26:52     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
18:26:53     all    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       1    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       2    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       3    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       4    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00
18:26:53       5    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00    0.00

quoted

If so, does the (untested, probably does not build) patch below help?

Doesn't appear to, no. It builds fine, but I still get:

Attaching 2 probes...
enter
exit after 8480 ms

(that was me interrupting the network traffic again)

Is your kernel properly shifting from back-of-interrupt softirq processing
to ksoftirqd under heavy load?  If not, my patch will not have any
effect.

Seems to be - this is from top:

     12 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/0
     24 root      20   0       0      0      0 R  99.3   0.0   0:43.62 ksoftirqd/2
     34 root      20   0       0      0      0 R  99.3   0.0   0:43.64 ksoftirqd/4
     39 root      20   0       0      0      0 R  99.3   0.0   0:43.65 ksoftirqd/5
     19 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/1
     29 root      20   0       0      0      0 R  99.0   0.0   0:43.63 ksoftirqd/3

Any other ideas? :)

bpf_trampoline_put() got significantly changed by e21aa341785c ("bpf:
Fix fexit trampoline. "), it doesn't do synchronize_rcu_tasks()
anymore. Please give it a try. It's in bpf tree.

Ah! I had missed that patch, and only tested this on bpf-next. Yes, that
indeed works better; awesome!

And sorry for bothering you with this, Paul; guess I should have looked
harder for fixes first... :/

Glad it is now working!

And in any case, my patch needed an s/true/false/.  :-/

Hey, I did say "untested"!  ;-)

Haha, right, well at least you run afoul of the 'truth in advertising'
committee ;)

If you get a chance, could you please test the (hopefully) corrected
patch shown below?  This issue might affect other use cases.

Yup, that does seem to help:

Attaching 2 probes...
enter
exit after 136 ms

Thank you very much!  May I please apply your Tested-by?

Sure!

Tested-by: Toke Høiland-Jørgensen <redacted>

Applied, and thank you!

Awesome! You're welcome, and thank you for the fix (and the quick turnaround)! :)

-Toke

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help