Re: [PATCH] net: raise RCU qs after each threaded NAPI poll

From: "Paul E. McKenney" <paulmck@kernel.org>
Date: 2024-02-27 22:58:48
Also in: bpf, lkml, rcu

On Tue, Feb 27, 2024 at 03:22:57PM -0600, Yan Zhai wrote:

On Tue, Feb 27, 2024 at 12:32 PM Paul E. McKenney [off-list ref] wrote:

quoted

On Tue, Feb 27, 2024 at 05:44:17PM +0100, Eric Dumazet wrote:

quoted

On Tue, Feb 27, 2024 at 4:44 PM Yan Zhai [off-list ref] wrote:

quoted

We noticed task RCUs being blocked when threaded NAPIs are very busy in
production: detaching any BPF tracing programs, i.e. removing a ftrace
trampoline, will simply block for very long in rcu_tasks_wait_gp. This
ranges from hundreds of seconds to even an hour, severely harming any
observability tools that rely on BPF tracing programs. It can be
easily reproduced locally with following setup:

ip netns add test1
ip netns add test2

ip -n test1 link add veth1 type veth peer name veth2 netns test2

ip -n test1 link set veth1 up
ip -n test1 link set lo up
ip -n test2 link set veth2 up
ip -n test2 link set lo up

ip -n test1 addr add 192.168.1.2/31 dev veth1
ip -n test1 addr add 1.1.1.1/32 dev lo
ip -n test2 addr add 192.168.1.3/31 dev veth2
ip -n test2 addr add 2.2.2.2/31 dev lo

ip -n test1 route add default via 192.168.1.3
ip -n test2 route add default via 192.168.1.2

for i in `seq 10 210`; do
 for j in `seq 10 210`; do
    ip netns exec test2 iptables -I INPUT -s 3.3.$i.$j -p udp --dport 5201
 done
done

ip netns exec test2 ethtool -K veth2 gro on
ip netns exec test2 bash -c 'echo 1 > /sys/class/net/veth2/threaded'
ip netns exec test1 ethtool -K veth1 tso off

Then run an iperf3 client/server and a bpftrace script can trigger it:

ip netns exec test2 iperf3 -s -B 2.2.2.2 >/dev/null&
ip netns exec test1 iperf3 -c 2.2.2.2 -B 1.1.1.1 -u -l 1500 -b 3g -t 100 >/dev/null&
bpftrace -e 'kfunc:__napi_poll{@=count();} interval:s:1{exit();}'

Above reproduce for net-next kernel with following RCU and preempt
configuraitons:

# RCU Subsystem
CONFIG_TREE_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem
# RCU Debugging
# CONFIG_RCU_SCALE_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_REF_SCALE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=21
CONFIG_RCU_EXP_CPU_STALL_TIMEOUT=0
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# end of RCU Debugging

CONFIG_PREEMPT_BUILD=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_RCU=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set

An interesting observation is that, while tasks RCUs are blocked,
related NAPI thread is still being scheduled (even across cores)
regularly. Looking at the gp conditions, I am inclining to cond_resched
after each __napi_poll being the problem: cond_resched enters the
scheduler with PREEMPT bit, which does not account as a gp for tasks
RCUs. Meanwhile, since the thread has been frequently resched, the
normal scheduling point (no PREEMPT bit, accounted as a task RCU gp)
seems to have very little chance to kick in. Given the nature of "busy
polling" program, such NAPI thread won't have task->nvcsw or task->on_rq
updated (other gp conditions), the result is that such NAPI thread is
put on RCU holdouts list for indefinitely long time.

This is simply fixed by mirroring the ksoftirqd behavior: after
NAPI/softirq work, raise a RCU QS to help expedite the RCU period. No
more blocking afterwards for the same setup.

Fixes: 29863d41bb6e ("net: implement threaded-able napi poll loop support")
Signed-off-by: Yan Zhai <redacted>
---
 net/core/dev.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 275fd5259a4a..6e41263ff5d3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c

@@ -6773,6 +6773,10 @@ static int napi_threaded_poll(void *data)
                                net_rps_action_and_irq_enable(sd);
                        }
                        skb_defer_free_flush(sd);

Please put a comment here stating that RCU readers cannot cross
this point.

I need to add lockdep to rcu_softirq_qs() to catch placing this in an
RCU read-side critical section.  And a header comment noting that from
an RCU perspective, it acts as a momentary enabling of preemption.

Just to clarify, do you mean I should state that this polling function
can not be called from within an RCU read critical section? Or do you
mean any read critical sections need to end before raising this QS?

Yes to both.

I am preparing a patch to make lockdep complain if you do something
like this:

	rcu_read_lock();
	do_something();
	rcu_softirq_qs();
	do_something_else();
	rcu_read_unlock();

However, it will still be perfectly legal to do something like this:

	local_bh_disable();
	do_something();
	rcu_softirq_qs();  // A
	do_something_else();
	local_bh_enable();  // B

Which might surprise someone expection a synchronize_rcu() to wait
for execution to reach B.  Because of that rcu_softirq_qs(), that
synchronize_rcu() could instead return as soon as execution reached A.

So I am adding this example to a new kernel-doc header for
rcu_softirq_qs().

						Thanx, Paul

Yan

quoted

+                       if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+                               rcu_softirq_qs();
+
                        local_bh_enable();

                        if (!repoll)
--
2.30.2

Hmm....
Why napi_busy_loop() does not have a similar problem ?

It is unclear why rcu_all_qs() in __cond_resched() is guarded by

#ifndef CONFIG_PREEMPT_RCU
     rcu_all_qs();
#endif

The theory is that PREEMPT_RCU kernels have preemption, and get their
quiescent states that way.

The more recent practice involves things like PREEMPT_DYNAMIC and maybe
soon PREEMPT_AUTO, which might require adjustments, so thank you for
pointing this out!

Back on the patch, my main other concern is that someone somewhere might
be using something like synchronize_rcu() to wait for all in-progress
softirq handlers to complete.  But I don't know of such a thing, and if
there is, there are workarounds, including synchronize_rcu_tasks().

So something to be aware of, not (as far as I know) something to block
this commit.

With the added comment:

Acked-by: Paul E. McKenney <paulmck@kernel.org>

                                                        Thanx, Paul

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help