Re: [RFC PATCH 2/2] softirq: Drop the warning from do_softirq_post_smp_call_flush().
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: 2023-08-16 14:49:58
On 15/08/2023 14.08, Jesper Dangaard Brouer wrote:
> On 14/08/2023 11.35, Sebastian Andrzej Siewior wrote:
>> This is an undesired situation and it has been attempted to avoid the
>> situation in which ksoftirqd becomes scheduled. This changed since
>> commit d15121be74856 ("Revert "softirq: Let ksoftirqd do its job"")
>> and now a threaded interrupt handler will handle soft interrupts at
>> its end even if ksoftirqd is pending. That means that they will be
>> processed in the context in which they were raised.
>
> $ git describe --contains d15121be74856
> v6.5-rc1~232^2~4
>
> That revert basically removes the "overload" protection that was added
> to cope with DDoS situations in Aug 2016 (Cc. Cloudflare). As described
> in https://git.kernel.org/torvalds/c/4cd13c21b207 ("softirq: Let
> ksoftirqd do its job") in UDP overload situations when UDP socket
> receiver runs on same CPU as ksoftirqd it "falls-off-an-edge" and
> almost doesn't process packets (because softirq steals CPU/sched time
> from UDP pid). Warning Cloudflare (Cc) as this might affect their
> production use-cases, and I recommend getting involved to evaluate the
> effect of these changes.
I did some testing on net-next (with commit d15121be74856 ("Revert
"softirq: Let ksoftirqd do its job"")) using UDP pktgen + udp_sink.
I observe the old overload issue occurring again, where the userspace
process (udp_sink) processes very few packets when running on the *same*
CPU as the NAPI-RX/IRQ processing. The perf report "comm" clearly shows
that NAPI runs in the context of the "udp_sink" process, stealing its
sched time. (Same CPU around 3Kpps and diff CPU 1722Kpps, see details
below).
What happens is that NAPI takes 64 packets and queues them to the
udp_sink process *socket*; the udp_sink process *wakes up*, processes 1
packet from the socket queue, and on exit (__local_bh_enable_ip) runs
softirq, which starts NAPI (to again process 64 packets... repeat).
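The loop described above can be illustrated with a toy model. This is only a sketch, not kernel code: it ignores the scheduler, timing, and per-packet costs, and the names `simulate`, `napi_budget`, `per_wakeup` and `queue_cap` are hypothetical (`queue_cap` stands in for the socket receive buffer limit, counted in packets). It only shows why a reader that consumes 1 packet per round, while 64 arrive per round, ends up dropping almost everything once the queue is full.

```python
def simulate(rounds, napi_budget=64, per_wakeup=1, queue_cap=128):
    """Toy model of the NAPI -> socket queue -> reader loop.

    Each round: the softirq/NAPI poll tries to enqueue `napi_budget`
    packets (hypothetical fixed budget), then the reader wakes up and
    dequeues at most `per_wakeup` packets before softirq runs again.
    Returns (delivered, dropped).
    """
    queued = 0      # packets currently sitting in the socket queue
    delivered = 0   # packets read by the userspace reader
    dropped = 0     # packets rejected because the queue was full

    for _ in range(rounds):
        # NAPI poll: enqueue what fits, drop the rest (receive buffer full).
        accepted = min(napi_budget, queue_cap - queued)
        queued += accepted
        dropped += napi_budget - accepted

        # Reader wakeup: consume a few packets, then yield to softirq again.
        taken = min(per_wakeup, queued)
        queued -= taken
        delivered += taken

    return delivered, dropped


if __name__ == "__main__":
    # Reader starved: 1 packet per round versus 64 arriving per round.
    print(simulate(10_000, per_wakeup=1))
    # Reader keeps up (e.g. it is not interrupted by softirq on its own CPU).
    print(simulate(10_000, per_wakeup=64))
```

In this model the starved reader receives 1 out of every 64 packets once the queue has filled. The ratio is not meant to match the measured numbers, only to show the shape of the feedback loop.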
I do realize/acknowledge that the reverted patch caused other latency issues, given it was a "big-hammer" approach affecting other softirq processing (as can be seen by e.g. the watchdog fixes patches). Thus, the revert makes sense, but how to regain the "overload" protection such that RX networking cannot starve processes reading from the socket? (is this what Sebastian's patchset does?)
I'm no expert in the sched / softirq area of the kernel, but I'm willing to help out testing different solutions that can regain the "overload" protection, e.g. avoid packet processing that "falls-off-an-edge" (and thus opens the kernel to being DDoS'ed easily).
Thread link for people Cc'ed: https://lore.kernel.org/all/20230814093528.117342-1-bigeasy@linutronix.de/#r
--Jesper
(some testlab results below)

[udp_sink] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

When udp_sink runs on same CPU as NAPI/softirq
 - UdpInDatagrams: 2,948 packets/sec

 $ nstat -n && sleep 1 && nstat
 #kernel
 IpInReceives                    2831056            0.0
 IpInDelivers                    2831053            0.0
 UdpInDatagrams                  2948               0.0
 UdpInErrors                     2828118            0.0
 UdpRcvbufErrors                 2828118            0.0
 IpExtInOctets                   130206496          0.0
 IpExtInNoECTPkts                2830576            0.0

When udp_sink runs on another CPU than NAPI-RX.
 - UdpInDatagrams: 1,722,307 pps

 $ nstat -n && sleep 1 && nstat
 #kernel
 IpInReceives                    2318560            0.0
 IpInDelivers                    2318562            0.0
 UdpInDatagrams                  1722307            0.0
 UdpInErrors                     596280             0.0
 UdpRcvbufErrors                 596280             0.0
 IpExtInOctets                   106634256          0.0
 IpExtInNoECTPkts                2318136            0.0
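
For readers who want the delivery ratios implied by the two nstat samples above, the arithmetic can be sketched as follows. The counter values are copied from the samples; the helper names `parse_nstat` and `delivery_ratio` are hypothetical, and the parser assumes the simple "name value rate" line layout shown above rather than every format nstat can emit.

```python
SAME_CPU = """\
#kernel
IpInReceives 2831056 0.0
IpInDelivers 2831053 0.0
UdpInDatagrams 2948 0.0
UdpInErrors 2828118 0.0
UdpRcvbufErrors 2828118 0.0
"""

OTHER_CPU = """\
#kernel
IpInReceives 2318560 0.0
IpInDelivers 2318562 0.0
UdpInDatagrams 1722307 0.0
UdpInErrors 596280 0.0
UdpRcvbufErrors 596280 0.0
"""


def parse_nstat(text):
    """Parse 'name value rate' lines into {name: int(value)}."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the '#kernel' header and shell prompt lines.
        if len(fields) < 2 or fields[0].startswith(("#", "$")):
            continue
        counters[fields[0]] = int(fields[1])
    return counters


def delivery_ratio(counters):
    """Fraction of IP-delivered packets that reached the UDP reader."""
    return counters["UdpInDatagrams"] / counters["IpInDelivers"]


same = parse_nstat(SAME_CPU)
other = parse_nstat(OTHER_CPU)

if __name__ == "__main__":
    print(f"same CPU:  {delivery_ratio(same):.3%} of packets delivered")
    print(f"other CPU: {delivery_ratio(other):.3%} of packets delivered")
    speedup = other["UdpInDatagrams"] / same["UdpInDatagrams"]
    print(f"other CPU receives about {speedup:.0f}x more datagrams per second")
```

In both samples UdpInErrors equals UdpRcvbufErrors, i.e. every counted error is a receive-buffer drop.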