Thread: 17 messages, 3 authors, 2025-08-30

Re: [PATCH net-next v8 0/2] Add support to do threaded napi busy poll

From: Samiullah Khawaja <hidden>
Date: 2025-08-29 22:27:25

On Fri, Aug 29, 2025 at 2:27 PM Martin Karsten [off-list ref] wrote:
On 2025-08-29 16:49, Samiullah Khawaja wrote:
On Fri, Aug 29, 2025 at 11:08 AM Martin Karsten [off-list ref] wrote:
On 2025-08-29 13:50, Samiullah Khawaja wrote:
On Thu, Aug 28, 2025 at 8:15 PM Martin Karsten [off-list ref] wrote:
On 2025-08-28 21:16, Samiullah Khawaja wrote:
Extend the already existing support of threaded napi poll to do continuous
busy polling.

This is used for doing continuous polling of napi to fetch descriptors
from backing RX/TX queues for low latency applications. Allow enabling
of threaded busypoll using netlink so this can be enabled on a set of
dedicated napis for low latency applications.

Once enabled, the user can fetch the PID of the kthread doing NAPI polling
and set its affinity, priority and scheduler depending on the
low-latency requirements.
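
One way to do the PID lookup is to match on the kthread name rather than
grepping the full process list. The sketch below assumes the threads are
named napi/<ifname>-<napi-id>; that naming is an assumption here and should
be checked on the kernel under test. The helper name napi_pids is made up
for illustration. It only filters a "PID COMM" listing read from stdin, so
it can be tried on captured output first.

```shell
# Hypothetical helper: print the PIDs of NAPI kthreads for one interface.
# Reads "PID COMM" pairs on stdin, e.g. from: ps -e -o pid= -o comm=
# Assumes (unverified) that kthreads are named napi/<ifname>-<napi-id>.
napi_pids() {
    awk -v pfx="napi/$1-" 'index($2, pfx) == 1 { print $1 }'
}
```

Possible usage: ps -e -o pid= -o comm= | napi_pids eth0, then apply chrt and
taskset to each PID printed.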

Extend the netlink interface to allow enabling/disabling threaded
busypolling at individual napi level.

We use this for our AF_XDP based hard low-latency usecase with usecs
level latency requirement. For our usecase we want low jitter and stable
latency at P99.

Following is an analysis and comparison of available (and compatible)
busy poll interfaces for a low latency usecase with stable P99. This can
be suitable for applications that want very low latency at the expense
of cpu usage and efficiency.

Already existing APIs (SO_BUSYPOLL and epoll) allow busy polling a NAPI
backing a socket, but the missing piece is a mechanism to busy poll a
NAPI instance in a dedicated thread while ignoring available events or
packets, regardless of the userspace API. Most existing mechanisms are
designed to work in a pattern where you poll until new packets or events
are received, after which userspace is expected to handle them.

As a result, one has to hack together a solution using a mechanism
intended to receive packets or events, not to simply NAPI poll. NAPI
threaded busy polling, on the other hand, provides this capability
natively, independent of any userspace API. This makes it really easy to
set up and manage.

For analysis we use an AF_XDP based benchmarking tool `xsk_rr`. The
tool, and how it tries to simulate the real workload, is described
below:

- It sends UDP packets between 2 machines.
- The client machine sends packets at a fixed frequency. To maintain the
     frequency of the packets being sent, we use open-loop sampling. That is,
     the packets are sent from a separate thread.
- The server replies to the packet inline by reading the pkt from the
     recv ring and replying using the tx ring.
- To simulate the application processing time, we use a configurable
     delay in usecs on the client side after a reply is received from the
     server.

The xsk_rr tool is posted separately as an RFC for tools/testing/selftest.

We use this tool with the following NAPI polling configurations:

- Interrupts only
- SO_BUSYPOLL (inline in the same thread where the client receives the
     packet).
- SO_BUSYPOLL (separate thread and separate core)
- Threaded NAPI busypoll

The system is configured using the following script in all 4 cases:
echo 0 | sudo tee /sys/class/net/eth0/threaded
echo 0 | sudo tee /proc/sys/kernel/timer_migration
echo off | sudo tee  /sys/devices/system/cpu/smt/control

sudo ethtool -L eth0 rx 1 tx 1
sudo ethtool -G eth0 rx 1024

echo 0 | sudo tee /proc/sys/net/core/rps_sock_flow_entries
echo 0 | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus

    # pin IRQs on CPU 2
IRQS="$(gawk '/eth0-(TxRx-)?1/ {match($1, /([0-9]+)/, arr); \
                                print arr[0]}' < /proc/interrupts)"
# IRQS is left unquoted so the loop runs once per IRQ number
for irq in ${IRQS}; \
        do echo 2 | sudo tee /proc/irq/$irq/smp_affinity_list; done

echo -1 | sudo tee /proc/sys/kernel/sched_rt_runtime_us

for i in /sys/devices/virtual/workqueue/*/cpumask; \
                        do echo $i; echo 1,2,3,4,5,6 > $i; done

if [[ -z "$1" ]]; then
     echo 400 | sudo tee /proc/sys/net/core/busy_read
     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
     echo 15000   | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi

sudo ethtool -C eth0 adaptive-rx off adaptive-tx off rx-usecs 0 tx-usecs 0

if [[ "$1" == "enable_threaded" ]]; then
     echo 0 | sudo tee /proc/sys/net/core/busy_poll
     echo 0 | sudo tee /proc/sys/net/core/busy_read
     echo 100 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
     echo 2 | sudo tee /sys/class/net/eth0/threaded
     NAPI_T=$(ps -ef | grep napi | grep -v grep | awk '{ print $2 }')
     sudo chrt -f  -p 50 $NAPI_T

     # pin threaded poll thread to CPU 2
     sudo taskset -pc 2 $NAPI_T
fi

if [[ "$1" == "enable_interrupt" ]]; then
     echo 0 | sudo tee /proc/sys/net/core/busy_read
     echo 0 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
     echo 15000 | sudo tee /sys/class/net/eth0/gro_flush_timeout
fi
The experiment script above does not work, because the sysfs parameter
does not exist anymore in this version.
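
In this version the mode is selected per NAPI over netlink, as in the
napi-set calls shown later in this thread. A minimal sketch of building the
payload for those calls follows. The helper name napi_set_json is
illustrative only, and the mode strings are the two that appear in this
thread ("busy-poll-enabled" and "disabled"); any other value is untested
here.

```shell
# Illustrative helper: build the napi-set JSON payload for one NAPI id.
# Args: NAPI id, threaded mode string (as used in this thread).
napi_set_json() {
    printf '{"id": %d, "threaded": "%s"}\n' "$1" "$2"
}
```

The output can be passed as --json="$(napi_set_json 8209 busy-poll-enabled)"
to the pyynl cli.py invocation with --do napi-set.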
To enable the various configurations, the script can be run as follows:

- Interrupt Only
     <script> enable_interrupt
- SO_BUSYPOLL (no arguments to script)
     <script>
- NAPI threaded busypoll
     <script> enable_threaded
If using idpf, the script needs to be run again after launching the
workload to make sure that the configuration is not reverted, because
idpf reverts some settings on software reset when the AF_XDP program
is attached.

Once configured, the workload is run with the various configurations using
the following commands. Set period (1/frequency) and delay in usecs to
produce results for packet frequency and application processing delay.
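
For the period value, a small conversion sketch may help: the period in
usecs is 1000000 divided by the packet rate in pkt/s. The helper name
period_us is hypothetical and not part of xsk_rr.

```shell
# Convert a packet rate in pkt/s to a send period in usecs
# (period = 1/frequency, scaled to microseconds).
period_us() {
    awk -v rate="$1" 'BEGIN { printf "%.2f\n", 1000000 / rate }'
}
```

For example, period_us 125000 prints 8.00 and period_us 32000 prints 31.25.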

    ## Interrupt Only and SO_BUSYPOLL (inline)

- Server
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 -h -v
- Client
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v
    ## SO_BUSYPOLL (done on a separate core using recvfrom)

Argument -t spawns a separate thread and continuously calls recvfrom.

- Server
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
        -h -v -t
- Client
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -t
    ## NAPI Threaded Busy Poll

Argument -n skips the recvfrom call as there is no recv kick needed.

- Server
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
        -D <IP-dest> -S <IP-src> -M <MAC-dst> -m <MAC-src> -p 54321 \
        -h -v -n
- Client
sudo chrt -f 50 taskset -c 3-5 ./xsk_rr -o 0 -B 400 -i eth0 -4 \
        -S <IP-src> -D <IP-dest> -m <MAC-src> -M <MAC-dst> -p 54321 \
        -P <Period-usecs> -d <Delay-usecs>  -T -l 1 -v -n
I believe there's a bug when disabling busy-polled napi threading after
an experiment. My system hangs and needs a hard reset.
| Experiment | Percentile | interrupts | SO_BUSYPOLL | SO_BUSYPOLL(separate) | NAPI threaded |
|---|---|---|---|---|---|
| 12 Kpkt/s + 0us delay | p5 | 12700 | 12900 | 13300 | 12800 |
| | p50 | 13100 | 13600 | 14100 | 13000 |
| | p95 | 13200 | 13800 | 14400 | 13000 |
| | p99 | 13200 | 13800 | 14400 | 13000 |
| 32 Kpkt/s + 30us delay | p5 | 19900 | 16600 | 13100 | 12800 |
| | p50 | 21100 | 17000 | 13700 | 13000 |
| | p95 | 21200 | 17100 | 14000 | 13000 |
| | p99 | 21200 | 17100 | 14000 | 13000 |
| 125 Kpkt/s + 6us delay | p5 | 14600 | 17100 | 13300 | 12900 |
| | p50 | 15400 | 17400 | 13800 | 13100 |
| | p95 | 15600 | 17600 | 14000 | 13100 |
| | p99 | 15600 | 17600 | 14000 | 13100 |
| 12 Kpkt/s + 78us delay | p5 | 14100 | 16700 | 13200 | 12600 |
| | p50 | 14300 | 17100 | 13900 | 12800 |
| | p95 | 14300 | 17200 | 14200 | 12800 |
| | p99 | 14300 | 17200 | 14200 | 12800 |
| 25 Kpkt/s + 38us delay | p5 | 19900 | 16600 | 13000 | 12700 |
| | p50 | 21000 | 17100 | 13800 | 12900 |
| | p95 | 21100 | 17100 | 14100 | 12900 |
| | p99 | 21100 | 17100 | 14100 | 12900 |
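
For anyone recomputing the percentiles in this table from raw per-packet
samples, a minimal sketch follows. It uses the nearest-rank method, which is
an assumption: the series does not say how xsk_rr derives its percentiles.
The helper name pctl is illustrative, and it expects an integer percentile
and one numeric sample per line.

```shell
# Nearest-rank percentile of numeric samples (one per line on stdin).
# Usage: pctl 99 < samples.txt
pctl() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
            if (rank < 1) rank = 1
            print v[rank]
        }'
}
```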
On my system, routing the irq to the same core where xsk_rr runs results in
lower latency than routing the irq to a different core. To me that makes
sense in a low-rate latency-sensitive scenario where interrupts are not
causing much trouble, but the resulting locality might be beneficial. I
think you should test this as well.

The experiments reported above (except for the first one) are
cherry-picking parameter combinations that result in a near-100% load
and ignore anything else. Near-100% load is a highly unlikely scenario
for a latency-sensitive workload.
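
One rough way to quantify this is the share of each send period taken up by
the configured delay, i.e. delay * rate. The sketch below assumes that ratio
is a fair proxy for load; it ignores any per-packet rx/tx cost, so actual
load would be somewhat higher. The helper name busy_pct is made up for
illustration.

```shell
# Rough estimate: percentage of each send period consumed by the delay.
# Args: packet rate in pkt/s, delay in usecs.
busy_pct() {
    awk -v rate="$1" -v delay="$2" \
        'BEGIN { printf "%.1f\n", 100 * delay * rate / 1000000 }'
}
```

For the four rows of the table with a non-zero delay this gives 96.0
(32 Kpkt/s + 30us), 75.0 (125 Kpkt/s + 6us), 93.6 (12 Kpkt/s + 78us) and
95.0 (25 Kpkt/s + 38us).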

When combining the above two paragraphs, I believe other interesting
setups are missing from the experiments, such as comparing to two pairs
of xsk_rr under high load (as mentioned in my previous emails).
This is to support an existing real workload. We cannot easily modify
its threading model. The two xsk_rr model would be a different
workload.
That's fine, but:

- In principle I don't think it's a good justification for a kernel
change that an application cannot be rewritten.

- I believe it is your responsibility to more comprehensively document
the impact of your proposed changes beyond your one particular workload.

Also, I do believe there's a bug as mentioned before. I can't quite pin
it down, but every time after running a "NAPI threaded" experiment, my
server enters a funny state and eventually becomes largely unresponsive
without much useful output and needs a hard reset. For example:

1) Run "NAPI threaded" experiment
2) Disable "threaded" parameter in NAPI config
3) Run IRQ experiment -> xsk_rr hangs and apparently holds a lock,
because other services stop working successively.
I just tried with this scenario and it seems to work fine.
Ok. I've reproduced it more concisely. This is after a fresh reboot:

sudo ethtool -L ens15f1np1 combined 1

sudo net-next/tools/net/ynl/pyynl/cli.py --no-schema --output-json\
  --spec net-next/Documentation/netlink/specs/netdev.yaml --do napi-set\
  --json='{"id": 8209, "threaded": "busy-poll-enabled"}'

# ping from another machine to this NIC works
# napi thread busy at 100%

sudo net-next/tools/net/ynl/pyynl/cli.py --no-schema --output-json\
  --spec net-next/Documentation/netlink/specs/netdev.yaml --do napi-set\
  --json='{"id": 8209, "threaded": "disabled"}'

# napi thread gone
# ping from another machine does not work
# tcpdump does not show incoming icmp packets
# but machine still responsive on other NIC

sudo ethtool -L ens15f1np1 combined 12
Ok I have found it. It's related to stopping the kthreads. Will send a
revision out.
# networking hangs on all NICs
# sudo reboot on console hangs
# hard reset needed, no useful output
Do you not have this problem?
Not really. Jakub actually fixed a deadlock in napi threaded recently.
Maybe you are hitting that? Are you using the latest base-commit that
I have in this patch series?
Yep:
- Ubuntu 24.04.3 LTS system
- base commit before patches is c3199adbe4ffffc7b6536715e0290d1919a45cd9
- NIC driver is ice, PCI id 8086:159b.

Let me know if you need any other information.

Best,
Martin