Re: [RFC PATCH net-next 0/6] implement kthread based napi poll
From: Wei Wang <hidden>
Date: 2020-09-29 20:17:16
On Tue, Sep 29, 2020 at 12:19 PM Jakub Kicinski [off-list ref] wrote:
On Mon, 28 Sep 2020 19:43:36 +0200 Eric Dumazet wrote:
Wei, this is a very nice work. Please re-send it without the RFC tag, so that we can hopefully merge it ASAP.

The problem is that, for the application I'm testing with, this implementation is significantly slower (in terms of RPS) than Felix's code:

              |        L  A  T  E  N  C  Y       |  App   |     C P U     |
       |  RPS |   AVG  |  P50  |   P99  |   P999 | Overld |  busy |  PSI  |
thread | 1.1% | -15.6% | -0.3% | -42.5% |  -8.1% | -83.4% | -2.3% | 60.6% |
work q | 4.3% | -13.1% |  0.1% | -44.4% |  -1.1% |   2.3% | -1.2% | 90.1% |
TAPI   | 4.4% | -17.1% | -1.4% | -43.8% | -11.0% | -60.2% | -2.3% | 46.7% |

"thread" is this code, "work q" is Felix's code, TAPI is my hacks. The numbers are comparing performance to normal NAPI.

In all cases (but not the baseline) I configured timer-based polling (defer_hard_irqs), with around 100us timeout. Without deferring hard IRQs, threaded NAPI is actually slower for this app. Also, I'm not modifying niceness; this again causes application performance regression here.
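For readers who want to reproduce the timer-based polling setup mentioned above, the following is a minimal sketch, assuming the per-device sysfs attributes napi_defer_hard_irqs and gro_flush_timeout (the latter in nanoseconds) are what was used. The interface name, the defer count of 1, and both helper names are illustrative assumptions; the message only states "around 100us timeout".

```python
from pathlib import Path


def defer_settings(timeout_us, defer_count):
    """Map a timeout in microseconds to the sysfs values to write."""
    return {
        # Number of empty polls before hard IRQs are re-enabled (assumed value).
        "napi_defer_hard_irqs": str(defer_count),
        # gro_flush_timeout is expressed in nanoseconds.
        "gro_flush_timeout": str(timeout_us * 1000),
    }


def apply_defer_settings(iface, timeout_us=100, defer_count=1,
                         sysfs_root="/sys/class/net"):
    """Write the settings for one interface; needs root on a real system."""
    base = Path(sysfs_root) / iface
    for name, value in defer_settings(timeout_us, defer_count).items():
        (base / name).write_text(value)


# Hypothetical usage, as root, for an interface named eth0:
#   apply_defer_settings("eth0", timeout_us=100, defer_count=1)
```

The sysfs_root parameter exists only so the helper can be pointed at a scratch directory when trying it out.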
If I remember correctly, Felix's workqueue code uses the HIGHPRIO flag, which by default uses -20 as the nice value for the workqueue threads. But the kthread implementation leaves the nice level as 20 by default. This could be one difference.

I am not sure what the benchmark is doing, but one thing to try is to limit the CPUs that run the kthreads to a smaller number of CPUs. This could bring the kernel CPU usage on those CPUs up to a higher percentage, e.g. > 80%, so the scheduler is less likely to schedule user threads on them, thus providing isolation between the kthreads and the user threads, and reducing the scheduling overhead. This could help if the throughput drop is caused by higher scheduling latency for the user threads.

Another thing to try is to raise the scheduling class of the kthreads from SCHED_OTHER to SCHED_FIFO. This could help if the throughput drop is caused by the kthreads experiencing higher scheduling latency.
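The two experiments suggested above (restricting the kthreads to fewer CPUs, and moving them to SCHED_FIFO) could be scripted roughly as follows. This is only a sketch under assumptions: the "napi/" thread-name prefix, the CPU set, and the priority value are hypothetical and should be adjusted to whatever names the kthreads actually have on the test system. The helper names are illustrative as well.

```python
import os


def find_threads(prefix, proc_root="/proc"):
    """Return the sorted PIDs whose comm name starts with the given prefix."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # task exited (or is unreadable) while scanning
        if comm.startswith(prefix):
            pids.append(int(entry))
    return sorted(pids)


def tune_thread(pid, cpus=None, fifo_priority=None):
    """Optionally pin a thread to a CPU set and/or switch it to SCHED_FIFO."""
    if cpus is not None:
        os.sched_setaffinity(pid, cpus)
    if fifo_priority is not None:
        os.sched_setscheduler(pid, os.SCHED_FIFO,
                              os.sched_param(fifo_priority))


# Hypothetical usage, as root:
#   for pid in find_threads("napi/"):
#       tune_thread(pid, cpus={0, 1}, fifo_priority=1)
```

Both knobs are kept optional so each experiment can be run on its own, which makes it easier to tell which of the two scheduling-latency explanations applies.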
1 NUMA node. 18 NAPI instances, each using around 25% of a single CPU.

I was initially hoping that TAPI would fit nicely as an extension of this code, but I don't think that will be the case.

Are there any assumptions you're making about the configuration that I should try to replicate?