Re: [RFC PATCH net-next 0/6] implement kthread based napi poll

[RFC PATCH net-next 0/6] implement kthread based napi poll · Wei Wang <hidden> · 2020-09-14
[RFC PATCH net-next 1/6] net: implement threaded-able napi poll loop support · Wei Wang <hidden> · 2020-09-14
Re: [RFC PATCH net-next 1/6] net: implement threaded-able napi poll loop support · Hannes Frederic Sowa <hidden> · 2020-09-25
Re: [RFC PATCH net-next 1/6] net: implement threaded-able napi poll loop support · Wei Wang <hidden> · 2020-09-25
Re: [RFC PATCH net-next 1/6] net: implement threaded-able napi poll loop support · Hannes Frederic Sowa <hidden> · 2020-09-26
Re: [RFC PATCH net-next 1/6] net: implement threaded-able napi poll loop support · Paolo Abeni <pabeni@redhat.com> · 2020-09-28
Re: [RFC PATCH net-next 1/6] net: implement threaded-able napi poll loop support · Wei Wang <hidden> · 2020-09-28
[RFC PATCH net-next 2/6] net: add sysfs attribute to control napi threaded mode · Wei Wang <hidden> · 2020-09-14
[RFC PATCH net-next 3/6] net: extract napi poll functionality to __napi_poll() · Wei Wang <hidden> · 2020-09-14
[RFC PATCH net-next 4/6] net: modify kthread handler to use __napi_poll() · Wei Wang <hidden> · 2020-09-14
[RFC PATCH net-next 6/6] net: improve napi threaded config · Wei Wang <hidden> · 2020-09-14
[RFC PATCH net-next 5/6] net: process RPS/RFS work in kthread context · Wei Wang <hidden> · 2020-09-14
Re: [RFC PATCH net-next 5/6] net: process RPS/RFS work in kthread context · Wei Wang <hidden> · 2020-09-18
Re: [RFC PATCH net-next 5/6] net: process RPS/RFS work in kthread context · Eric Dumazet <edumazet@google.com> · 2020-09-21
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Magnus Karlsson <hidden> · 2020-09-25
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Wei Wang <hidden> · 2020-09-25
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Eric Dumazet <hidden> · 2020-09-25
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Stephen Hemminger <stephen@networkplumber.org> · 2020-09-25
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Eric Dumazet <edumazet@google.com> · 2020-09-25
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Stephen Hemminger <stephen@networkplumber.org> · 2020-09-25
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Jakub Kicinski <kuba@kernel.org> · 2020-09-25
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Magnus Karlsson <hidden> · 2020-09-28
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Eric Dumazet <edumazet@google.com> · 2020-09-28
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Wei Wang <hidden> · 2020-09-28
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Jakub Kicinski <kuba@kernel.org> · 2020-09-29
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Wei Wang <hidden> · 2020-09-29
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Jakub Kicinski <kuba@kernel.org> · 2020-09-29
RE: [RFC PATCH net-next 0/6] implement kthread based napi poll · David Laight <hidden> · 2020-09-30
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Paolo Abeni <pabeni@redhat.com> · 2020-09-30
Re: [RFC PATCH net-next 0/6] implement kthread based napi poll · Jakub Kicinski <kuba@kernel.org> · 2020-09-30

From: Jakub Kicinski <kuba@kernel.org>
Date: 2020-09-29 21:48:55

On Tue, 29 Sep 2020 13:16:59 -0700 Wei Wang wrote:

On Tue, Sep 29, 2020 at 12:19 PM Jakub Kicinski [off-list ref] wrote:

On Mon, 28 Sep 2020 19:43:36 +0200 Eric Dumazet wrote:

Wei, this is very nice work.

Please re-send it without the RFC tag, so that we can hopefully merge it ASAP.

The problem is that, for the application I'm testing with, this
implementation is significantly slower (in terms of RPS) than Felix's code:

              |        L  A  T  E  N  C  Y       |  App   |     C P U     |
       |  RPS |   AVG  |  P50  |   P99  |   P999 | Overld |  busy |  PSI  |
thread | 1.1% | -15.6% | -0.3% | -42.5% |  -8.1% | -83.4% | -2.3% | 60.6% |
work q | 4.3% | -13.1% |  0.1% | -44.4% |  -1.1% |   2.3% | -1.2% | 90.1% |
TAPI   | 4.4% | -17.1% | -1.4% | -43.8% | -11.0% | -60.2% | -2.3% | 46.7% |

thread is this code, "work q" is Felix's code, TAPI is my hacks.

The numbers are comparing performance to normal NAPI.

In all cases (but not the baseline) I configured timer-based polling
(defer_hard_irqs), with around 100us timeout. Without deferring hard
IRQs, threaded NAPI is actually slower for this app. Also, I'm not
modifying niceness; doing so again causes an application performance
regression here.
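
For readers who want to try the timer-based polling setup described above, here is a minimal sketch. It assumes the per-device sysfs attributes `napi_defer_hard_irqs` and `gro_flush_timeout` (the latter is in nanoseconds). The device name `eth0` and the deferral count of 2 are placeholders, not values from this test; only the roughly 100us timeout comes from the message.

```python
import os

SYSFS_NET = "/sys/class/net"


def deferral_settings(dev, timeout_us=100, defer_count=2):
    """Return {sysfs path: value} for timer-based NAPI polling on `dev`.

    gro_flush_timeout is expressed in nanoseconds, so the microsecond
    timeout is multiplied by 1000. defer_count is an assumed placeholder.
    """
    base = os.path.join(SYSFS_NET, dev)
    return {
        os.path.join(base, "napi_defer_hard_irqs"): str(defer_count),
        os.path.join(base, "gro_flush_timeout"): str(timeout_us * 1000),
    }


def apply_settings(settings):
    """Write each value to its sysfs file (requires root)."""
    for path, value in settings.items():
        with open(path, "w") as f:
            f.write(value)


if __name__ == "__main__":
    # Dry run: only print what would be written.
    for path, value in deferral_settings("eth0").items():
        print(f"{path} <- {value}")
```

Calling `apply_settings(deferral_settings("eth0"))` as root would apply the values; the dry run is the default so the sketch is safe to run anywhere.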

If I remember correctly, Felix's workqueue code uses the WQ_HIGHPRI flag,
which by default gives the workqueue threads a nice value of -20.
But the kthread implementation leaves the nice level at the default of 0.
This could be one difference.

FWIW this is the data based on which I concluded the nice -20 actually
makes things worse here:

     threaded: -1.50%
threaded p-20: -5.67%
     thr poll:  2.93%
thr poll p-20:  2.22%

Annoyingly, the relative performance change varies day to day, and this
test was run a while back (over the weekend I was getting < 2% improvement
with this set).

I am not sure what the benchmark is doing

Not a benchmark, real workload :)

but one thing to try is to limit the CPUs that run the kthreads to a
smaller # of CPUs. This could bring up the kernel cpu usage to a
higher %, e.g. > 80%, so the scheduler is less likely to schedule
user threads on these CPUs, thus providing isolation between
kthreads and the user threads, and reducing the scheduling overhead.

Yeah... If I do pinning or isolation I can get to 15% RPS improvement
for this application.. no threaded NAPI needed. The point for me is to
not have to do such tuning per app x platform x workload of the day.
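
The pinning experiment discussed above can be approximated from user space with CPU affinity. This is only a sketch: how the NAPI kthread IDs are discovered is left out, so `napi_kthread_pids` is a hypothetical list, and the choice of the two lowest-numbered CPUs is an arbitrary assumption.

```python
import os


def pick_cpus(available, count):
    """Return the `count` lowest-numbered CPUs from `available`."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return set(sorted(available)[:count])


def pin_threads(pids, cpus):
    """Restrict each thread/process ID in `pids` to the CPU set `cpus`."""
    for pid in pids:
        os.sched_setaffinity(pid, cpus)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)  # CPUs this process may run on
    target = pick_cpus(allowed, 2)
    print("would pin NAPI kthreads to CPUs", sorted(target))
    # pin_threads(napi_kthread_pids, target)  # hypothetical PID list, needs root
```

This is exactly the kind of per-platform tuning the reply above wants to avoid, so it is useful mainly for reproducing the comparison.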

This could help if the throughput drop is caused by higher scheduling
latency for the user threads. Another thing to try is to raise the
scheduling class of the kthread from SCHED_OTHER to SCHED_FIFO. This
could help if the throughput drop is caused by the kthreads
experiencing higher scheduling latency.

Isn't the fundamental problem that the scheduler works at ms scale while
we're talking about 100us at most? And AFAICT the scheduler doesn't
have a knob to adjust migration cost per process? :(
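
The SCHED_OTHER to SCHED_FIFO experiment suggested in the quoted text can also be tried from user space. The sketch below assumes the caller already knows the kthread IDs (the `pids` argument is hypothetical input) and has CAP_SYS_NICE; the priority of 1 is an arbitrary choice, not a recommendation from the thread.

```python
import os


def fifo_param(priority=1):
    """Build a sched_param for SCHED_FIFO, validating the priority range."""
    lo = os.sched_get_priority_min(os.SCHED_FIFO)
    hi = os.sched_get_priority_max(os.SCHED_FIFO)
    if not lo <= priority <= hi:
        raise ValueError(f"SCHED_FIFO priority must be in [{lo}, {hi}]")
    return os.sched_param(priority)


def make_fifo(pids, priority=1):
    """Switch each thread ID in `pids` to SCHED_FIFO (needs CAP_SYS_NICE)."""
    param = fifo_param(priority)
    for pid in pids:
        os.sched_setscheduler(pid, os.SCHED_FIFO, param)


if __name__ == "__main__":
    # Only show the parameter that would be applied; no scheduling change.
    print("SCHED_FIFO priority:", fifo_param(1).sched_priority)
```

Note that a real-time class changes which task wins the CPU, but it does not by itself address the time-scale mismatch raised above.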

I just reached out to the kernel experts @FB for their input.

Also let me re-run with a normal prio WQ.

1 NUMA node. 18 NAPI instances, each using around 25% of a single CPU.

I was initially hoping that TAPI would fit nicely as an extension
of this code, but I don't think that will be the case.

Are there any assumptions you're making about the configuration that
I should try to replicate?
