Thread: 30 messages, 9 authors, 2020-09-30

Re: [RFC PATCH net-next 0/6] implement kthread based napi poll

From: Paolo Abeni <pabeni@redhat.com>
Date: 2020-09-30 08:58:17

On Tue, 2020-09-29 at 14:48 -0700, Jakub Kicinski wrote:
On Tue, 29 Sep 2020 13:16:59 -0700 Wei Wang wrote:
On Tue, Sep 29, 2020 at 12:19 PM Jakub Kicinski [off-list ref] wrote:
On Mon, 28 Sep 2020 19:43:36 +0200 Eric Dumazet wrote:  
Wei, this is a very nice work.

Please re-send it without the RFC tag, so that we can hopefully merge it ASAP.  
The problem is that, for the application I'm testing with, this
implementation is significantly slower (in terms of RPS) than Felix's
code:

              |        L  A  T  E  N  C  Y       |  App   |     C P U     |
       |  RPS |   AVG  |  P50  |   P99  |   P999 | Overld |  busy |  PSI  |
thread | 1.1% | -15.6% | -0.3% | -42.5% |  -8.1% | -83.4% | -2.3% | 60.6% |
work q | 4.3% | -13.1% |  0.1% | -44.4% |  -1.1% |   2.3% | -1.2% | 90.1% |
TAPI   | 4.4% | -17.1% | -1.4% | -43.8% | -11.0% | -60.2% | -2.3% | 46.7% |

thread is this code, "work q" is Felix's code, TAPI is my hacks.

The numbers are comparing performance to normal NAPI.

In all cases (but not the baseline) I configured timer-based polling
(defer_hard_irqs), with around 100us timeout. Without deferring hard
IRQs, threaded NAPI is actually slower for this app. Also, I'm not
modifying niceness; doing so again causes an application performance
regression here.
 
If I remember correctly, Felix's workqueue code uses the WQ_HIGHPRI
flag, which by default uses -20 as the nice value for the workqueue
threads. But the kthread implementation leaves the nice level at the
default of 0. This could be one difference.
FWIW this is the data based on which I concluded the nice -20 actually
makes things worse here:

     threaded: -1.50%
threaded p-20: -5.67%
     thr poll:  2.93%
thr poll p-20:  2.22%

Annoyingly, the relative performance change varies day to day, and this
test was run a while back (over the weekend I was getting < 2%
improvement with this set).
I'm assuming your application uses UDP as the transport protocol (raw
IP or packet sockets should behave in the same way). I observed similar
behavior, that is, unstable figures and an end-to-end tput decrease
when the network stack gets more cycles (or becomes faster), when the
bottleneck was in user-space processing[1].

You can double-check that you are hitting the same scenario by
observing the UDP protocol stats (you should see higher drop figures
with threaded, and even more with threaded p-20, compared to the other
impls).
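
One way to compare the UDP drop figures between runs is to snapshot the
"Udp:" counters in /proc/net/snmp before and after each test. The
following is only a minimal sketch: the helper names
(parse_udp_counters, udp_drop_delta) are made up for illustration, and
it assumes that InErrors and RcvbufErrors are the counters of interest
for receive-side drops.

```python
from pathlib import Path


def parse_udp_counters(snmp_text):
    """Return the 'Udp:' counters found in /proc/net/snmp text as a dict.

    The file carries two 'Udp:' lines: one with counter names and one
    with the matching values. 'UdpLite:' lines are ignored.
    """
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Udp:")]
    if len(rows) < 2:
        return {}
    names, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(names, values)}


def udp_drop_delta(before, after, keys=("InErrors", "RcvbufErrors")):
    """Difference of the selected counters between two snapshots."""
    return {key: after.get(key, 0) - before.get(key, 0) for key in keys}


if __name__ == "__main__":
    snmp = Path("/proc/net/snmp")
    if snmp.exists():  # Linux only
        counters = parse_udp_counters(snmp.read_text())
        for key in ("InDatagrams", "InErrors", "RcvbufErrors"):
            print(key, counters.get(key))
```

Taking one snapshot before and one after each run, then comparing the
deltas across the different NAPI modes, should show whether the drops
grow when the stack gets more cycles.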

If you are hitting such a scenario, you should be able to improve
things by setting nice -20 on the user-space process, increasing the
UDP socket receive buffer size, or enabling socket busy polling
(/proc/sys/net/core/busy_poll, I mean).
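
As a rough illustration of the receive-buffer and busy-polling knobs,
the sketch below requests a larger SO_RCVBUF on a UDP socket and reads
back the current busy_poll sysctl. The function names and the 1 MiB
default are assumptions made for this example, not values taken from
the thread; the kernel may also cap the request at
net.core.rmem_max, so the effective size is read back rather than
assumed.

```python
import socket
from pathlib import Path


def make_udp_socket(rcvbuf_bytes=1024 * 1024):
    """Create a UDP socket and request a larger receive buffer.

    Returns the socket and the effective buffer size reported by the
    kernel, which can differ from the requested value.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf_bytes)
    effective = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, effective


def read_busy_poll_usec(path="/proc/sys/net/core/busy_poll"):
    """Return the busy_poll sysctl value in usec, or None if unavailable."""
    sysctl = Path(path)
    if not sysctl.exists():
        return None
    return int(sysctl.read_text().strip())


if __name__ == "__main__":
    sock, effective = make_udp_socket()
    print("effective SO_RCVBUF:", effective)
    print("net.core.busy_poll:", read_busy_poll_usec())
    sock.close()
```

Changing the sysctl itself, or lowering the process nice value to -20,
needs root (or the relevant capabilities), so those steps are left out
of the sketch.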

Cheers,

Paolo

[1] Perhaps that is obvious to you, but I personally was confused the
first time I observed this fact. There is a nice paper from Luigi Rizzo
explaining why that happens:
http://www.iet.unipi.it/~a007834/papers/2016-ancs-cvt.pdf