Re: [PATCH net v4 1/2] net: sched: fix packet stuck problem for lockless qdisc

From: Yunsheng Lin <hidden>
Date: 2021-04-21 01:53:02
Also in: bpf, linux-can, lkml

On 2021/4/21 4:34, Michal Kubecek wrote:

On Tue, Apr 20, 2021 at 01:55:03AM +0200, Michal Kubecek wrote:

quoted

On Mon, Apr 19, 2021 at 05:29:46PM +0200, Michal Kubecek wrote:

quoted

As pointed out in the discussion on v3, this patch may result in
significantly higher CPU consumption with multiple threads competing on
a saturated outgoing device. I missed this submission so that I haven't
checked it yet but given the description of v3->v4 changes above, it's
quite likely that it suffers from the same problem.

And it indeed does. However, with the additional patch from the v3
discussion, the numbers are approximately the same as with an unpatched
mainline kernel.

As with v4, I tried this patch on top of 5.12-rc7 with real devices.
I used two machines with 10Gb/s Intel ixgbe NICs, sender has 16 CPUs
(2 8-core CPUs with HT disabled) and 16 Rx/Tx queues, receiver has
48 CPUs (2 12-core CPUs with HT enabled) and 48 Rx/Tx queues.

  threads    5.12-rc7    5.12-rc7 + v4    5.12-rc7 + v4 + stop
     1        25.1%          38.1%            22.9%
     8        66.2%         277.0%            74.1%
    16        90.1%         150.7%            91.0%
    32       107.2%         272.6%           108.3%
    64       116.3%         487.5%           118.1%
   128       126.1%         946.7%           126.9%

(The values are normalized to one core, i.e. 100% corresponds to one
fully used logical CPU.)

I repeated the tests few more times and with more iterations and it
seems the problem rather was that the CPU utilization numbers are not
very stable, in particular with number of connections/threads close to
the number of CPUs and Tx queues. Refined results (and also other tests)
show that full 3-patch series performs similar to unpatched 5.12-rc7
(within the margin of statistical error).

Thanks for the clarifying.
I did a similar test yesterday(using 10Gb netdev and the iperf tool I has),
the initial result suggested the same.

However, I noticed something disturbing in the results of a simple
1-thread TCP_STREAM test (client sends data through a TCP connection to
server using long writes, we measure the amount of data received by the
server):

  server: 172.17.1.1, port 12543
  iterations: 20, threads: 1, test length: 30
  test: TCP_STREAM, message size: 1048576
  
  1     927403548.4 B/s,  avg   927403548.4 B/s, mdev           0.0 B/s (  0.0%)
  2    1176317172.1 B/s,  avg  1051860360.2 B/s, mdev   124456811.8 B/s ( 11.8%), confid. +/-  1581348251.3 B/s (150.3%)
  3     927335837.8 B/s,  avg  1010352186.1 B/s, mdev   117354970.3 B/s ( 11.6%), confid. +/-   357073677.2 B/s ( 35.3%)
  4    1176728045.1 B/s,  avg  1051946150.8 B/s, mdev   124576544.7 B/s ( 11.8%), confid. +/-   228863127.8 B/s ( 21.8%)
  5    1176788216.3 B/s,  avg  1076914563.9 B/s, mdev   122102985.3 B/s ( 11.3%), confid. +/-   169478943.5 B/s ( 15.7%)
  6    1158167055.1 B/s,  avg  1090456645.8 B/s, mdev   115504209.5 B/s ( 10.6%), confid. +/-   132805140.8 B/s ( 12.2%)
  7    1176243474.4 B/s,  avg  1102711907.0 B/s, mdev   111069717.1 B/s ( 10.1%), confid. +/-   110956822.2 B/s ( 10.1%)
  8    1176771142.8 B/s,  avg  1111969311.5 B/s, mdev   106744173.5 B/s (  9.6%), confid. +/-    95417120.0 B/s (  8.6%)
  9    1176206364.6 B/s,  avg  1119106761.8 B/s, mdev   102644185.2 B/s (  9.2%), confid. +/-    83685200.5 B/s (  7.5%)
  10   1175888409.4 B/s,  avg  1124784926.6 B/s, mdev    98855550.5 B/s (  8.8%), confid. +/-    74537085.1 B/s (  6.6%)
  11   1176541407.6 B/s,  avg  1129490061.2 B/s, mdev    95422224.8 B/s (  8.4%), confid. +/-    67230249.7 B/s (  6.0%)
  12    934185352.8 B/s,  avg  1113214668.9 B/s, mdev   106114984.5 B/s (  9.5%), confid. +/-    70420712.5 B/s (  6.3%)
  13   1176550558.1 B/s,  avg  1118086660.3 B/s, mdev   103339448.9 B/s (  9.2%), confid. +/-    65002902.4 B/s (  5.8%)
  14   1176521808.8 B/s,  avg  1122260599.5 B/s, mdev   100711151.3 B/s (  9.0%), confid. +/-    60333655.0 B/s (  5.4%)
  15   1176744840.8 B/s,  avg  1125892882.3 B/s, mdev    98240838.2 B/s (  8.7%), confid. +/-    56319052.3 B/s (  5.0%)
  16   1176593778.5 B/s,  avg  1129061688.3 B/s, mdev    95909740.8 B/s (  8.5%), confid. +/-    52771633.5 B/s (  4.7%)
  17   1176583967.4 B/s,  avg  1131857116.5 B/s, mdev    93715582.2 B/s (  8.3%), confid. +/-    49669258.6 B/s (  4.4%)
  18   1176853301.8 B/s,  avg  1134356904.5 B/s, mdev    91656530.2 B/s (  8.1%), confid. +/-    46905244.8 B/s (  4.1%)
  19   1176592845.7 B/s,  avg  1136579848.8 B/s, mdev    89709043.8 B/s (  7.9%), confid. +/-    44424855.9 B/s (  3.9%)
  20   1176608117.3 B/s,  avg  1138581262.2 B/s, mdev    87871692.6 B/s (  7.7%), confid. +/-    42193098.5 B/s (  3.7%)
  all                     avg  1138581262.2 B/s, mdev    87871692.6 B/s (  7.7%), confid. +/-    42193098.5 B/s (  3.7%)

Each line shows result of one 30 second long test and average, mean
deviation and 99% confidence interval half width through the iterations
so far. While 17 iteration results are essentially the wire speed minus
TCP overhead, iterations 1, 3 and 12 are more than 20% lower. As results
of the same test on unpatched 5.12-rc7 are much more consistent (the
lowest iteration result through the whole test was 1175939718.3 and the
mean deviation only 276889.1 B/s), it doesn't seeem to be just a random
fluctuation.

I think I need to relearn the statistial math to understand the above
"99% confidence interval half width ":)

But the problem do not seems related too much with "99% confidence
interval half width ", but with "mean deviation"?

I tried using netperf, which seems only show throughput of 9415.06
(10^6bits/sec) using 10G netdev. which tool did you used to show the
above number?

I'll try to find out what happens in these outstanding iterations.

As my understanding, if there is only one thread, the first spin_trylock()
is likely to return true for the xmiting thread, unless the scheduled
net_tx_action() cause the xmiting thread's spin_trylock() returning false?

Thanks for comprehensive testing and analysing.

Michal

.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help