Thread (32 messages) 32 messages, 2 authors, 2020-05-26

Re: iproute2: tc deletion freezes whole server

From: Václav Zindulka <hidden>
Date: 2020-05-18 14:16:52

On Sun, May 17, 2020 at 9:35 PM Cong Wang [off-list ref] wrote:
On Fri, May 8, 2020 at 6:59 AM Václav Zindulka
[off-list ref] wrote:
quoted
quoted
quoted
quoted
I tried to emulate your test case in my VM, here is the script I use:

====
ip li set dev dummy0 up
tc qd add dev dummy0 root handle 1: htb default 1
for i in `seq 1 1000`
do
  tc class add dev dummy0 parent 1:0 classid 1:$i htb rate 1mbit ceil 1.5mbit
  tc qd add dev dummy0 parent 1:$i fq_codel
done

time tc qd del dev dummy0 root
====

And this is the result:

    Before my patch:
     real   0m0.488s
     user   0m0.000s
     sys    0m0.325s

    After my patch:
     real   0m0.180s
     user   0m0.000s
     sys    0m0.132s
My results with your test script.

before patch:
/usr/bin/time -p tc qdisc del dev enp1s0f0 root
real 1.63
user 0.00
sys 1.63

after patch:
/usr/bin/time -p tc qdisc del dev enp1s0f0 root
real 1.55
user 0.00
sys 1.54
quoted
This is an obvious improvement, so I have no idea why you didn't
catch any difference.
We use hfsc instead of htb. I don't know whether it may cause any
difference. I can provide you with my test scripts if necessary.
Yeah, you can try to replace the htb with hfsc in my script,
I didn't spend time to figure out hfsc parameters.
class add dev dummy0 parent 1:0 classid 1:$i hfsc ls m1 0 d 0 m2
13107200 ul m1 0 d 0 m2 13107200

but it behaves the same as htb...
quoted
My point here is, if I can see the difference with merely 1000
tc classes, you should see a bigger difference with hundreds
of thousands classes in your setup. So, I don't know why you
saw a relatively smaller difference.
I saw a relatively big difference. It was about 1.5s faster on my huge
setup which is a lot. Yet maybe the problem is caused by something
What percentage? IIUC, without patch it took you about 11s, so
1.5s faster means 13% improvement for you?
My whole setup needs 22.17 seconds to delete with an unpatched kernel.
With your patches applied it is 21.08. So it varies between 1 - 1.5s.
Improvement is about 5 - 6%.
quoted
else? I thought about tx/rx queues. RJ45 ports have up to 4 tx and rx
queues. SFP+ interfaces have much higher limits. 8 or even 64 possible
queues. I've tried to increase the number of queues using ethtool from
4 to 8 and decreased to 2. But there was no difference. It was about
1.62 - 1.63 with an unpatched kernel and about 1.55 - 1.58 with your
patches applied. I've tried it for ifb and RJ45 interfaces where it
took about 0.02 - 0.03 with an unpatched kernel and 0.05 with your
patches applied, which is strange, but it may be caused by the fact it
was very fast even before.
That is odd. In fact, this is highly related to number of TX queues,
because the existing code resets the qdisc's once for each TX
queue, so the more TX queues you have, the more resets kernel
will do, that is the more time it will take.
Can't the problem be caused that reset is done on active and inactive
queues every time? It would explain why it had no effect in decreasing
and increasing the number of active queues. Yet it doesn't explain why
Intel card (82599ES) with 64 possible queues has exactly the same
problem as Mellanox (ConnectX-4 LX) with 8 possible queues.

I've been playing with this problem today. Every deleted fq_codel
qdisc, root and non root, requires a network adapter to empty all
possible queues. But not just the active ones, but it cleared all
possible queues. Event those the adapter can't even use. For every
SFP+ I have tested it calls fq_codel_reset() and would call any other
reset function. 64 times for egress qdisc and 64 times for ingress
qdisc. Even when ingress is not defined. Solution to this whole
problem would be to let reset only activated queues. On the RJ45
interface I've tested there are 8 possible queues. So the reset is
done 8 times for every deleted qdisc. 16 in total, since ingress and
egress are processed all the time.

So a little bit of calculation. My initial setup contained 13706 qdisc
rules. For ifb it means 13706 for egress and 13706 for ingress. 27412
reset calls because there can be only one transmit queue for the ifb
interface. Average time spent between fq_codel_reset() according to my
initial perf reports is somewhat between 7 - 16 micro seconds. 27412 *
0.000008 = 0.219296 s. For RJ45 interface it does 8 calls for every
qdisc. 13706 * 8 * 2 = 219296 resets. 219296 * 0.000008 = 1.754368 s.
It is still ok. For SFP+ interface it is 64 resets per qdisc rule.
13706 * 64 * 2 = 1754368. So we are very close to the huge number I
noticed initially using printk. And now 1754368 * 0.000008 =
14.034944s. In case of slowest calls 1754368 * 0.000016 = 28.069888.
Woala. Gotcha. So my final judgement is - don't empty something we
don't use anyway. For Intel it may be reasonable, it can have all 64
queues defined. Mellanox has its limit at 8. But it still is being
reset 64 times. We mostly decrease the number of queues to 4.
Sometimes 2 according to the CPU used. Yet every CPU had to handle 64
resets.

With the attached patch I'm down to 1.7 seconds - more than 90%
improvement :-) Can you please check it and pass it to proper places?
According to debugging printk messages it empties only active queues.

Thank you for all your help and effort.

I plan to address this later on top of the existing patches.

Thanks.

Attachments

Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help