RE: Kernel 4.6.7-rt13: Intel Ethernet driver igb causes huge latencies in cyclictest

From: Williams, Mitch A <hidden>
Date: 2016-10-06 17:58:33
Also in: intel-wired-lan, linux-rt-users

-----Original Message-----
From: Intel-wired-lan [mailto:intel-wired-lan-bounces@lists.osuosl.org] On
Behalf Of Koehrer Mathias (ETAS/ESW5)
Sent: Thursday, October 06, 2016 12:02 AM
To: Julia Cartwright <redacted>; Kirsher, Jeffrey T
[off-list ref]; Greg [off-list ref]
Cc: netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org; linux-rt-
users@vger.kernel.org; Sebastian Andrzej Siewior
[off-list ref]
Subject: Re: [Intel-wired-lan] Kernel 4.6.7-rt13: Intel Ethernet driver igb
causes huge latencies in cyclictest

Hi all,

quoted

Although, to be clear, it isn't the fact that there exists 8 threads, it's

that the device is

quoted

firing all 8 interrupts at the same time.  The time spent in hardirq

context just waking

quoted

up all 8 of those threads (and the cyclictest wakeup) is enough to cause

your

quoted

regression.

netdev/igb folks-

Under what conditions should it be expected that the i350 trigger all of

the TxRx

quoted

interrupts simultaneously?  Any ideas here?

I can answer that! I wrote that code.

We trigger the interrupts once a second because MSI and MSI-X interrupts are NOT guaranteed to be delivered. If this happens, the queues being serviced by this "lost" interrupt are completely stuck.

The device automatically masks each interrupt vector after it fires, expecting the ISR to re-enable the vector after processing is complete. If the interrupt is lost, the ISR doesn't run, so the vector ends up permanently masked. At this point, any queues associated with that vector are stuck. The only recovery is through the netdev watchdog, which initiates a reset.

During development of igb, we had several platforms with chipsets that routinely dropped MSI messages under stress. Things would be running fine and then, pow, all the traffic on a queue would stop.

So, I added code to fire each vector once per second. Just unmasking the interrupt isn't enough - we need to trigger the ISR to get the queues cleaned up so the device can work again.

Is this workaround still needed? I don't know. Modern chipsets don't break a sweat handling gigabit-speed traffic, and they *probably* don't drop interrupts. But I'd still rather have that insurance.

You could try to remove the write to the EICS registers in the watchdog task to see if that takes care of your problem. But I wouldn't want to remove that code permanently, because we have seen lost interrupts in the past.

You also could try staggering the writes so that not all vectors fire each second. But then you'll potentially incur a much longer delay if an interrupt does get lost, which means you could trigger netdev watchdog events.

-Mitch

quoted

See the start of this thread here:

  http://lkml.kernel.org/r/d648628329bc446fa63b5e19d4d3fb56@FE-
MBX1012.de.bosch.com

Greg recommended to use "ethtool -L eth2 combined 1" to reduce the number of
queues.
I tried that. Now, I have actually only three irqs (eth2, eth2-rx-0, eth2-
tx-0).
However the issue remains the same.

I ran the cyclictest again:
# cyclictest -a -i 105 -m -n -p 80 -t 1  -b 23 -C
(Note: When using 105us instead of 100us the long latencies seem to occur
more often).

Here are the final lines of the kernel trace output:
  <idle>-0       4d...2.. 1344661649us : sched_switch: prev_comm=swapper/4
prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcuc/4 next_pid=56
next_prio=98
ktimerso-46      3d...2.. 1344661650us : sched_switch:
prev_comm=ktimersoftd/3 prev_pid=46 prev_prio=98 prev_state=S ==>
next_comm=swapper/3 next_pid=0 next_prio=120
ktimerso-24      1d...2.. 1344661650us : sched_switch:
prev_comm=ktimersoftd/1 prev_pid=24 prev_prio=98 prev_state=S ==>
next_comm=swapper/1 next_pid=0 next_prio=120
ktimerso-79      6d...2.. 1344661650us : sched_switch:
prev_comm=ktimersoftd/6 prev_pid=79 prev_prio=98 prev_state=S ==>
next_comm=swapper/6 next_pid=0 next_prio=120
ktimerso-35      2d...2.. 1344661650us : sched_switch:
prev_comm=ktimersoftd/2 prev_pid=35 prev_prio=98 prev_state=S ==>
next_comm=swapper/2 next_pid=0 next_prio=120
  rcuc/5-67      5d...2.. 1344661650us : sched_switch: prev_comm=rcuc/5
prev_pid=67 prev_prio=98 prev_state=S ==> next_comm=ktimersoftd/5
next_pid=68 next_prio=98
  rcuc/7-89      7d...2.. 1344661650us : sched_switch: prev_comm=rcuc/7
prev_pid=89 prev_prio=98 prev_state=S ==> next_comm=ktimersoftd/7
next_pid=90 next_prio=98
ktimerso-4       0d...211 1344661650us : sched_wakeup: comm=rcu_preempt
pid=8 prio=98 target_cpu=000
  rcuc/4-56      4d...2.. 1344661651us : sched_switch: prev_comm=rcuc/4
prev_pid=56 prev_prio=98 prev_state=S ==> next_comm=ktimersoftd/4
next_pid=57 next_prio=98
ktimerso-4       0d...2.. 1344661651us : sched_switch:
prev_comm=ktimersoftd/0 prev_pid=4 prev_prio=98 prev_state=S ==>
next_comm=rcu_preempt next_pid=8 next_prio=98
ktimerso-90      7d...2.. 1344661651us : sched_switch:
prev_comm=ktimersoftd/7 prev_pid=90 prev_prio=98 prev_state=S ==>
next_comm=swapper/7 next_pid=0 next_prio=120
ktimerso-68      5d...2.. 1344661651us : sched_switch:
prev_comm=ktimersoftd/5 prev_pid=68 prev_prio=98 prev_state=S ==>
next_comm=swapper/5 next_pid=0 next_prio=120
rcu_pree-8       0d...3.. 1344661652us : sched_wakeup: comm=rcuop/0 pid=10
prio=120 target_cpu=000
ktimerso-57      4d...2.. 1344661652us : sched_switch:
prev_comm=ktimersoftd/4 prev_pid=57 prev_prio=98 prev_state=S ==>
next_comm=swapper/4 next_pid=0 next_prio=120
rcu_pree-8       0d...2.. 1344661653us+: sched_switch: prev_comm=rcu_preempt
prev_pid=8 prev_prio=98 prev_state=S ==> next_comm=kworker/0:0 next_pid=5
next_prio=120
kworker/-5       0dN.h2.. 1344661741us : sched_wakeup: comm=cyclictest
pid=6314 prio=19 target_cpu=000
kworker/-5       0d...2.. 1344661742us : sched_switch: prev_comm=kworker/0:0
prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest
next_pid=6314 next_prio=19
cyclicte-6314    0d...2.. 1344661743us : sched_switch: prev_comm=cyclictest
prev_pid=6314 prev_prio=19 prev_state=S ==> next_comm=rcuop/0 next_pid=10
next_prio=120
 rcuop/0-10      0d...2.. 1344661744us!: sched_switch: prev_comm=rcuop/0
prev_pid=10 prev_prio=120 prev_state=S ==> next_comm=kworker/0:0 next_pid=5
next_prio=120
kworker/-5       0dN.h2.. 1344661858us : sched_wakeup: comm=cyclictest
pid=6314 prio=19 target_cpu=000
kworker/-5       0d...2.. 1344661859us : sched_switch: prev_comm=kworker/0:0
prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest
next_pid=6314 next_prio=19
cyclicte-6314    0d...2.. 1344661860us!: sched_switch: prev_comm=cyclictest
prev_pid=6314 prev_prio=19 prev_state=S ==> next_comm=kworker/0:0 next_pid=5
next_prio=120
kworker/-5       0dN.h2.. 1344661966us : sched_wakeup: comm=cyclictest
pid=6314 prio=19 target_cpu=000
kworker/-5       0d...2.. 1344661966us : sched_switch: prev_comm=kworker/0:0
prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest
next_pid=6314 next_prio=19
cyclicte-6314    0d...2.. 1344661967us+: sched_switch: prev_comm=cyclictest
prev_pid=6314 prev_prio=19 prev_state=S ==> next_comm=kworker/0:0 next_pid=5
next_prio=120
kworker/-5       0dN.h2.. 1344662052us : sched_wakeup: comm=cyclictest
pid=6314 prio=19 target_cpu=000
kworker/-5       0d...2.. 1344662053us : sched_switch: prev_comm=kworker/0:0
prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest
next_pid=6314 next_prio=19
cyclicte-6314    0d...2.. 1344662054us!: sched_switch: prev_comm=cyclictest
prev_pid=6314 prev_prio=19 prev_state=S ==> next_comm=kworker/0:0 next_pid=5
next_prio=120
kworker/-5       0dN.h2.. 1344662168us : sched_wakeup: comm=cyclictest
pid=6314 prio=19 target_cpu=000
kworker/-5       0d...2.. 1344662168us : sched_switch: prev_comm=kworker/0:0
prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest
next_pid=6314 next_prio=19
cyclicte-6314    0d...2.. 1344662169us+: sched_switch: prev_comm=cyclictest
prev_pid=6314 prev_prio=19 prev_state=S ==> next_comm=kworker/0:0 next_pid=5
next_prio=120
kworker/-5       0dN.h2.. 1344662255us : sched_wakeup: comm=irq/48-eth2-tx-
pid=6310 prio=49 target_cpu=000
kworker/-5       0dN.h3.. 1344662256us : sched_wakeup: comm=irq/47-eth2-rx-
pid=6309 prio=49 target_cpu=000
kworker/-5       0d...2.. 1344662256us : sched_switch: prev_comm=kworker/0:0
prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=irq/48-eth2-tx-
next_pid=6310 next_prio=49
irq/48-e-6310    0d...2.. 1344662259us : sched_switch: prev_comm=irq/48-
eth2-tx- prev_pid=6310 prev_prio=49 prev_state=S ==> next_comm=irq/47-eth2-
rx- next_pid=6309 next_prio=49
irq/47-e-6309    0d...2.. 1344662260us+: sched_switch: prev_comm=irq/47-
eth2-rx- prev_pid=6309 prev_prio=49 prev_state=S ==> next_comm=kworker/0:0
next_pid=5 next_prio=120
kworker/-5       0dN.h2.. 1344662300us : sched_wakeup: comm=cyclictest
pid=6314 prio=19 target_cpu=000
kworker/-5       0d...2.. 1344662300us : sched_switch: prev_comm=kworker/0:0
prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest
next_pid=6314 next_prio=19
cyclicte-6314    0.....11 1344662306us : tracing_mark_write: hit latency
threshold (39 > 23)

Just before the long latency,  the irqs "48-eth2-tx" and "48-eth2-rx" are
active.
When looking at the 4th line from the bottom, the time for irq/47 is
1344662260us, for the next line (kworker) it is 1344662300us.
Does this mean that the irq/47 took 40us for irq processing? Or is this a
misinterpretation?

For more lines of the trace please see the attached trace-extract.gz.

Thanks for any feedback.

Regard

Mahias

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help