RE: Kernel 4.6.7-rt13: Intel Ethernet driver igb causes huge latencies in cyclictest
From: Williams, Mitch A <hidden>
Date: 2016-10-06 17:58:33
Also in:
intel-wired-lan, linux-rt-users
-----Original Message----- From: Intel-wired-lan [mailto:intel-wired-lan-bounces@lists.osuosl.org] On Behalf Of Koehrer Mathias (ETAS/ESW5) Sent: Thursday, October 06, 2016 12:02 AM To: Julia Cartwright <redacted>; Kirsher, Jeffrey T [off-list ref]; Greg [off-list ref] Cc: netdev@vger.kernel.org; intel-wired-lan@lists.osuosl.org; linux-rt- users@vger.kernel.org; Sebastian Andrzej Siewior [off-list ref] Subject: Re: [Intel-wired-lan] Kernel 4.6.7-rt13: Intel Ethernet driver igb causes huge latencies in cyclictest Hi all,quoted
Although, to be clear, it isn't the fact that there exists 8 threads, it'sthat the device isquoted
firing all 8 interrupts at the same time. The time spent in hardirqcontext just wakingquoted
up all 8 of those threads (and the cyclictest wakeup) is enough to causeyourquoted
regression. netdev/igb folks- Under what conditions should it be expected that the i350 trigger all ofthe TxRxquoted
interrupts simultaneously? Any ideas here?
I can answer that! I wrote that code. We trigger the interrupts once a second because MSI and MSI-X interrupts are NOT guaranteed to be delivered. If this happens, the queues being serviced by this "lost" interrupt are completely stuck. The device automatically masks each interrupt vector after it fires, expecting the ISR to re-enable the vector after processing is complete. If the interrupt is lost, the ISR doesn't run, so the vector ends up permanently masked. At this point, any queues associated with that vector are stuck. The only recovery is through the netdev watchdog, which initiates a reset. During development of igb, we had several platforms with chipsets that routinely dropped MSI messages under stress. Things would be running fine and then, pow, all the traffic on a queue would stop. So, I added code to fire each vector once per second. Just unmasking the interrupt isn't enough - we need to trigger the ISR to get the queues cleaned up so the device can work again. Is this workaround still needed? I don't know. Modern chipsets don't break a sweat handling gigabit-speed traffic, and they *probably* don't drop interrupts. But I'd still rather have that insurance. You could try to remove the write to the EICS registers in the watchdog task to see if that takes care of your problem. But I wouldn't want to remove that code permanently, because we have seen lost interrupts in the past. You also could try staggering the writes so that not all vectors fire each second. But then you'll potentially incur a much longer delay if an interrupt does get lost, which means you could trigger netdev watchdog events. -Mitch
quoted
See the start of this thread here: http://lkml.kernel.org/r/d648628329bc446fa63b5e19d4d3fb56@FE- MBX1012.de.bosch.comGreg recommended to use "ethtool -L eth2 combined 1" to reduce the number of queues. I tried that. Now, I have actually only three irqs (eth2, eth2-rx-0, eth2- tx-0). However the issue remains the same. I ran the cyclictest again: # cyclictest -a -i 105 -m -n -p 80 -t 1 -b 23 -C (Note: When using 105us instead of 100us the long latencies seem to occur more often). Here are the final lines of the kernel trace output: <idle>-0 4d...2.. 1344661649us : sched_switch: prev_comm=swapper/4 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcuc/4 next_pid=56 next_prio=98 ktimerso-46 3d...2.. 1344661650us : sched_switch: prev_comm=ktimersoftd/3 prev_pid=46 prev_prio=98 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120 ktimerso-24 1d...2.. 1344661650us : sched_switch: prev_comm=ktimersoftd/1 prev_pid=24 prev_prio=98 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120 ktimerso-79 6d...2.. 1344661650us : sched_switch: prev_comm=ktimersoftd/6 prev_pid=79 prev_prio=98 prev_state=S ==> next_comm=swapper/6 next_pid=0 next_prio=120 ktimerso-35 2d...2.. 1344661650us : sched_switch: prev_comm=ktimersoftd/2 prev_pid=35 prev_prio=98 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120 rcuc/5-67 5d...2.. 1344661650us : sched_switch: prev_comm=rcuc/5 prev_pid=67 prev_prio=98 prev_state=S ==> next_comm=ktimersoftd/5 next_pid=68 next_prio=98 rcuc/7-89 7d...2.. 1344661650us : sched_switch: prev_comm=rcuc/7 prev_pid=89 prev_prio=98 prev_state=S ==> next_comm=ktimersoftd/7 next_pid=90 next_prio=98 ktimerso-4 0d...211 1344661650us : sched_wakeup: comm=rcu_preempt pid=8 prio=98 target_cpu=000 rcuc/4-56 4d...2.. 1344661651us : sched_switch: prev_comm=rcuc/4 prev_pid=56 prev_prio=98 prev_state=S ==> next_comm=ktimersoftd/4 next_pid=57 next_prio=98 ktimerso-4 0d...2.. 1344661651us : sched_switch: prev_comm=ktimersoftd/0 prev_pid=4 prev_prio=98 prev_state=S ==> next_comm=rcu_preempt next_pid=8 next_prio=98 ktimerso-90 7d...2.. 1344661651us : sched_switch: prev_comm=ktimersoftd/7 prev_pid=90 prev_prio=98 prev_state=S ==> next_comm=swapper/7 next_pid=0 next_prio=120 ktimerso-68 5d...2.. 1344661651us : sched_switch: prev_comm=ktimersoftd/5 prev_pid=68 prev_prio=98 prev_state=S ==> next_comm=swapper/5 next_pid=0 next_prio=120 rcu_pree-8 0d...3.. 1344661652us : sched_wakeup: comm=rcuop/0 pid=10 prio=120 target_cpu=000 ktimerso-57 4d...2.. 1344661652us : sched_switch: prev_comm=ktimersoftd/4 prev_pid=57 prev_prio=98 prev_state=S ==> next_comm=swapper/4 next_pid=0 next_prio=120 rcu_pree-8 0d...2.. 1344661653us+: sched_switch: prev_comm=rcu_preempt prev_pid=8 prev_prio=98 prev_state=S ==> next_comm=kworker/0:0 next_pid=5 next_prio=120 kworker/-5 0dN.h2.. 1344661741us : sched_wakeup: comm=cyclictest pid=6314 prio=19 target_cpu=000 kworker/-5 0d...2.. 1344661742us : sched_switch: prev_comm=kworker/0:0 prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest next_pid=6314 next_prio=19 cyclicte-6314 0d...2.. 1344661743us : sched_switch: prev_comm=cyclictest prev_pid=6314 prev_prio=19 prev_state=S ==> next_comm=rcuop/0 next_pid=10 next_prio=120 rcuop/0-10 0d...2.. 1344661744us!: sched_switch: prev_comm=rcuop/0 prev_pid=10 prev_prio=120 prev_state=S ==> next_comm=kworker/0:0 next_pid=5 next_prio=120 kworker/-5 0dN.h2.. 1344661858us : sched_wakeup: comm=cyclictest pid=6314 prio=19 target_cpu=000 kworker/-5 0d...2.. 1344661859us : sched_switch: prev_comm=kworker/0:0 prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest next_pid=6314 next_prio=19 cyclicte-6314 0d...2.. 1344661860us!: sched_switch: prev_comm=cyclictest prev_pid=6314 prev_prio=19 prev_state=S ==> next_comm=kworker/0:0 next_pid=5 next_prio=120 kworker/-5 0dN.h2.. 1344661966us : sched_wakeup: comm=cyclictest pid=6314 prio=19 target_cpu=000 kworker/-5 0d...2.. 1344661966us : sched_switch: prev_comm=kworker/0:0 prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest next_pid=6314 next_prio=19 cyclicte-6314 0d...2.. 1344661967us+: sched_switch: prev_comm=cyclictest prev_pid=6314 prev_prio=19 prev_state=S ==> next_comm=kworker/0:0 next_pid=5 next_prio=120 kworker/-5 0dN.h2.. 1344662052us : sched_wakeup: comm=cyclictest pid=6314 prio=19 target_cpu=000 kworker/-5 0d...2.. 1344662053us : sched_switch: prev_comm=kworker/0:0 prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest next_pid=6314 next_prio=19 cyclicte-6314 0d...2.. 1344662054us!: sched_switch: prev_comm=cyclictest prev_pid=6314 prev_prio=19 prev_state=S ==> next_comm=kworker/0:0 next_pid=5 next_prio=120 kworker/-5 0dN.h2.. 1344662168us : sched_wakeup: comm=cyclictest pid=6314 prio=19 target_cpu=000 kworker/-5 0d...2.. 1344662168us : sched_switch: prev_comm=kworker/0:0 prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest next_pid=6314 next_prio=19 cyclicte-6314 0d...2.. 1344662169us+: sched_switch: prev_comm=cyclictest prev_pid=6314 prev_prio=19 prev_state=S ==> next_comm=kworker/0:0 next_pid=5 next_prio=120 kworker/-5 0dN.h2.. 1344662255us : sched_wakeup: comm=irq/48-eth2-tx- pid=6310 prio=49 target_cpu=000 kworker/-5 0dN.h3.. 1344662256us : sched_wakeup: comm=irq/47-eth2-rx- pid=6309 prio=49 target_cpu=000 kworker/-5 0d...2.. 1344662256us : sched_switch: prev_comm=kworker/0:0 prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=irq/48-eth2-tx- next_pid=6310 next_prio=49 irq/48-e-6310 0d...2.. 1344662259us : sched_switch: prev_comm=irq/48- eth2-tx- prev_pid=6310 prev_prio=49 prev_state=S ==> next_comm=irq/47-eth2- rx- next_pid=6309 next_prio=49 irq/47-e-6309 0d...2.. 1344662260us+: sched_switch: prev_comm=irq/47- eth2-rx- prev_pid=6309 prev_prio=49 prev_state=S ==> next_comm=kworker/0:0 next_pid=5 next_prio=120 kworker/-5 0dN.h2.. 1344662300us : sched_wakeup: comm=cyclictest pid=6314 prio=19 target_cpu=000 kworker/-5 0d...2.. 1344662300us : sched_switch: prev_comm=kworker/0:0 prev_pid=5 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest next_pid=6314 next_prio=19 cyclicte-6314 0.....11 1344662306us : tracing_mark_write: hit latency threshold (39 > 23) Just before the long latency, the irqs "48-eth2-tx" and "48-eth2-rx" are active. When looking at the 4th line from the bottom, the time for irq/47 is 1344662260us, for the next line (kworker) it is 1344662300us. Does this mean that the irq/47 took 40us for irq processing? Or is this a misinterpretation? For more lines of the trace please see the attached trace-extract.gz. Thanks for any feedback. Regard Mahias