Re: Kernel 4.6.7-rt13: Intel Ethernet driver igb causes huge latencies in cyclictest

From: Koehrer Mathias (ETAS/ESW5) <hidden>
Date: 2016-10-07 08:58:11
Also in: intel-wired-lan, linux-rt-users

Hi Mitch,

quoted

Although, to be clear, it isn't the fact that there exists 8
threads, it's

that the device is

quoted

firing all 8 interrupts at the same time.  The time spent in hardirq

context just waking

quoted

up all 8 of those threads (and the cyclictest wakeup) is enough to
cause

your

quoted

regression.

netdev/igb folks-

Under what conditions should it be expected that the i350 trigger
all of

the TxRx

quoted

interrupts simultaneously?  Any ideas here?

I can answer that! I wrote that code.

We trigger the interrupts once a second because MSI and MSI-X interrupts are NOT
guaranteed to be delivered. If this happens, the queues being serviced by this "lost"
interrupt are completely stuck.

The device automatically masks each interrupt vector after it fires, expecting the
ISR to re-enable the vector after processing is complete. If the interrupt is lost, the
ISR doesn't run, so the vector ends up permanently masked. At this point, any
queues associated with that vector are stuck. The only recovery is through the
netdev watchdog, which initiates a reset.

During development of igb, we had several platforms with chipsets that routinely
dropped MSI messages under stress. Things would be running fine and then, pow,
all the traffic on a queue would stop.

So, I added code to fire each vector once per second. Just unmasking the interrupt
isn't enough - we need to trigger the ISR to get the queues cleaned up so the device
can work again.

Is this workaround still needed? I don't know. Modern chipsets don't break a sweat
handling gigabit-speed traffic, and they *probably* don't drop interrupts. But I'd still
rather have that insurance.

You could try to remove the write to the EICS registers in the watchdog task to see if
that takes care of your problem. But I wouldn't want to remove that code
permanently, because we have seen lost interrupts in the past.

You also could try staggering the writes so that not all vectors fire each second. But
then you'll potentially incur a much longer delay if an interrupt does get lost, which
means you could trigger netdev watchdog events.

-Mitch

Thanks for the explanation. Now I have switched over to kernel 4.8-rt1.
This shows the very same effect.
I modified the in-kernel's igb_main.c (function igb_watchdog_task) to comment out
the access to the EICS registers:

--- igb_main.c.orig     2016-10-07 10:43:37.855873754 +0200
+++ igb_main.c  2016-10-07 10:31:51.451346292 +0200

@@ -4613,6 +4613,7 @@ no_wait:
        }

        /* Cause software interrupt to ensure Rx ring is cleaned */
+#if 0
        if (adapter->flags & IGB_FLAG_HAS_MSIX) {
                u32 eics = 0;

@@ -4622,6 +4623,7 @@ no_wait:
        } else {
                wr32(E1000_ICS, E1000_ICS_RXDMT0);
        }
+#endif

        igb_spoof_check(adapter);
        igb_ptp_rx_hang(adapter);

The result is now slighty better, however the max latency that has been measured by
cyclictest is still much higher compared to the values of kernel 3.18.27-rt27.
I have now enabled all events, the execution of 
# cyclictest -a -i 105 -m -n -p 80 -t 1  -b 30 -C
delivers the following trace values


cyclicte-10062   0....... 10025813402us : sys_exit: NR 230 = 0
cyclicte-10062   0....... 10025813402us : sys_enter: NR 230 (1, 1, 7ffff73ff930, 0, 0, 2bd7e12e)
cyclicte-10062   0....... 10025813402us : hrtimer_init: hrtimer=ffff88040a017e18 clockid=CLOCK_MONOTONIC mode=HRTIMER_MODE_ABS
cyclicte-10062   0d...1.. 10025813403us : hrtimer_start: hrtimer=ffff88040a017e18 function=hrtimer_wakeup expires=10024735653388 softexpires=10024735653388
cyclicte-10062   0d...1.. 10025813403us : write_msr: 6e0, value 20ca630b9aef
cyclicte-10062   0d...1.. 10025813403us : rcu_utilization: Start context switch
cyclicte-10062   0d...1.. 10025813403us : rcu_utilization: End context switch
cyclicte-10062   0d...2.. 10025813404us : sched_switch: prev_comm=cyclictest prev_pid=10062 prev_prio=19 prev_state=S ==> next_comm=kworker/0:3 next_pid=1388 next_prio=120
cyclicte-10062   0d...2.. 10025813404us+: x86_fpu_regs_deactivated: x86/fpu: ffff88040c603ec0 fpregs_active: 0 fpstate_active: 1 counter: 69 xfeatures: 2 xcomp_bv: 0
kworker/-1388    0d..h... 10025813468us : irq_handler_entry: irq=48 name=eth2-tx-0
kworker/-1388    0d..h... 10025813468us : irq_handler_exit: irq=48 ret=handled
kworker/-1388    0d..h1.. 10025813469us : sched_waking: comm=irq/48-eth2-tx- pid=10057 prio=49 target_cpu=000
kworker/-1388    0dN.h2.. 10025813469us : sched_wakeup: comm=irq/48-eth2-tx- pid=10057 prio=49 target_cpu=000
kworker/-1388    0dN.h1.. 10025813470us : irq_handler_entry: irq=47 name=eth2-rx-0
kworker/-1388    0dN.h1.. 10025813470us : irq_handler_exit: irq=47 ret=handled
kworker/-1388    0dN.h2.. 10025813471us : sched_waking: comm=irq/47-eth2-rx- pid=10056 prio=49 target_cpu=000
kworker/-1388    0dN.h3.. 10025813471us : sched_wakeup: comm=irq/47-eth2-rx- pid=10056 prio=49 target_cpu=000
kworker/-1388    0dN..1.. 10025813472us : rcu_utilization: Start context switch
kworker/-1388    0dN..1.. 10025813472us : rcu_utilization: End context switch
kworker/-1388    0dN..2.. 10025813472us : sched_stat_runtime: comm=kworker/0:3 pid=1388 runtime=67566 [ns] vruntime=101216288332 [ns]
kworker/-1388    0d...2.. 10025813472us : sched_switch: prev_comm=kworker/0:3 prev_pid=1388 prev_prio=120 prev_state=R+ ==> next_comm=irq/48-eth2-tx- next_pid=10057 next_prio=49
irq/48-e-10057   0d....11 10025813473us : softirq_raise: vec=3 [action=NET_RX]
irq/48-e-10057   0.....12 10025813474us : softirq_entry: vec=3 [action=NET_RX]
irq/48-e-10057   0.....12 10025813475us : napi_poll: napi poll on napi struct ffff88040a582850 for device eth2 work 0 budget 64
irq/48-e-10057   0.....12 10025813475us : softirq_exit: vec=3 [action=NET_RX]
irq/48-e-10057   0d...1.. 10025813475us : rcu_utilization: Start context switch
irq/48-e-10057   0d...1.. 10025813476us : rcu_utilization: End context switch
irq/48-e-10057   0d...2.. 10025813476us : sched_switch: prev_comm=irq/48-eth2-tx- prev_pid=10057 prev_prio=49 prev_state=S ==> next_comm=irq/47-eth2-rx- next_pid=10056 next_prio=49
irq/47-e-10056   0d....11 10025813477us : softirq_raise: vec=3 [action=NET_RX]
irq/47-e-10056   0.....12 10025813477us : softirq_entry: vec=3 [action=NET_RX]
irq/47-e-10056   0.....12 10025813478us : napi_poll: napi poll on napi struct ffff88040a580850 for device eth2 work 0 budget 64
irq/47-e-10056   0.....12 10025813478us : softirq_exit: vec=3 [action=NET_RX]
irq/47-e-10056   0d...1.. 10025813479us : rcu_utilization: Start context switch
irq/47-e-10056   0d...1.. 10025813479us : rcu_utilization: End context switch
irq/47-e-10056   0d...2.. 10025813479us+: sched_switch: prev_comm=irq/47-eth2-rx- prev_pid=10056 prev_prio=49 prev_state=S ==> next_comm=kworker/0:3 next_pid=1388 next_prio=120
kworker/-1388    0d..h... 10025813516us : local_timer_entry: vector=239
kworker/-1388    0d..h1.. 10025813516us : hrtimer_interrupt: cpu=0 offset=-28999 curr=kworker/0:3[120] thread=cyclictest[19]
kworker/-1388    0d..h1.. 10025813517us : hrtimer_cancel: hrtimer=ffff88040a017e18
kworker/-1388    0d..h... 10025813517us : hrtimer_expire_entry: hrtimer=ffff88040a017e18 function=hrtimer_wakeup now=10024735682387
kworker/-1388    0d..h1.. 10025813517us : sched_waking: comm=cyclictest pid=10062 prio=19 target_cpu=000
kworker/-1388    0dN.h2.. 10025813517us : sched_wakeup: comm=cyclictest pid=10062 prio=19 target_cpu=000
kworker/-1388    0dN.h... 10025813517us : hrtimer_expire_exit: hrtimer=ffff88040a017e18
kworker/-1388    0dN.h1.. 10025813518us : hrtimer_interrupt: cpu=0 offset=317613 curr=kworker/0:3[120] thread=<none>[-1]
kworker/-1388    0dN.h... 10025813518us : write_msr: 6e0, value 20ca631e99f8
kworker/-1388    0dN.h... 10025813518us : local_timer_exit: vector=239
kworker/-1388    0dN..1.. 10025813518us : rcu_utilization: Start context switch
kworker/-1388    0dN..1.. 10025813518us : rcu_utilization: End context switch
kworker/-1388    0dN..2.. 10025813518us : sched_stat_runtime: comm=kworker/0:3 pid=1388 runtime=38173 [ns] vruntime=101216326505 [ns]
kworker/-1388    0d...2.. 10025813519us : sched_switch: prev_comm=kworker/0:3 prev_pid=1388 prev_prio=120 prev_state=R+ ==> next_comm=cyclictest next_pid=10062 next_prio=19
kworker/-1388    0d...2.. 10025813519us : x86_fpu_regs_activated: x86/fpu: ffff88040c603ec0 fpregs_active: 1 fpstate_active: 1 counter: 70 xfeatures: 2 xcomp_bv: 0
kworker/-1388    0d...2.. 10025813519us : write_msr: c0000100, value 7ffff7400700
cyclicte-10062   0....... 10025813519us : sys_exit: NR 230 = 0
cyclicte-10062   0....... 10025813524us : sys_enter: NR 1 (5, 7ffff7400300, 1f, 7ffff77a5460, 2, 7ffff744c99a)
cyclicte-10062   0.....11 10025813526us : tracing_mark_write: hit latency threshold (32 > 30)
cyclicte-10062   0....... 10025813526us : sys_exit: NR 1 = 31
cyclicte-10062   0....... 10025813527us : sys_enter: NR 1 (4, 4076b0, 2, 7ffff77a5460, 2, 7ffff744c99a)

Also, I notice that there is still roughly one irq every 2 seconds on irq/47 (eth2-rx-0) and irq/48 (eth2-tx-0).

Any idea or feedback on this?

Thanks

Mathias

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help