Re: 5.4.13-rt7 stall on CPU?
From: Udo van den Heuvel <hidden>
Date: 2020-07-20 09:21:10
Hello Sebastian, all, I found another one of these in my logs: Jul 19 05:33:05 vuurmuur named[2363288]: dispatch 0x7f1e3d7fa6c0: shutting down due to TCP receive error: 2001:500:200::b#53: connection reset Jul 20 02:20:24 vuurmuur kernel: 002: rcu: INFO: rcu_preempt self-detected stall on CPU Jul 20 02:20:24 vuurmuur kernel: 002: rcu: 2-....: (5250 ticks this GP) idle=5c6/1/0x4000000000000002 softirq=401431806/401431806 fqs=2372 Jul 20 02:20:24 vuurmuur kernel: 002: (t=5250 jiffies g=701989125 q=336) Jul 20 02:20:24 vuurmuur kernel: 002: NMI backtrace for cpu 2 Jul 20 02:20:24 vuurmuur kernel: 002: CPU: 2 PID: 3468730 Comm: ntpd Tainted: G O 5.4.13-rt7 #9 Jul 20 02:20:24 vuurmuur kernel: 002: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./QC5000M-ITX/PH, BIOS P1.10 05/06/2015 Jul 20 02:20:24 vuurmuur kernel: 002: Call Trace: Jul 20 02:20:24 vuurmuur kernel: 002: <IRQ> Jul 20 02:20:24 vuurmuur kernel: 002: dump_stack+0x50/0x70 Jul 20 02:20:24 vuurmuur kernel: 002: nmi_cpu_backtrace.cold+0x14/0x53 Jul 20 02:20:24 vuurmuur kernel: 002: ? lapic_can_unplug_cpu.cold+0x3b/0x3b Jul 20 02:20:24 vuurmuur kernel: 002: nmi_trigger_cpumask_backtrace+0x8e/0xa2 Jul 20 02:20:24 vuurmuur kernel: 002: rcu_dump_cpu_stacks+0x8b/0xb9 Jul 20 02:20:24 vuurmuur kernel: 002: rcu_sched_clock_irq.cold+0x17e/0x4fd Jul 20 02:20:24 vuurmuur kernel: 002: ? account_system_index_time+0xa6/0xd0 Jul 20 02:20:24 vuurmuur kernel: 002: update_process_times+0x1f/0x50 Jul 20 02:20:24 vuurmuur kernel: 002: tick_sched_timer+0x5a/0x1c0 Jul 20 02:20:24 vuurmuur kernel: 002: ? tick_switch_to_oneshot.cold+0x74/0x74 Jul 20 02:20:24 vuurmuur kernel: 002: __hrtimer_run_queues+0xba/0x1b0 Jul 20 02:20:24 vuurmuur kernel: 002: hrtimer_interrupt+0x108/0x230 Jul 20 02:20:24 vuurmuur kernel: 002: smp_apic_timer_interrupt+0x61/0xa0 Jul 20 02:20:24 vuurmuur kernel: 002: apic_timer_interrupt+0xf/0x20 Jul 20 02:20:24 vuurmuur kernel: 002: </IRQ> Jul 20 02:20:24 vuurmuur kernel: 002: RIP: 0010:lock_release+0x114/0x1a0 Jul 20 02:20:24 vuurmuur kernel: 002: Code: 0a 50 6e 01 75 0e 44 2b 7c 24 08 44 39 bd a8 07 00 00 75 54 65 48 8b 04 25 00 5e 01 00 c7 80 ac 07 00 00 00 00 00 00 41 56 9d <48> 8b 44 24 10 65 48 33 04 25 28 00 00 00 75 73 48 8b 6c 24 18 4c Jul 20 02:20:24 vuurmuur kernel: 002: RSP: 0018:ffffc90001ebb8f0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 Jul 20 02:20:24 vuurmuur kernel: 002: RAX: ffff888054294140 RBX: 000000000000001f RCX: ffffc90001ebb8fc Jul 20 02:20:24 vuurmuur kernel: 002: RDX: 0000000000000000 RSI: ffffffff8203bee0 RDI: ffff8880542948f0 Jul 20 02:20:24 vuurmuur kernel: 002: RBP: ffff888054294140 R08: 0000000000020019 R09: 0000000000000000 Jul 20 02:20:24 vuurmuur kernel: 002: R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8203bee0 Jul 20 02:20:24 vuurmuur kernel: 002: R13: ffffffff8117b0a2 R14: 0000000000000246 R15: 0000000000000001 Jul 20 02:20:24 vuurmuur kernel: 002: ? do_select+0x132/0x7a0 Jul 20 02:20:24 vuurmuur kernel: 002: do_select+0x14f/0x7a0 Jul 20 02:20:24 vuurmuur kernel: 002: ? select_estimate_accuracy+0x100/0x100 Jul 20 02:20:24 vuurmuur kernel: 002: ? poll_select_finish+0x1d0/0x1d0 Jul 20 02:20:24 vuurmuur kernel: 002: ? poll_select_finish+0x1d0/0x1d0 Jul 20 02:20:24 vuurmuur kernel: 002: ? find_held_lock+0x2b/0x80 Jul 20 02:20:24 vuurmuur kernel: 002: ? put_prev_task_rt+0x22/0x140 Jul 20 02:20:24 vuurmuur kernel: 002: ? find_held_lock+0x2b/0x80 Jul 20 02:20:24 vuurmuur kernel: 002: ? __schedule+0x435/0x4f0 Jul 20 02:20:24 vuurmuur kernel: 002: ? _raw_spin_unlock_irq+0x1f/0x40 Jul 20 02:20:24 vuurmuur kernel: 002: ? __schedule+0x435/0x4f0 Jul 20 02:20:24 vuurmuur kernel: 002: ? preempt_schedule_irq+0x31/0x50 Jul 20 02:20:24 vuurmuur kernel: 002: ? retint_kernel+0x1b/0x1b Jul 20 02:20:24 vuurmuur kernel: 002: ? find_held_lock+0x2b/0x80 Jul 20 02:20:24 vuurmuur kernel: 002: ? core_sys_select+0x5c/0x380 Jul 20 02:20:24 vuurmuur kernel: 002: core_sys_select+0x1d0/0x380 Jul 20 02:20:24 vuurmuur kernel: 002: ? core_sys_select+0x5c/0x380 Jul 20 02:20:24 vuurmuur kernel: 002: ? tty_ldisc_ref_wait+0x27/0x70 Jul 20 02:20:24 vuurmuur kernel: 002: ? __ldsem_down_read_nested+0x5e/0x240 Jul 20 02:20:24 vuurmuur kernel: 002: ? find_held_lock+0x2b/0x80 Jul 20 02:20:24 vuurmuur kernel: 002: ? find_held_lock+0x2b/0x80 Jul 20 02:20:24 vuurmuur kernel: 002: ? set_user_sigmask+0x62/0x90 Jul 20 02:20:24 vuurmuur kernel: 002: __x64_sys_pselect6+0x141/0x190 Jul 20 02:20:24 vuurmuur kernel: 002: ? _raw_spin_unlock_irq+0x1f/0x40 Jul 20 02:20:24 vuurmuur kernel: 002: ? sigprocmask+0x6d/0x90 Jul 20 02:20:24 vuurmuur kernel: 002: do_syscall_64+0x77/0x440 Jul 20 02:20:24 vuurmuur kernel: 002: ? nmi_handle+0xc1/0xe0 Jul 20 02:20:24 vuurmuur kernel: 002: ? unregister_nmi_handler+0xb0/0xb0 Jul 20 02:20:24 vuurmuur kernel: 002: entry_SYSCALL_64_after_hwframe+0x44/0xa9 Jul 20 02:20:24 vuurmuur kernel: 002: RIP: 0033:0x7fba5ae5a096 Jul 20 02:20:24 vuurmuur kernel: 002: Code: e8 9f dd f8 ff 4c 8b 0c 24 4c 8b 44 24 08 89 c5 4c 8b 54 24 28 48 8b 54 24 20 b8 0e 01 00 00 48 8b 74 24 18 8b 7c 24 14 0f 05 <48> 3d 00 f0 ff ff 77 28 89 ef 89 04 24 e8 c8 dd f8 ff 8b 04 24 eb Jul 20 02:20:24 vuurmuur kernel: 002: RSP: 002b:00007ffcff164a10 EFLAGS: 00000293 ORIG_RAX: 000000000000010e Jul 20 02:20:24 vuurmuur kernel: 002: RAX: ffffffffffffffda RBX: 00005613a62cde78 RCX: 00007fba5ae5a096 Jul 20 02:20:24 vuurmuur kernel: 002: RDX: 0000000000000000 RSI: 00007ffcff164b00 RDI: 000000000000001f Jul 20 02:20:24 vuurmuur kernel: 002: RBP: 0000000000000000 R08: 0000000000000000 R09: 00007ffcff164a50 Jul 20 02:20:24 vuurmuur kernel: 002: R10: 0000000000000000 R11: 0000000000000293 R12: 00005613a62a9c43 Jul 20 02:20:24 vuurmuur kernel: 002: R13: 0000000000000009 R14: ffffffffffffffff R15: 00005613a62a9e1c Jul 20 02:20:24 vuurmuur kernel: rcu: INFO: rcu_preempt self-detected stall on CPU Jul 20 02:20:24 vuurmuur kernel: rcu: #0112-....: (5250 ticks this GP) idle=5c6/1/0x4000000000000002 softirq=401431806/401431806 fqs=2372 Jul 20 02:20:24 vuurmuur kernel: #011(t=5250 jiffies g=701989125 q=336) Jul 20 02:20:24 vuurmuur kernel: NMI backtrace for cpu 2 Jul 20 02:20:24 vuurmuur kernel: CPU: 2 PID: 3468730 Comm: ntpd Tainted: G O 5.4.13-rt7 #9 Jul 20 02:20:24 vuurmuur kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./QC5000M-ITX/PH, BIOS P1.10 05/06/2015 Jul 20 02:20:24 vuurmuur kernel: Call Trace: Jul 20 02:20:24 vuurmuur kernel: <IRQ> Jul 20 02:20:24 vuurmuur kernel: dump_stack+0x50/0x70 Jul 20 02:20:24 vuurmuur kernel: nmi_cpu_backtrace.cold+0x14/0x53 Jul 20 02:20:24 vuurmuur kernel: ? lapic_can_unplug_cpu.cold+0x3b/0x3b Jul 20 02:20:24 vuurmuur kernel: nmi_trigger_cpumask_backtrace+0x8e/0xa2 Jul 20 02:20:24 vuurmuur kernel: rcu_dump_cpu_stacks+0x8b/0xb9 Jul 20 02:20:24 vuurmuur kernel: rcu_sched_clock_irq.cold+0x17e/0x4fd Jul 20 02:20:24 vuurmuur kernel: ? account_system_index_time+0xa6/0xd0 Jul 20 02:20:24 vuurmuur kernel: update_process_times+0x1f/0x50 Jul 20 02:20:24 vuurmuur kernel: tick_sched_timer+0x5a/0x1c0 Jul 20 02:20:24 vuurmuur kernel: ? tick_switch_to_oneshot.cold+0x74/0x74 Jul 20 02:20:24 vuurmuur kernel: __hrtimer_run_queues+0xba/0x1b0 Jul 20 02:20:24 vuurmuur kernel: hrtimer_interrupt+0x108/0x230 Jul 20 02:20:24 vuurmuur kernel: smp_apic_timer_interrupt+0x61/0xa0 Jul 20 02:20:24 vuurmuur kernel: apic_timer_interrupt+0xf/0x20 Jul 20 02:20:24 vuurmuur kernel: </IRQ> Jul 20 02:20:24 vuurmuur kernel: RIP: 0010:lock_release+0x114/0x1a0 Jul 20 02:20:24 vuurmuur kernel: Code: 0a 50 6e 01 75 0e 44 2b 7c 24 08 44 39 bd a8 07 00 00 75 54 65 48 8b 04 25 00 5e 01 00 c7 80 ac 07 00 00 00 00 00 00 41 56 9d <48> 8b 44 24 10 65 48 33 04 25 28 00 00 00 75 73 48 8b 6c 24 18 4c Jul 20 02:20:24 vuurmuur kernel: RSP: 0018:ffffc90001ebb8f0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 Jul 20 02:20:24 vuurmuur kernel: RAX: ffff888054294140 RBX: 000000000000001f RCX: ffffc90001ebb8fc Jul 20 02:20:24 vuurmuur kernel: RDX: 0000000000000000 RSI: ffffffff8203bee0 RDI: ffff8880542948f0 Jul 20 02:20:24 vuurmuur kernel: RBP: ffff888054294140 R08: 0000000000020019 R09: 0000000000000000 Jul 20 02:20:24 vuurmuur kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff8203bee0 Jul 20 02:20:24 vuurmuur kernel: R13: ffffffff8117b0a2 R14: 0000000000000246 R15: 0000000000000001 Jul 20 02:20:24 vuurmuur kernel: ? do_select+0x132/0x7a0 Jul 20 02:20:24 vuurmuur kernel: do_select+0x14f/0x7a0 Jul 20 02:20:24 vuurmuur kernel: ? select_estimate_accuracy+0x100/0x100 Jul 20 02:20:24 vuurmuur kernel: ? poll_select_finish+0x1d0/0x1d0 Jul 20 02:20:24 vuurmuur kernel: ? poll_select_finish+0x1d0/0x1d0 Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80 Jul 20 02:20:24 vuurmuur kernel: ? put_prev_task_rt+0x22/0x140 Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80 Jul 20 02:20:24 vuurmuur kernel: ? __schedule+0x435/0x4f0 Jul 20 02:20:24 vuurmuur kernel: ? _raw_spin_unlock_irq+0x1f/0x40 Jul 20 02:20:24 vuurmuur kernel: ? __schedule+0x435/0x4f0 Jul 20 02:20:24 vuurmuur kernel: ? preempt_schedule_irq+0x31/0x50 Jul 20 02:20:24 vuurmuur kernel: ? retint_kernel+0x1b/0x1b Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80 Jul 20 02:20:24 vuurmuur kernel: ? core_sys_select+0x5c/0x380 Jul 20 02:20:24 vuurmuur kernel: core_sys_select+0x1d0/0x380 Jul 20 02:20:24 vuurmuur kernel: ? core_sys_select+0x5c/0x380 Jul 20 02:20:24 vuurmuur kernel: ? tty_ldisc_ref_wait+0x27/0x70 Jul 20 02:20:24 vuurmuur kernel: ? __ldsem_down_read_nested+0x5e/0x240 Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80 Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80 Jul 20 02:20:24 vuurmuur kernel: ? set_user_sigmask+0x62/0x90 Jul 20 02:20:24 vuurmuur kernel: __x64_sys_pselect6+0x141/0x190 Jul 20 02:20:24 vuurmuur kernel: ? _raw_spin_unlock_irq+0x1f/0x40 Jul 20 02:20:24 vuurmuur kernel: ? sigprocmask+0x6d/0x90 Jul 20 02:20:24 vuurmuur kernel: do_syscall_64+0x77/0x440 Jul 20 02:20:24 vuurmuur kernel: ? nmi_handle+0xc1/0xe0 Jul 20 02:20:24 vuurmuur kernel: ? unregister_nmi_handler+0xb0/0xb0 Jul 20 02:20:24 vuurmuur kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 Jul 20 02:20:24 vuurmuur kernel: RIP: 0033:0x7fba5ae5a096 Jul 20 02:20:24 vuurmuur kernel: Code: e8 9f dd f8 ff 4c 8b 0c 24 4c 8b 44 24 08 89 c5 4c 8b 54 24 28 48 8b 54 24 20 b8 0e 01 00 00 48 8b 74 24 18 8b 7c 24 14 0f 05 <48> 3d 00 f0 ff ff 77 28 89 ef 89 04 24 e8 c8 dd f8 ff 8b 04 24 eb Jul 20 02:20:24 vuurmuur kernel: RSP: 002b:00007ffcff164a10 EFLAGS: 00000293 ORIG_RAX: 000000000000010e Jul 20 02:20:24 vuurmuur kernel: RAX: ffffffffffffffda RBX: 00005613a62cde78 RCX: 00007fba5ae5a096 Jul 20 02:20:24 vuurmuur kernel: RDX: 0000000000000000 RSI: 00007ffcff164b00 RDI: 000000000000001f Jul 20 02:20:24 vuurmuur kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 00007ffcff164a50 Jul 20 02:20:24 vuurmuur kernel: R10: 0000000000000000 R11: 0000000000000293 R12: 00005613a62a9c43 Jul 20 02:20:24 vuurmuur kernel: R13: 0000000000000009 R14: ffffffffffffffff R15: 00005613a62a9e1c Jul 20 05:03:01 vuurmuur named[2363288]: received control channel command 'flush' Again it appears to concern ntpd. Still on 5.4.13-rt7. Should I boot into a newer kernel to verify the issue is still there? Kind regards, Udo On 07-07-2020 19:47, Sebastian Andrzej Siewior wrote:
On 2020-07-04 05:43:32 [+0200], Udo van den Heuvel wrote:quoted
quoted
I *think* this happened within the loop in __fget_files(). This function is inlined by __fget_light() and the loop has a RCU-section so it would make sense. Do you run something at an elevated priority in the system?ntpd (ntpsec) elevates itself.To the highest possible. Wonderful.quoted
quoted
I don't know what the other part was doing but somehow one of the file descriptors (network sockets probably) was about to be closed while the other side tried to poll() on it.Thanks for explaining.I'm puzzled. On dup() and otherwise on replacement / closing the file pointer is replaced (with an incremented reference count or NULL ) and then the reference count of the old file pointer is decremented. So the zero count could be observed once and then the new pointer should be seen. So it should not loop for seconds. And then, dup2() is only used at startup.