Thread (5 messages) 5 messages, 2 authors, 2020-07-20

Re: 5.4.13-rt7 stall on CPU?

From: Udo van den Heuvel <hidden>
Date: 2020-07-20 09:21:10

Hello Sebastian, all,

I found another one of these in my logs:

Jul 19 05:33:05 vuurmuur named[2363288]: dispatch 0x7f1e3d7fa6c0:
shutting down due to TCP receive error: 2001:500:200::b#53: connection reset
Jul 20 02:20:24 vuurmuur kernel: 002: rcu: INFO: rcu_preempt
self-detected stall on CPU
Jul 20 02:20:24 vuurmuur kernel: 002: rcu:      2-....: (5250 ticks this
GP) idle=5c6/1/0x4000000000000002 softirq=401431806/401431806 fqs=2372
Jul 20 02:20:24 vuurmuur kernel: 002:   (t=5250 jiffies g=701989125 q=336)
Jul 20 02:20:24 vuurmuur kernel: 002: NMI backtrace for cpu 2
Jul 20 02:20:24 vuurmuur kernel: 002: CPU: 2 PID: 3468730 Comm: ntpd
Tainted: G           O      5.4.13-rt7 #9
Jul 20 02:20:24 vuurmuur kernel: 002: Hardware name: To Be Filled By
O.E.M. To Be Filled By O.E.M./QC5000M-ITX/PH, BIOS P1.10 05/06/2015
Jul 20 02:20:24 vuurmuur kernel: 002: Call Trace:
Jul 20 02:20:24 vuurmuur kernel: 002:  <IRQ>
Jul 20 02:20:24 vuurmuur kernel: 002:  dump_stack+0x50/0x70
Jul 20 02:20:24 vuurmuur kernel: 002:  nmi_cpu_backtrace.cold+0x14/0x53
Jul 20 02:20:24 vuurmuur kernel: 002:  ? lapic_can_unplug_cpu.cold+0x3b/0x3b
Jul 20 02:20:24 vuurmuur kernel: 002:
nmi_trigger_cpumask_backtrace+0x8e/0xa2
Jul 20 02:20:24 vuurmuur kernel: 002:  rcu_dump_cpu_stacks+0x8b/0xb9
Jul 20 02:20:24 vuurmuur kernel: 002:  rcu_sched_clock_irq.cold+0x17e/0x4fd
Jul 20 02:20:24 vuurmuur kernel: 002:  ? account_system_index_time+0xa6/0xd0
Jul 20 02:20:24 vuurmuur kernel: 002:  update_process_times+0x1f/0x50
Jul 20 02:20:24 vuurmuur kernel: 002:  tick_sched_timer+0x5a/0x1c0
Jul 20 02:20:24 vuurmuur kernel: 002:  ?
tick_switch_to_oneshot.cold+0x74/0x74
Jul 20 02:20:24 vuurmuur kernel: 002:  __hrtimer_run_queues+0xba/0x1b0
Jul 20 02:20:24 vuurmuur kernel: 002:  hrtimer_interrupt+0x108/0x230
Jul 20 02:20:24 vuurmuur kernel: 002:  smp_apic_timer_interrupt+0x61/0xa0
Jul 20 02:20:24 vuurmuur kernel: 002:  apic_timer_interrupt+0xf/0x20
Jul 20 02:20:24 vuurmuur kernel: 002:  </IRQ>
Jul 20 02:20:24 vuurmuur kernel: 002: RIP: 0010:lock_release+0x114/0x1a0
Jul 20 02:20:24 vuurmuur kernel: 002: Code: 0a 50 6e 01 75 0e 44 2b 7c
24 08 44 39 bd a8 07 00 00 75 54 65 48 8b 04 25 00 5e 01 00 c7 80 ac 07
00 00 00 00 00 00 41 56 9d <48> 8b 44 24 10 65 48 33 04 25 28 00 00 00
75 73 48 8b 6c 24 18 4c
Jul 20 02:20:24 vuurmuur kernel: 002: RSP: 0018:ffffc90001ebb8f0 EFLAGS:
00000246 ORIG_RAX: ffffffffffffff13
Jul 20 02:20:24 vuurmuur kernel: 002: RAX: ffff888054294140 RBX:
000000000000001f RCX: ffffc90001ebb8fc
Jul 20 02:20:24 vuurmuur kernel: 002: RDX: 0000000000000000 RSI:
ffffffff8203bee0 RDI: ffff8880542948f0
Jul 20 02:20:24 vuurmuur kernel: 002: RBP: ffff888054294140 R08:
0000000000020019 R09: 0000000000000000
Jul 20 02:20:24 vuurmuur kernel: 002: R10: 0000000000000001 R11:
0000000000000000 R12: ffffffff8203bee0
Jul 20 02:20:24 vuurmuur kernel: 002: R13: ffffffff8117b0a2 R14:
0000000000000246 R15: 0000000000000001
Jul 20 02:20:24 vuurmuur kernel: 002:  ? do_select+0x132/0x7a0
Jul 20 02:20:24 vuurmuur kernel: 002:  do_select+0x14f/0x7a0
Jul 20 02:20:24 vuurmuur kernel: 002:  ?
select_estimate_accuracy+0x100/0x100
Jul 20 02:20:24 vuurmuur kernel: 002:  ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? put_prev_task_rt+0x22/0x140
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: 002:  ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? preempt_schedule_irq+0x31/0x50
Jul 20 02:20:24 vuurmuur kernel: 002:  ? retint_kernel+0x1b/0x1b
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: 002:  core_sys_select+0x1d0/0x380
Jul 20 02:20:24 vuurmuur kernel: 002:  ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: 002:  ? tty_ldisc_ref_wait+0x27/0x70
Jul 20 02:20:24 vuurmuur kernel: 002:  ? __ldsem_down_read_nested+0x5e/0x240
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: 002:  ? set_user_sigmask+0x62/0x90
Jul 20 02:20:24 vuurmuur kernel: 002:  __x64_sys_pselect6+0x141/0x190
Jul 20 02:20:24 vuurmuur kernel: 002:  ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: 002:  ? sigprocmask+0x6d/0x90
Jul 20 02:20:24 vuurmuur kernel: 002:  do_syscall_64+0x77/0x440
Jul 20 02:20:24 vuurmuur kernel: 002:  ? nmi_handle+0xc1/0xe0
Jul 20 02:20:24 vuurmuur kernel: 002:  ? unregister_nmi_handler+0xb0/0xb0
Jul 20 02:20:24 vuurmuur kernel: 002:
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 20 02:20:24 vuurmuur kernel: 002: RIP: 0033:0x7fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: 002: Code: e8 9f dd f8 ff 4c 8b 0c 24
4c 8b 44 24 08 89 c5 4c 8b 54 24 28 48 8b 54 24 20 b8 0e 01 00 00 48 8b
74 24 18 8b 7c 24 14 0f 05 <48> 3d 00 f0 ff ff 77 28 89 ef 89 04 24 e8
c8 dd f8 ff 8b 04 24 eb
Jul 20 02:20:24 vuurmuur kernel: 002: RSP: 002b:00007ffcff164a10 EFLAGS:
00000293 ORIG_RAX: 000000000000010e
Jul 20 02:20:24 vuurmuur kernel: 002: RAX: ffffffffffffffda RBX:
00005613a62cde78 RCX: 00007fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: 002: RDX: 0000000000000000 RSI:
00007ffcff164b00 RDI: 000000000000001f
Jul 20 02:20:24 vuurmuur kernel: 002: RBP: 0000000000000000 R08:
0000000000000000 R09: 00007ffcff164a50
Jul 20 02:20:24 vuurmuur kernel: 002: R10: 0000000000000000 R11:
0000000000000293 R12: 00005613a62a9c43
Jul 20 02:20:24 vuurmuur kernel: 002: R13: 0000000000000009 R14:
ffffffffffffffff R15: 00005613a62a9e1c
Jul 20 02:20:24 vuurmuur kernel: rcu: INFO: rcu_preempt self-detected
stall on CPU
Jul 20 02:20:24 vuurmuur kernel: rcu: #0112-....: (5250 ticks this GP)
idle=5c6/1/0x4000000000000002 softirq=401431806/401431806 fqs=2372
Jul 20 02:20:24 vuurmuur kernel: #011(t=5250 jiffies g=701989125 q=336)
Jul 20 02:20:24 vuurmuur kernel: NMI backtrace for cpu 2
Jul 20 02:20:24 vuurmuur kernel: CPU: 2 PID: 3468730 Comm: ntpd Tainted:
G           O      5.4.13-rt7 #9

Jul 20 02:20:24 vuurmuur kernel: Hardware name: To Be Filled By O.E.M.
To Be Filled By O.E.M./QC5000M-ITX/PH, BIOS P1.10 05/06/2015
Jul 20 02:20:24 vuurmuur kernel: Call Trace:
Jul 20 02:20:24 vuurmuur kernel: <IRQ>
Jul 20 02:20:24 vuurmuur kernel: dump_stack+0x50/0x70
Jul 20 02:20:24 vuurmuur kernel: nmi_cpu_backtrace.cold+0x14/0x53
Jul 20 02:20:24 vuurmuur kernel: ? lapic_can_unplug_cpu.cold+0x3b/0x3b
Jul 20 02:20:24 vuurmuur kernel: nmi_trigger_cpumask_backtrace+0x8e/0xa2
Jul 20 02:20:24 vuurmuur kernel: rcu_dump_cpu_stacks+0x8b/0xb9
Jul 20 02:20:24 vuurmuur kernel: rcu_sched_clock_irq.cold+0x17e/0x4fd
Jul 20 02:20:24 vuurmuur kernel: ? account_system_index_time+0xa6/0xd0
Jul 20 02:20:24 vuurmuur kernel: update_process_times+0x1f/0x50
Jul 20 02:20:24 vuurmuur kernel: tick_sched_timer+0x5a/0x1c0
Jul 20 02:20:24 vuurmuur kernel: ? tick_switch_to_oneshot.cold+0x74/0x74
Jul 20 02:20:24 vuurmuur kernel: __hrtimer_run_queues+0xba/0x1b0
Jul 20 02:20:24 vuurmuur kernel: hrtimer_interrupt+0x108/0x230
Jul 20 02:20:24 vuurmuur kernel: smp_apic_timer_interrupt+0x61/0xa0
Jul 20 02:20:24 vuurmuur kernel: apic_timer_interrupt+0xf/0x20
Jul 20 02:20:24 vuurmuur kernel: </IRQ>
Jul 20 02:20:24 vuurmuur kernel: RIP: 0010:lock_release+0x114/0x1a0
Jul 20 02:20:24 vuurmuur kernel: Code: 0a 50 6e 01 75 0e 44 2b 7c 24 08
44 39 bd a8 07 00 00 75 54 65 48 8b 04 25 00 5e 01 00 c7 80 ac 07 00 00
00 00 00 00 41 56 9d <48> 8b 44 24 10 65 48 33 04 25 28 00 00 00 75 73
48 8b 6c 24 18 4c
Jul 20 02:20:24 vuurmuur kernel: RSP: 0018:ffffc90001ebb8f0 EFLAGS:
00000246 ORIG_RAX: ffffffffffffff13
Jul 20 02:20:24 vuurmuur kernel: RAX: ffff888054294140 RBX:
000000000000001f RCX: ffffc90001ebb8fc
Jul 20 02:20:24 vuurmuur kernel: RDX: 0000000000000000 RSI:
ffffffff8203bee0 RDI: ffff8880542948f0
Jul 20 02:20:24 vuurmuur kernel: RBP: ffff888054294140 R08:
0000000000020019 R09: 0000000000000000
Jul 20 02:20:24 vuurmuur kernel: R10: 0000000000000001 R11:
0000000000000000 R12: ffffffff8203bee0
Jul 20 02:20:24 vuurmuur kernel: R13: ffffffff8117b0a2 R14:
0000000000000246 R15: 0000000000000001
Jul 20 02:20:24 vuurmuur kernel: ? do_select+0x132/0x7a0
Jul 20 02:20:24 vuurmuur kernel: do_select+0x14f/0x7a0
Jul 20 02:20:24 vuurmuur kernel: ? select_estimate_accuracy+0x100/0x100
Jul 20 02:20:24 vuurmuur kernel: ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: ? poll_select_finish+0x1d0/0x1d0
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? put_prev_task_rt+0x22/0x140
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: ? __schedule+0x435/0x4f0
Jul 20 02:20:24 vuurmuur kernel: ? preempt_schedule_irq+0x31/0x50
Jul 20 02:20:24 vuurmuur kernel: ? retint_kernel+0x1b/0x1b
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: core_sys_select+0x1d0/0x380
Jul 20 02:20:24 vuurmuur kernel: ? core_sys_select+0x5c/0x380
Jul 20 02:20:24 vuurmuur kernel: ? tty_ldisc_ref_wait+0x27/0x70
Jul 20 02:20:24 vuurmuur kernel: ? __ldsem_down_read_nested+0x5e/0x240
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? find_held_lock+0x2b/0x80
Jul 20 02:20:24 vuurmuur kernel: ? set_user_sigmask+0x62/0x90
Jul 20 02:20:24 vuurmuur kernel: __x64_sys_pselect6+0x141/0x190
Jul 20 02:20:24 vuurmuur kernel: ? _raw_spin_unlock_irq+0x1f/0x40
Jul 20 02:20:24 vuurmuur kernel: ? sigprocmask+0x6d/0x90
Jul 20 02:20:24 vuurmuur kernel: do_syscall_64+0x77/0x440
Jul 20 02:20:24 vuurmuur kernel: ? nmi_handle+0xc1/0xe0
Jul 20 02:20:24 vuurmuur kernel: ? unregister_nmi_handler+0xb0/0xb0
Jul 20 02:20:24 vuurmuur kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Jul 20 02:20:24 vuurmuur kernel: RIP: 0033:0x7fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: Code: e8 9f dd f8 ff 4c 8b 0c 24 4c 8b
44 24 08 89 c5 4c 8b 54 24 28 48 8b 54 24 20 b8 0e 01 00 00 48 8b 74 24
18 8b 7c 24 14 0f 05 <48> 3d 00 f0 ff ff 77 28 89 ef 89 04 24 e8 c8 dd
f8 ff 8b 04 24 eb
Jul 20 02:20:24 vuurmuur kernel: RSP: 002b:00007ffcff164a10 EFLAGS:
00000293 ORIG_RAX: 000000000000010e
Jul 20 02:20:24 vuurmuur kernel: RAX: ffffffffffffffda RBX:
00005613a62cde78 RCX: 00007fba5ae5a096
Jul 20 02:20:24 vuurmuur kernel: RDX: 0000000000000000 RSI:
00007ffcff164b00 RDI: 000000000000001f
Jul 20 02:20:24 vuurmuur kernel: RBP: 0000000000000000 R08:
0000000000000000 R09: 00007ffcff164a50
Jul 20 02:20:24 vuurmuur kernel: R10: 0000000000000000 R11:
0000000000000293 R12: 00005613a62a9c43
Jul 20 02:20:24 vuurmuur kernel: R13: 0000000000000009 R14:
ffffffffffffffff R15: 00005613a62a9e1c
Jul 20 05:03:01 vuurmuur named[2363288]: received control channel
command 'flush'

Again it appears to concern ntpd.
Still on 5.4.13-rt7.

Should I boot into a newer kernel to verify the issue is still there?

Kind regards,
Udo


On 07-07-2020 19:47, Sebastian Andrzej Siewior wrote:
On 2020-07-04 05:43:32 [+0200], Udo van den Heuvel wrote:
quoted
quoted
I *think* this happened within the loop in __fget_files(). This function
is inlined by __fget_light() and the loop has a RCU-section so it would
make sense.
Do you run something at an elevated priority in the system? 
ntpd (ntpsec) elevates itself.
To the highest possible. Wonderful.
quoted
quoted
I don't know
what the other part was doing but somehow one of the file descriptors
(network sockets probably) was about to be closed while the other side
tried to poll() on it.
Thanks for explaining.
I'm puzzled. On dup() and otherwise on replacement / closing the file
pointer is replaced (with an incremented reference count or NULL ) and
then the reference count of the old file pointer is decremented. So the
zero count could be observed once and then the new pointer should be
seen. So it should not loop for seconds. And then, dup2() is only used
at startup.
  
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help