Re: RCU stalls on heavy socket traffic

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: 2015-03-09 16:39:21
Also in: lkml

* Paul E. McKenney | 2015-03-09 08:59:47 [-0700]:

On Mon, Mar 09, 2015 at 04:26:51PM +0100, Sebastian Andrzej Siewior wrote:

quoted

So I run "hackbench -g 600 -l 350 -s 250" which takes approx 77 seconds
to complete. Then this popped up:

|INFO: rcu_preempt detected stalls on CPUs/tasks:
| Tasks blocked on level-0 rcu_node (CPUs 0-3): P24858 P28514 P25185 P25184 P28713 P19549 P3139 P25275 P28474 P29062 P6703 P10 106 P14309 P27910 P4514 P14834 P28385 P21073 P27701 P642 P10340 P16939 P19147 P16949 P16945 P16952 P16941 P16937 P16938 P16946 P16954 P28664  P18701 P17782 P4875 P8873
| Tasks blocked on level-0 rcu_node (CPUs 0-3): P24858 P28514 P25185 P25184 P28713 P19549 P3139 P25275 P28474 P29062 P6703 P10 106 P14309 P27910 P4514 P14834 P28385 P21073 P27701 P642 P10340 P16939 P19147 P16949 P16945 P16952 P16941 P16937 P16938 P16946 P16954 P28664  P18701 P17782 P4875 P8873

Definitely lots of tasks recently preempted within RCU read-side
critical sections!

For example, the P24858 says that the task with PID 24858 was preempted
(or blocked) while within an RCU read-side critical section that blocks
the current grace period, and that this task has not yet exited that
critical section.  This might be because that task has been preempted
throughout by a higher-priority task, that there is an infinite loop in
that critical section, or that the task is blocked on a spinlock that
is being held indefinitely by some other task (possibly because that
other task is being indefinitely preempted).

From what it looks in the backtrace of each task is that it got
preempted within the RCU read section (once hackbach is done, everything
gets back to normal). Nothing loops indefinitely. Once hackbench is done
with sending messages, everything gets back to normal. I'm just not sure
if everything in terms of RCU works okay on -RT or of something is
missing. But as you suggested that it is rcurc/$CPU job to do this, I
will check if this process runs…

quoted

| (detected by 0, t=5252 jiffies, g=2385, c=2384, q=41180)
|hackbench       R  running task        0 24858   1995 0x00000000
| ffff880058d9bb18 0000000000000082 0000000000012e40 ffff880058d9bfd8
| 0000000000012e40 ffff8804226627c0 ffff880058d9bb38 ffff880058d9bfd8
| 0000000000000001 ffff880058d5ccf8 0000000000000292 0000000000000002
|Call Trace:
| [<ffffffff81568e9f>] preempt_schedule+0x3f/0x60
| [<ffffffff812d84d7>] ___preempt_schedule+0x35/0x67
| [<ffffffff8156b725>] ? _raw_spin_unlock_irqrestore+0x25/0x30
| [<ffffffff81088503>] try_to_wake_up+0x63/0x2f0
| [<ffffffff8108881d>] default_wake_function+0xd/0x10
| [<ffffffff8109b011>] autoremove_wake_function+0x11/0x40
| [<ffffffff8109aa05>] __wake_up_common+0x55/0x90
| [<ffffffff8109afd3>] __wake_up_sync_key+0x43/0x60
| [<ffffffff8145c95e>] sock_def_readable+0x3e/0x70
| [<ffffffff8150f9d1>] unix_stream_sendmsg+0x211/0x470
| [<ffffffff81458b48>] sock_aio_write+0xf8/0x120
| [<ffffffff8109ec19>] ? rt_up_read+0x19/0x20
| [<ffffffff8116e855>] do_sync_write+0x55/0x90
| [<ffffffff8116f435>] vfs_write+0x175/0x1f0
| [<ffffffff8116fde4>] SyS_write+0x44/0xb0
| [<ffffffff8156c1ed>] system_call_fastpath+0x16/0x1b

The other processes look more or less the same. I have the full splat
here [0]. My understanding is that sock_def_readable() does a
rcu_read_lock() which forbids a grace period. Since there are many
processes (preempted) in this section, the grace period never starts
since it never blocks new readers from getting into a read critical
section.
Is my understanding correct so far? Is it likely that -RT does not do
something correctly to forbid such a situation or is it more or less
"expected" ?

[0] https://breakpoint.cc/rt-rcu-stall.txt

If you don't have CONFIG_RCU_BOOST=y with priority set above that of
any CPU-bound components of your workload, then yes, this is expected
behavior.  The priority boosting will be inherited, so that this covers
the case of indefinite preemption.  Of course, this won't help with
infinite loops.

I have
 CONFIG_RCU_BOOST=y
 CONFIG_RCU_BOOST_PRIO=6

and hackbach is a normal user process. How does this suppose to work? I
have five tasks preempted in RCU read section. Then rcuc gets scheduled
somehow. It boosts the five tasks one after the other until they get out
of the RCU section. And since nothing has a higher priority than rcuc
nothing can get on the CPU and into the RCU section again.
What happens if the task got preempted because it was waiting for
something? Say for a lock which was hold by another task on another CPU?
It would then boost the task on the other CPU while this CPU would
become "idle" and allow a task with a lower priority to get into RCU
read section again.
rcuc gets on the CPU again after the 5 tasks left the critical section
but a new make it into the section. Is it possible and if so what would
it do then?

That said, someone was having trouble with the rcuoc kthreads that are
generated by CONFIG_RCU_BOOST=y when running NO_HZ_FULL.  So there might
need to be different kernel configs for different workloads.  Or we might
need some way to enable and disable boosting at boot time, I suppose.

I think I saw this with CONFIG_NO_HZ_IDLE, too.

						Thanx, Paul

Sebastian

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help