Re: INFO: rcu detected stall in ext4_write_checks

From: Paul E. McKenney <hidden>
Date: 2019-07-07 01:17:06
Also in: linux-ext4, lkml

On Sat, Jul 06, 2019 at 11:03:11AM -0700, Paul E. McKenney wrote:

On Sat, Jul 06, 2019 at 11:02:26AM -0400, Theodore Ts'o wrote:

quoted

On Fri, Jul 05, 2019 at 11:16:31PM -0700, Paul E. McKenney wrote:

quoted

I suppose RCU could take the dueling-banjos approach and use increasingly
aggressive scheduler policies itself, up to and including SCHED_DEADLINE,
until it started getting decent forward progress.  However, that
sounds like something that just might have unintended consequences,
particularly if other kernel subsystems were to also play similar
games of dueling banjos.

So long as the RCU threads are well-behaved, using SCHED_DEADLINE
shouldn't have much of an impact on the system --- and the scheduling
parameters that you can specify on SCHED_DEADLINE allow you to
specify the worst-case impact on the system while also guaranteeing
that the SCHED_DEADLINE tasks will run in the first place.  After all,
that's the whole point of SCHED_DEADLINE.

So I wonder if the right approach is this: during the first userspace
system call to sched_setattr that enables a (any) real-time priority
scheduler (SCHED_DEADLINE, SCHED_FIFO or SCHED_RR) on a userspace
thread, and before that call is allowed to proceed, the RCU kernel
threads are promoted to SCHED_DEADLINE with appropriately set deadline
parameters.  That way, a root user won't be able to shoot the system
in the foot, and since the vast majority of the time, there shouldn't
be any processes running with real-time priorities, we won't be
changing the behavior of a normal server system.

It might well be.  However, running the RCU kthreads at real-time
priority does not come for free.  For example, it tends to crank up the
context-switch rate.

Plus I have taken several runs at computing SCHED_DEADLINE parameters,
but things like the rcuo callback-offload threads have computational
requirements that are controlled not by RCU, and not just by the rest of
the kernel, but also by userspace (keeping in mind the example of opening
and closing a file in a tight loop, each pass of which queues a callback).
I suspect that RCU is not the only kernel subsystem whose computational
requirements are set not by the subsystem, but rather by external code.
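
For concreteness, the userspace pattern mentioned above (opening and
closing a file in a tight loop) can be sketched as follows.  This is
only an illustration, not a reproducer: the function name, the choice
of os.devnull as the file, and the iteration count are assumptions made
for the example.

```python
import os


def open_close_loop(iterations, path=os.devnull):
    """Open and close `path` repeatedly; return the number of passes completed.

    Per the discussion above, each pass is expected to queue an RCU
    callback in the kernel, so the callback load on the rcuo threads is
    set by this loop rather than by RCU itself.
    """
    completed = 0
    for _ in range(iterations):
        fd = os.open(path, os.O_RDONLY)  # hypothetical target file
        os.close(fd)
        completed += 1
    return completed
```

Calling open_close_loop() with a large iteration count from a
high-priority thread is the kind of load whose cost RCU cannot bound
in advance.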

OK, OK, I suppose I could just set insanely large SCHED_DEADLINE
parameters, following syzkaller's example, and then trust my ability to
keep the RCU code from abusing the resulting awesome power.  But wouldn't
a much nicer approach be to put SCHED_DEADLINE between SCHED_RR/SCHED_FIFO
priorities 98 and 99 or some such?  Then the same (admittedly somewhat
scary) result could be obtained much more simply via SCHED_FIFO or
SCHED_RR priority 99.

Some might argue that this is one of those situations where simplicity
is not necessarily an advantage, but then again, you can find someone
who will complain about almost anything.  ;-)
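
To make the ordering in question concrete, here is a toy model (not
kernel code).  It assumes that today a runnable SCHED_DEADLINE task is
picked ahead of every SCHED_FIFO/SCHED_RR task, and models the
suggestion above as slotting SCHED_DEADLINE between real-time
priorities 98 and 99.  The function names, the policy strings, and the
numeric rank values are invented for illustration.

```python
RT_POLICIES = ("fifo", "rr")


def _check_rt_priority(rt_priority):
    # SCHED_FIFO/SCHED_RR priorities run from 1 (lowest) to 99 (highest).
    if not 1 <= rt_priority <= 99:
        raise ValueError("real-time priority must be in 1..99")


def rank_current(policy, rt_priority=0):
    """Larger rank runs first; SCHED_DEADLINE outranks all RT priorities."""
    if policy == "deadline":
        return 100.0  # illustrative value above priority 99
    if policy in RT_POLICIES:
        _check_rt_priority(rt_priority)
        return float(rt_priority)
    return 0.0  # non-real-time policies


def rank_proposed(policy, rt_priority=0):
    """Same model, with SCHED_DEADLINE placed between priorities 98 and 99."""
    if policy == "deadline":
        return 98.5  # illustrative value between 98 and 99
    if policy in RT_POLICIES:
        _check_rt_priority(rt_priority)
        return float(rt_priority)
    return 0.0
```

Under rank_proposed(), a priority-99 SCHED_FIFO or SCHED_RR thread
outranks every SCHED_DEADLINE task, which is the "much more simply"
part of the suggestion.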

quoted

(I suspect there might be some audio applications that might try to
set real-time priorities, but for desktop systems, it's probably more
important that the system not tie itself into knots since the
average desktop user isn't going to be well equipped to debug the
problem.)

Not only that, but if core counts continue to increase, and if reliance
on cloud computing continues to grow, there are going to be an increasing
variety of mixed workloads in increasingly less-controlled environments.

So, yes, it would be good to solve this problem in some reasonable way.

I don't see this as urgent just yet, but I am sure you all will let
me know if I am mistaken on that point.

quoted

Alternatively, is it possible to provide stricter admission control?

I think that's an orthogonal issue; better admission control would be
nice, but it looks to me that it's going to be fundamentally an issue
of tweaking heuristics, and a fool-proof solution that will protect
against all malicious userspace applications (including syzkaller) is
going to require solving the halting problem.  So while it would be
nice to improve the admission control, I don't think that's going to
be a general solution.

Agreed, and my earlier point about the need to trust the coding abilities
of those writing ultimate-priority code is all too consistent with your
point about needing to solve the halting problem.  Nevertheless, I believe
that we could make something that worked reasonably well in practice.

Here are a few components of a possible solution, in practice, but
of course not in theory:

1.	We set limits to SCHED_DEADLINE parameters, perhaps novel ones.
	For one example, insist on (say) 10 milliseconds of idle time
	every second on each CPU.  Yes, you can configure beyond that
	given sufficient permissions, but if you do so, you just voided
	your warranty.

2.	Only allow SCHED_DEADLINE on nohz_full CPUs.  (Partial solution,
	given that such a CPU might be running in the kernel or have
	more than one runnable task.  Just for fun, I will suggest the
	option of disabling SCHED_DEADLINE during such times.)

3.	RCU detects slowdowns, and does something TBD to increase its
	priority, but only while the slowdown persists.  This likely
	relies on scheduling-clock interrupts to detect the slowdowns,
	so there might be additional challenges on a fully nohz_full
	system.

4.	SCHED_DEADLINE treats the other three scheduling classes as each
	having a period, deadline, and a modest CPU consumption budget
	for the members of the class in aggregate.  But this has to have
	been discussed before.  How did that go?

5.	Your idea here.
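
As a rough illustration of item 1, the check below rejects a set of
SCHED_DEADLINE reservations on one CPU if they would leave less than
the required idle time in each second.  It is a sketch only: the
function name, the (runtime, period) pairs in microseconds, and the use
of a plain sum of runtime/period as the utilization estimate are
assumptions, and the 10-millisecond default is just the example figure
from the list.

```python
from fractions import Fraction

US_PER_SEC = 1_000_000


def leaves_enough_idle(reservations_us, min_idle_us_per_sec=10_000):
    """Return True if the reservations leave the required idle time.

    reservations_us: iterable of (runtime_us, period_us) pairs for the
    SCHED_DEADLINE tasks on a single CPU (hypothetical representation).
    """
    # Largest fraction of the CPU that the reservations may claim.
    cap = Fraction(US_PER_SEC - min_idle_us_per_sec, US_PER_SEC)
    total = Fraction(0)
    for runtime_us, period_us in reservations_us:
        if period_us <= 0 or not 0 <= runtime_us <= period_us:
            raise ValueError("need period > 0 and 0 <= runtime <= period")
        total += Fraction(runtime_us, period_us)
    return total <= cap
```

Exact fractions are used so that a request sitting precisely on the
limit is not rejected because of floating-point rounding.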

							Thanx, Paul
