Re: INFO: rcu detected stall in ext4_write_checks

From: Dmitry Vyukov <dvyukov@google.com>
Date: 2019-07-15 13:29:47
Also in: linux-ext4, lkml

On Mon, Jul 15, 2019 at 3:01 PM Paul E. McKenney [off-list ref] wrote:

On Sun, Jul 14, 2019 at 08:10:27PM -0700, Paul E. McKenney wrote:

quoted

On Sun, Jul 14, 2019 at 12:29:51PM -0700, Paul E. McKenney wrote:

quoted

On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:

quoted

On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:

quoted

But short term I don't see any other solution than stop testing
sched_setattr because it does not check arguments enough to prevent
system misbehavior. Which is a pity because syzkaller has found some
bad misconfigurations that were oversight on checking side.
Any other suggestions?

Or maybe syzkaller can put its own limitations on what parameters are
sent to sched_setattr? In practice, there are any number of ways a
root user can shoot themselves in the foot when using sched_setattr or
sched_setaffinity, for that matter. I imagine there must be some such
constraints already --- or else syzkaller might have set a kernel
thread to run with priority SCHED_BATCH, with similar catastrophic
effects --- or do similar configurations to make system threads
completely unschedulable.

Real time administrators who know what they are doing --- and who know
that their real-time threads are well behaved --- will always want to
be able to do things that will be catastrophic if the real-time thread
is *not* well behaved. I don't it is possible to add safety checks
which would allow the kernel to automatically detect and reject unsafe
configurations.

An apt analogy might be civilian versus military aircraft. Most
airplanes are designed to be "inherently stable"; that way, modulo
buggy/insane control systems like on the 737 Max, the airplane will
automatically return to straight and level flight. On the other hand,
some military planes (for example, the F-16, F-22, F-36, the
Eurofighter, etc.) are sometimes designed to be unstable, since that
way they can be more maneuverable.

There are use cases for real-time Linux where this flexibility/power
vs. stability tradeoff is going to argue for giving root the
flexibility to crash the system. Some of these systems might
literally involve using real-time Linux in military applications,
something for which Paul and I have had some experience. :-)

Speaking of sched_setaffinity, one thing which we can do is have
syzkaller move all of the system threads to they run on the "system
CPU's", and then move the syzkaller processes which are testing the
kernel to be on the "system under test CPU's". Then regardless of
what priority the syzkaller test programs try to run themselves at,
they can't crash the system.

Some real-time systems do actually run this way, and it's a
recommended configuration which is much safer than letting the
real-time threads take over the whole system:

http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application

Good point!  We might still have issues with some per-CPU kthreads,
but perhaps use of nohz_full would help at least reduce these sorts
of problems.  (There could still be issues on CPUs with more than
one runnable threads.)

I looked at testing limitations in a bit more detail from an RCU
viewpoint, and came up with the following rough rule of thumb (which of
course might or might not survive actual testing experience, but should at
least be a good place to start).  I believe that the sched_setaffinity()
testing rule should be that the SCHED_DEADLINE cycle be no more than
two-thirds of the RCU CPU stall warning timeout, which defaults to 21
seconds in mainline and 60 seconds in many distro kernels.

That is, the SCHED_DEADLINE cycle should never exceed 14 seconds when
testing mainline on the one hand or 40 seconds when testing enterprise
distros on the other.

This assumes quite a bit, though:

o     The system has ample memory to spare, and isn't running a
      callback-hungry workload.  For example, if you "only" have 100MB
      of spare memory and you are also repeatedly and concurrently
      expanding (say) large source trees from tarballs and then deleting
      those source trees, the system might OOM.  The reason OOM might
      happen is that each close() of a file generates an RCU callback,
      and 40 seconds worth of waiting-for-a-grace-period structures
      takes up a surprisingly large amount of memory.

      So please be careful when combining tests.  ;-)

o     There are no aggressive real-time workloads on the system.
      The reason for this is that RCU is going to start sending IPIs
      halfway to the RCU CPU stall timeout, and, in certain situations
      on CONFIG_NO_HZ_FULL kernels, much earlier.  (These situations
      constitute abuse of CONFIG_NO_HZ_FULL, but then again carefully
      calibrated abuse is what stress testing is all about.)

o     The various RCU kthreads will get a chance to run at least once
      during the SCHED_DEADLINE cycle.  If in real life, they only
      get a chance to run once per two SCHED_DEADLINE cycles, then of
      course the 14 seconds becomes 7 and the 40 seconds becomes 20.

And there are configurations and workloads that might require division
by three, so that (assuming one chance to run per cycle), the 14 seconds
becomes about 5 and the 40 seconds becomes about 15.

quoted

o     The current RCU CPU stall warning defaults remain in
      place.  These are set by the CONFIG_RCU_CPU_STALL_TIMEOUT
      Kconfig parameter, which may in turn be overridden by the
      rcupdate.rcu_cpu_stall_timeout kernel boot parameter.

o     The current SCHED_DEADLINE default for providing spare cycles
      for other uses remains in place.

o     Other kthreads might have other constraints, but given that you
      were seeing RCU CPU stall warnings instead of other failures,
      the needs of RCU's kthreads seem to be a good place to start.

Again, the candidate rough rule of thumb is that the the SCHED_DEADLINE
cycle be no more than 14 seconds when testing mainline kernels on the one
hand and 40 seconds when testing enterprise distro kernels on the other.

Dmitry, does that help?

I checked with the people running the Linux Plumbers Conference Scheduler
Microconference, and they said that they would welcome a proposal on
this topic, which I have submitted (please see below).  Would anyone
like to join as co-conspirator?

                                                        Thanx, Paul

------------------------------------------------------------------------

Title: Making SCHED_DEADLINE safe for kernel kthreads

Abstract:

Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr()
that can result in SCHED_DEADLINE tasks starving RCU's kthreads for
extended time periods, not millisecond, not seconds, not minutes, not even
hours, but days. Given that RCU CPU stall warnings are issued whenever
an RCU grace period fails to complete within a few tens of seconds,
the system did not suffer silently. Although one could argue that people
should avoid abusing sched_setattr(), people are human and humans make
mistakes. Responding to simple mistakes with RCU CPU stall warnings is
all well and good, but a more severe case could OOM the system, which
is a particularly unhelpful error message.

It would be better if the system were capable of operating reasonably
despite such abuse. Several approaches have been suggested.

First, sched_setattr() could recognize parameter settings that put
kthreads at risk and refuse to honor those settings. This approach
of course requires that we identify precisely what combinations of
sched_setattr() parameters settings are risky, especially given that there
are likely to be parameter settings that are both risky and highly useful.

Second, in theory, RCU could detect this situation and take the "dueling
banjos" approach of increasing its priority as needed to get the CPU time
that its kthreads need to operate correctly. However, the required amount
of CPU time can vary greatly depending on the workload. Furthermore,
non-RCU kthreads also need some amount of CPU time, and replicating
"dueling banjos" across all such Linux-kernel subsystems seems both
wasteful and error-prone. Finally, experience has shown that setting
RCU's kthreads to real-time priorities significantly harms performance
by increasing context-switch rates.

Third, stress testing could be limited to non-risky regimes, such that
kthreads get CPU time every 5-40 seconds, depending on configuration
and experience. People needing risky parameter settings could then test
the settings that they actually need, and also take responsibility for
ensuring that kthreads get the CPU time that they need. (This of course
includes per-CPU kthreads!)

Fourth, bandwidth throttling could treat tasks in other scheduling classes
as an aggregate group having a reasonable aggregate deadline and CPU
budget. This has the advantage of allowing "abusive" testing to proceed,
which allows people requiring risky parameter settings to rely on this
testing. Additionally, it avoids complex progress checking and priority
setting on the part of many kthreads throughout the system. However,
if this was an easy choice, the SCHED_DEADLINE developers would likely
have selected it. For example, it is necessary to determine what might
be a "reasonable" aggregate deadline and CPU budget. Reserving 5%
seems quite generous, and RCU's grace-period kthread would optimally
like a deadline in the milliseconds, but would do reasonably well with
many tens of milliseconds, and absolutely needs a few seconds. However,
for CONFIG_RCU_NOCB_CPU=y, the RCU's callback-offload kthreads might
well need a full CPU each! (This happens when the CPU being offloaded
generates a high rate of callbacks.)

The goal of this proposal is therefore to generate face-to-face
discussion, hopefully resulting in a good and sufficient solution to
this problem.


I would be happy to attend if this won't conflict with important
things on the testing and fuzzing MC.

If we restrict arguments for sched_attr, what would be the criteria
for 100% safe arguments? Moving the check from kernel to user-space
does not relief us from explicitly stating the condition in
black-and-white way. All of sched_runtime/sched_deadline/sched_period
be not larger than 1 second?

The problem is that syzkaller does not allow 100% reliable enforcement
for indirect arguments in memory. E.g. inputs arguments can overlap,
input/output can overlap, weird races affect what's actually being
passed to kernel, the memory being mapped from a weird device, etc.
And that's also useful as it can discover TOCTOU bugs, deadlocks, etc.
We could try to wrap sched_setattr and do some additional restrictions
by giving up on TOCTOU, device-mapped memory, etc.

I am also thinking about dropping CAP_SYS_NICE, it should still allow
some configurations, but no inherently unsafe ones.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help