Re: INFO: rcu detected stall in ext4_write_checks
From: Dmitry Vyukov <dvyukov@google.com>
Date: 2019-07-15 13:29:47
Also in:
linux-ext4, lkml
On Mon, Jul 15, 2019 at 3:01 PM Paul E. McKenney [off-list ref] wrote:
On Sun, Jul 14, 2019 at 08:10:27PM -0700, Paul E. McKenney wrote:
On Sun, Jul 14, 2019 at 12:29:51PM -0700, Paul E. McKenney wrote:
On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:
On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
But short term I don't see any other solution than to stop testing sched_setattr because it does not check arguments enough to prevent system misbehavior. Which is a pity because syzkaller has found some bad misconfigurations that were oversights on the checking side. Any other suggestions?

Or maybe syzkaller can put its own limitations on what parameters are sent to sched_setattr? In practice, there are any number of ways a root user can shoot themselves in the foot when using sched_setattr or sched_setaffinity, for that matter. I imagine there must be some such constraints already --- or else syzkaller might have set a kernel thread to run with priority SCHED_BATCH, with similar catastrophic effects --- or do similar configurations to make system threads completely unschedulable.

Real time administrators who know what they are doing --- and who know that their real-time threads are well behaved --- will always want to be able to do things that will be catastrophic if the real-time thread is *not* well behaved. I don't think it is possible to add safety checks which would allow the kernel to automatically detect and reject unsafe configurations.

An apt analogy might be civilian versus military aircraft. Most airplanes are designed to be "inherently stable"; that way, modulo buggy/insane control systems like on the 737 Max, the airplane will automatically return to straight and level flight. On the other hand, some military planes (for example, the F-16, F-22, F-36, the Eurofighter, etc.) are sometimes designed to be unstable, since that way they can be more maneuverable.

There are use cases for real-time Linux where this flexibility/power vs. stability tradeoff is going to argue for giving root the flexibility to crash the system. Some of these systems might literally involve using real-time Linux in military applications, something for which Paul and I have had some experience.
:-)

Speaking of sched_setaffinity, one thing which we can do is have syzkaller move all of the system threads so that they run on the "system CPUs", and then move the syzkaller processes which are testing the kernel to be on the "system under test CPUs". Then regardless of what priority the syzkaller test programs try to run themselves at, they can't crash the system.

Some real-time systems do actually run this way, and it's a recommended configuration which is much safer than letting the real-time threads take over the whole system:

http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application

Good point! We might still have issues with some per-CPU kthreads, but perhaps use of nohz_full would help at least reduce these sorts of problems. (There could still be issues on CPUs with more than one runnable thread.)

I looked at testing limitations in a bit more detail from an RCU viewpoint, and came up with the following rough rule of thumb (which of course might or might not survive actual testing experience, but should at least be a good place to start). I believe that the sched_setaffinity() testing rule should be that the SCHED_DEADLINE cycle be no more than two-thirds of the RCU CPU stall warning timeout, which defaults to 21 seconds in mainline and 60 seconds in many distro kernels.

That is, the SCHED_DEADLINE cycle should never exceed 14 seconds when testing mainline on the one hand or 40 seconds when testing enterprise distros on the other.

This assumes quite a bit, though:

o The system has ample memory to spare, and isn't running a callback-hungry workload. For example, if you "only" have 100MB of spare memory and you are also repeatedly and concurrently expanding (say) large source trees from tarballs and then deleting those source trees, the system might OOM.
  The reason OOM might happen is that each close() of a file generates an RCU callback, and 40 seconds worth of waiting-for-a-grace-period structures takes up a surprisingly large amount of memory. So please be careful when combining tests. ;-)

o There are no aggressive real-time workloads on the system. The reason for this is that RCU is going to start sending IPIs halfway to the RCU CPU stall timeout, and, in certain situations on CONFIG_NO_HZ_FULL kernels, much earlier. (These situations constitute abuse of CONFIG_NO_HZ_FULL, but then again carefully calibrated abuse is what stress testing is all about.)

o The various RCU kthreads will get a chance to run at least once during the SCHED_DEADLINE cycle. If in real life, they only get a chance to run once per two SCHED_DEADLINE cycles, then of course the 14 seconds becomes 7 and the 40 seconds becomes 20.

And there are configurations and workloads that might require division by three, so that (assuming one chance to run per cycle), the 14 seconds becomes about 5 and the 40 seconds becomes about 15.
o The current RCU CPU stall warning defaults remain in place. These are set by the CONFIG_RCU_CPU_STALL_TIMEOUT Kconfig parameter, which may in turn be overridden by the rcupdate.rcu_cpu_stall_timeout kernel boot parameter.

o The current SCHED_DEADLINE default for providing spare cycles for other uses remains in place.

o Other kthreads might have other constraints, but given that you were seeing RCU CPU stall warnings instead of other failures, the needs of RCU's kthreads seem to be a good place to start.

Again, the candidate rough rule of thumb is that the SCHED_DEADLINE cycle be no more than 14 seconds when testing mainline kernels on the one hand and 40 seconds when testing enterprise distro kernels on the other.

Dmitry, does that help?

I checked with the people running the Linux Plumbers Conference Scheduler Microconference, and they said that they would welcome a proposal on this topic, which I have submitted (please see below). Would anyone like to join as co-conspirator?

Thanx, Paul

------------------------------------------------------------------------

Title: Making SCHED_DEADLINE safe for kernel kthreads

Abstract:

Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr() that can result in SCHED_DEADLINE tasks starving RCU's kthreads for extended time periods, not milliseconds, not seconds, not minutes, not even hours, but days. Given that RCU CPU stall warnings are issued whenever an RCU grace period fails to complete within a few tens of seconds, the system did not suffer silently.

Although one could argue that people should avoid abusing sched_setattr(), people are human and humans make mistakes. Responding to simple mistakes with RCU CPU stall warnings is all well and good, but a more severe case could OOM the system, which is a particularly unhelpful error message. It would be better if the system were capable of operating reasonably despite such abuse.

Several approaches have been suggested.
First, sched_setattr() could recognize parameter settings that put kthreads at risk and refuse to honor those settings. This approach of course requires that we identify precisely what combinations of sched_setattr() parameter settings are risky, especially given that there are likely to be parameter settings that are both risky and highly useful.

Second, in theory, RCU could detect this situation and take the "dueling banjos" approach of increasing its priority as needed to get the CPU time that its kthreads need to operate correctly. However, the required amount of CPU time can vary greatly depending on the workload. Furthermore, non-RCU kthreads also need some amount of CPU time, and replicating "dueling banjos" across all such Linux-kernel subsystems seems both wasteful and error-prone. Finally, experience has shown that setting RCU's kthreads to real-time priorities significantly harms performance by increasing context-switch rates.

Third, stress testing could be limited to non-risky regimes, such that kthreads get CPU time every 5-40 seconds, depending on configuration and experience. People needing risky parameter settings could then test the settings that they actually need, and also take responsibility for ensuring that kthreads get the CPU time that they need. (This of course includes per-CPU kthreads!)

Fourth, bandwidth throttling could treat tasks in other scheduling classes as an aggregate group having a reasonable aggregate deadline and CPU budget. This has the advantage of allowing "abusive" testing to proceed, which allows people requiring risky parameter settings to rely on this testing. Additionally, it avoids complex progress checking and priority setting on the part of many kthreads throughout the system. However, if this were an easy choice, the SCHED_DEADLINE developers would likely have selected it. For example, it is necessary to determine what might be a "reasonable" aggregate deadline and CPU budget.
Reserving 5% seems quite generous, and RCU's grace-period kthread would optimally like a deadline in the milliseconds, but would do reasonably well with many tens of milliseconds, and absolutely needs a few seconds. However, for CONFIG_RCU_NOCB_CPU=y, RCU's callback-offload kthreads might well need a full CPU each! (This happens when the CPU being offloaded generates a high rate of callbacks.)

The goal of this proposal is therefore to generate face-to-face discussion, hopefully resulting in a good and sufficient solution to this problem.
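The arithmetic behind the rule of thumb quoted above can be sketched as follows. This is only an illustration: the function name and the `cycles_per_kthread_run` parameter are made up here, and the 21- and 60-second timeouts are simply the defaults cited in the quoted message, not values read from a running kernel.

```python
# Hypothetical helper illustrating the quoted rule of thumb.
# It is not a kernel or syzkaller API.

MAINLINE_STALL_TIMEOUT_S = 21  # mainline default cited in the thread
DISTRO_STALL_TIMEOUT_S = 60    # "many distro kernels", per the thread


def max_deadline_cycle_seconds(stall_timeout_s, cycles_per_kthread_run=1):
    """Upper bound on the SCHED_DEADLINE cycle, in seconds.

    The bound is two-thirds of the RCU CPU stall warning timeout, divided
    by the number of SCHED_DEADLINE cycles that may pass before the RCU
    kthreads get a chance to run (1 in the base rule, 2 or 3 in the more
    pessimistic cases mentioned in the thread).
    """
    if stall_timeout_s <= 0:
        raise ValueError("stall timeout must be positive")
    if cycles_per_kthread_run < 1:
        raise ValueError("cycles_per_kthread_run must be at least 1")
    return (2 * stall_timeout_s) / (3 * cycles_per_kthread_run)


if __name__ == "__main__":
    for name, timeout in (("mainline", MAINLINE_STALL_TIMEOUT_S),
                          ("distro", DISTRO_STALL_TIMEOUT_S)):
        for divisor in (1, 2, 3):
            limit = max_deadline_cycle_seconds(timeout, divisor)
            print(f"{name}: divisor {divisor}: {limit:.1f} s")
```

With a divisor of 1 this gives 14 and 40 seconds, with 2 it gives 7 and 20, and with 3 it gives roughly 4.7 and 13.3, which the quoted message rounds to "about 5" and "about 15".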
I would be happy to attend if this won't conflict with important things on the testing and fuzzing MC.

If we restrict arguments for sched_attr, what would be the criteria for 100% safe arguments? Moving the check from kernel to user-space does not relieve us from explicitly stating the condition in a black-and-white way. Should all of sched_runtime/sched_deadline/sched_period be no larger than 1 second?

The problem is that syzkaller does not allow 100% reliable enforcement for indirect arguments in memory. E.g. input arguments can overlap, input/output can overlap, weird races affect what's actually being passed to the kernel, the memory may be mapped from a weird device, etc. And that's also useful as it can discover TOCTOU bugs, deadlocks, etc. We could try to wrap sched_setattr and apply some additional restrictions by giving up on TOCTOU, device-mapped memory, etc.

I am also thinking about dropping CAP_SYS_NICE; it should still allow some configurations, but no inherently unsafe ones.
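One way to picture the wrapper idea above is to clamp a copy of the three SCHED_DEADLINE parameters before the call is issued. The sketch below is hypothetical and is not syzkaller code: `DeadlineParams` and `clamp_deadline_params` are invented names, and the 1-second cap is just the figure floated above, not a bound anyone has shown to be safe. Values are in nanoseconds, the unit struct sched_attr uses for these fields.

```python
from dataclasses import dataclass

NSEC_PER_SEC = 1_000_000_000

# Assumption: the 1-second cap is only the candidate value from this thread.
DEFAULT_CAP_NS = 1 * NSEC_PER_SEC


@dataclass(frozen=True)
class DeadlineParams:
    """Stand-in for the sched_runtime/sched_deadline/sched_period fields."""
    runtime_ns: int
    deadline_ns: int
    period_ns: int


def clamp_deadline_params(params, cap_ns=DEFAULT_CAP_NS):
    """Return a copy of params with each field limited to cap_ns.

    Only the upper bound is enforced here. Whether the result is otherwise
    acceptable (for example, runtime <= deadline <= period) is left to the
    kernel's own checks.
    """
    return DeadlineParams(
        runtime_ns=min(params.runtime_ns, cap_ns),
        deadline_ns=min(params.deadline_ns, cap_ns),
        period_ns=min(params.period_ns, cap_ns),
    )


if __name__ == "__main__":
    # A fuzzer-style request with a multi-day period.
    wild = DeadlineParams(runtime_ns=5 * NSEC_PER_SEC,
                          deadline_ns=10 * NSEC_PER_SEC,
                          period_ns=3 * 24 * 3600 * NSEC_PER_SEC)
    print(clamp_deadline_params(wild))
```

A clamp like this only constrains the values the wrapper itself sees. It does nothing about memory that changes between the check and the kernel's read, which is the TOCTOU coverage such a wrapper would give up.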