Re: [PATCH v2] sched: Warn on long periods of pending need_resched

From: Peter Zijlstra <peterz@infradead.org>
Date: 2021-03-24 14:37:07
Also in: linux-fsdevel, lkml

On Wed, Mar 24, 2021 at 01:39:16PM +0000, Mel Gorman wrote:

Yeah, let's say I was pleasantly surprised to find it there :-)

Minimally, let's move that out before it gets kicked out. Patch below.

OK, stuck that in front.

Moving something like sched_min_granularity_ns will break a number of
tuning guides as well as the "tuned" tool which ships by default with
some distros and I believe some of the default profiles used for tuned
tweak kernel.sched_min_granularity_ns

Yeah, can't say I care. I suppose some people with PREEMPT=n kernels
increase that to make their server workloads 'go fast'. But it'll
absolutely suck rocks on anything desktop.

Broadly speaking yes and despite the lack of documentation, enough people
think of that parameter when tuning for throughput vs latency depending on
the expected use of the machine.  kernel.sched_wakeup_granularity_ns might
get tuned if preemption is causing overscheduling. Same potentially with
kernel.sched_min_granularity_ns and kernel.sched_latency_ns. That said, I'm
struggling to think of an instance where I've seen tuning recommendations
properly quantified other than the impact on microbenchmarks but I
think there will be complaining if they disappear. I suspect that some
recommended tuning is based on "I tried a number of different values and
this seemed to work reasonably well".

Right, except that due to that scaling thing, you'd have to re-evaluate
when you change machine.
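
For readers unfamiliar with the scaling mentioned above, here is a minimal
user-space sketch. It assumes the behaviour of kernels from around this
time: the effective value of these tunables is a base value multiplied by a
factor derived from the number of online CPUs (capped at 8), with a
logarithmic mode as the default. The names below are illustrative, not the
kernel's own identifiers, and the details should be checked against
kernel/sched/fair.c for the kernel in question.

```c
#include <assert.h>

/* Illustrative scaling modes (assumed to mirror the kernel's options). */
enum tunable_scaling { SCALING_NONE, SCALING_LOG, SCALING_LINEAR };

/* Integer floor(log2(n)) for n >= 1. */
static unsigned int ilog2_u(unsigned int n)
{
	unsigned int r = 0;

	while (n >>= 1)
		r++;
	return r;
}

/* Multiplier applied to the base tunable; CPU count is capped at 8. */
static unsigned int scaling_factor(unsigned int online_cpus,
				   enum tunable_scaling mode)
{
	unsigned int cpus = online_cpus < 8 ? online_cpus : 8;

	assert(online_cpus >= 1);

	switch (mode) {
	case SCALING_NONE:
		return 1;
	case SCALING_LINEAR:
		return cpus;
	case SCALING_LOG:
	default:
		return 1 + ilog2_u(cpus);
	}
}

/* Effective tunable value in nanoseconds for a given machine size. */
static unsigned long long effective_ns(unsigned long long base_ns,
				       unsigned int online_cpus,
				       enum tunable_scaling mode)
{
	return base_ns * scaling_factor(online_cpus, mode);
}
```

Under these assumptions a base of 750000 ns becomes 1500000 ns on a 2-CPU
machine and 3000000 ns on anything with 8 or more CPUs, so a value
hand-tuned on one machine does not mean the same thing on another.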

Also, do you have any indication of the perf difference we're talking
about? (I should probably ask Google and not you...)

kernel.sched_schedstats probably should not depend on SCHED_DEBUG because
it has value for workload analysis which is not necessarily about debugging
per-se. It might simply be informing whether another variable should be
tuned or useful for debugging applications rather than the kernel.

Dubious, if you're that far down the rabbit hole, you're dang near
debugging.

As an aside, I wonder how often SCHED_DEBUG has been enabled simply
because LATENCYTOP selects it -- no idea offhand why LATENCYTOP even
needs SCHED_DEBUG.

Perhaps schedstats used to rely on debug? I can't remember. I don't
think I've used latencytop in at least 10 years. ftrace and perf sorta
killed the need for it.

These knobs really shouldn't have been as widely available as they are.

Probably not. Worse, some of the tuning is probably based on "this worked
for workload X 10 years ago so I'll just keep doing that"

That sounds like an excellent reason to disrupt ;-)

And guides, well, the writers have to earn a living too, right?

For most of the guides I've seen they either specify values without
explaining why or just describe roughly what the parameter does and it's
not always that accurate a description.

Another good reason.

Whether there are legitimate reasons to modify those values or not,
removing them may generate fun bug reports.

Which I'll close with -EDONTCARE, userspace has to cope with
SCHED_DEBUG=n in any case.
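
As an illustration of what "coping" could look like in a tuning tool, here
is a minimal sketch that tries a list of candidate files in order and
tolerates every one of them being absent. The candidate paths are supplied
by the caller; which paths exist on a given kernel is an assumption the
caller has to verify, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Read one unsigned value from a file; 0 on success, -1 on any failure. */
static int read_u64_file(const char *path, unsigned long long *out)
{
	FILE *f;
	int ok;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;	/* knob not present (or not readable) here */
	ok = (fscanf(f, "%llu", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Try each candidate path in order. Returns the index of the first path
 * that yielded a value, or -1 if none did, in which case the caller
 * should carry on without the knob rather than treat it as fatal.
 */
static int read_first_available(const char *const paths[], int n,
				unsigned long long *out)
{
	for (int i = 0; i < n; i++)
		if (read_u64_file(paths[i], out) == 0)
			return i;
	return -1;
}
```

A caller might pass /proc/sys/kernel/sched_min_granularity_ns first and a
debugfs location second (for example /sys/kernel/debug/sched/min_granularity_ns,
which is an assumption here and not taken from this thread), and simply
skip the tuning step when the result is -1.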

True but removing the throughput vs latency parameters is likely to
generate a lot of noise even if the reasons for tuning are bad ones.
Some definitely should not be depending on SCHED_DEBUG, others may
need to be moved to debugfs one patch at a time so they can be reverted
individually if complaining is excessive and there is a legitimate reason
why it should be tuned. It's possible that complaining will be based on
a workload regression that really depended on tuned changing parameters.

The way I've done it, you can simply reinstate the sysctl table entry
and it'll work again, except for the entries that had a custom handler.

I'm ready to disrupt :-)
