Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

From: Tobias Huschle <hidden>
Date: 2024-03-19 09:08:30
Also in: lkml

On 2024-03-18 15:45, Luis Machado wrote:

On 3/14/24 13:45, Tobias Huschle wrote:

On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:

On 2/28/24 16:10, Tobias Huschle wrote:

Questions:
1. The kworker getting its negative lag occurs in the following scenario:
   - kworker and a cgroup are supposed to execute on the same CPU
   - one task within the cgroup is executing and wakes up the kworker
   - kworker with 0 lag gets picked immediately and finishes its
     execution within ~5000ns
   - on dequeue, kworker gets assigned a negative lag
   Is this expected behavior? With this short execution time, I would
   expect the kworker to be fine.

That strikes me as a bit odd as well. Have you been able to determine
how a negative lag is assigned to the kworker after such a short runtime?

I did some more trace reading though and found something.

What I observed if everything runs regularly:
- vhost and kworker run alternating on the same CPU
- if the kworker is done, it leaves the runqueue
- vhost wakes up the kworker if it needs it
--> this means:
  - vhost starts alone on an otherwise empty runqueue
  - it seems like it never gets dequeued
    (unless another unrelated task joins or migration hits)
  - if vhost wakes up the kworker, the kworker gets selected
  - vhost runtime > kworker runtime
    --> kworker gets positive lag and gets selected immediately next time

What happens if it does go wrong:
From what I gather, there seem to be occasions where the vhost either
executes surprisingly quickly or the kworker surprisingly slowly. If these
outliers reach critical values, it can happen that
   vhost runtime < kworker runtime
which now causes the kworker to get the negative lag.

In this case it seems that the vhost is very fast in waking up
the kworker and, coincidentally, the kworker takes more time than usual
to finish. We are talking about 4-digit to low 5-digit nanoseconds.

So, for these outliers, the scheduler extrapolates that the kworker
out-consumes the vhost and should be slowed down, although in the majority
of other cases this does not happen.
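
To make the runtime comparison above concrete, here is a minimal sketch of the lag arithmetic. It is a toy model, not the kernel code: it assumes exactly two equal-weight entities on the runqueue, takes lag as the average vruntime minus the task's own vruntime, and ignores lag clamping, slices, and the cgroup hierarchy. The function name and the runtime values are hypothetical.

```python
def kworker_lag_at_dequeue(vhost_runtime_ns, kworker_runtime_ns):
    """Toy model: lag of the kworker when it leaves the runqueue.

    Both tasks are assumed to start at the same vruntime (taken as 0)
    when the kworker is woken, and to have equal weight, so vruntime
    advances 1:1 with runtime.
    """
    v_vhost = vhost_runtime_ns
    v_kworker = kworker_runtime_ns
    # Average vruntime of the two queued entities (equal weights).
    avg_vruntime = (v_vhost + v_kworker) / 2
    # Positive lag: the task is owed service. Negative: it ran ahead.
    return avg_vruntime - v_kworker


# Hypothetical values in the 4-digit to low 5-digit nanosecond range.
for vhost_ns, kworker_ns in [(20_000, 5_000), (3_000, 12_000)]:
    lag = kworker_lag_at_dequeue(vhost_ns, kworker_ns)
    print(f"vhost={vhost_ns}ns kworker={kworker_ns}ns -> kworker lag={lag}ns")
```

In this model the kworker's lag is half the difference of the two runtimes, so it turns negative as soon as the kworker runs longer than the vhost did in the same window, no matter how small the absolute numbers are.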

Thanks for providing the above details, Tobias. It does seem like EEVDF
is strict about the eligibility checks and makes tasks wait when their
lags are negative, even if just a little bit, as in the case of the kworker.

There was a patch to disable the eligibility checks
(https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefesmat@chromium.org/),
which would make EEVDF more like EVDF, though the deadline comparison would
probably still favor the vhost task instead of the kworker with the
negative lag.

I'm not sure if you tried it, but I thought I'd mention it.
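
As a rough illustration of why dropping the eligibility check might not change the outcome, the sketch below picks the earliest virtual deadline with and without the check. It is a simplified model under stated assumptions (equal weights, plain average vruntime, made-up vruntime and deadline values); `Entity`, `avg_vruntime`, and `pick` are hypothetical names, not kernel functions.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    name: str
    vruntime: int  # virtual runtime in ns
    deadline: int  # virtual deadline in ns


def avg_vruntime(entities):
    # Equal-weight average; the kernel uses a weighted average.
    return sum(e.vruntime for e in entities) / len(entities)


def pick(entities, check_eligibility=True):
    """Return the entity with the earliest virtual deadline.

    With check_eligibility=True, only entities with non-negative lag
    (vruntime <= average vruntime) are considered, as in EEVDF.
    With check_eligibility=False, every entity competes on deadline only.
    """
    candidates = entities
    if check_eligibility:
        v = avg_vruntime(entities)
        candidates = [e for e in entities if e.vruntime <= v]
    return min(candidates, key=lambda e: e.deadline)


# Hypothetical state: the kworker carries a small negative lag, so it is
# placed behind the average and its deadline lands after the vhost's.
vhost = Entity("vhost", vruntime=100_000, deadline=3_090_000)
kworker = Entity("kworker", vruntime=109_000, deadline=3_109_000)

print(pick([vhost, kworker], check_eligibility=True).name)
print(pick([vhost, kworker], check_eligibility=False).name)
```

With these assumed numbers both variants select the vhost: the negative lag pushes the kworker's vruntime, and with it its deadline, past the vhost's, so the deadline comparison alone already disfavors it.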

I hadn't seen that one yet. Unfortunately, ignoring the eligibility does
not help.

I'm inclined to rather propose a documentation change, which describes
that tasks should not rely on woken-up tasks being scheduled immediately.

Changing things in the code to address the specific scenario I'm seeing
seems to mostly create unwanted side effects and/or would require the
definition of some magic cut-off values.
