Re: [PATCH] sched: fix group_entity's share update
From: Peter Zijlstra <peterz@infradead.org>
Date: 2016-12-15 21:42:32
On Thu, Dec 01, 2016 at 05:38:53PM +0100, Vincent Guittot wrote:
The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.
Let's take the example of a task TA that is dequeued from its task group TG1.
TA was the only task in TG1, which becomes idle.
We have the sequence:
- dequeue_entity TA->se
  - update_load_avg(TA->se)
  - dequeue_entity_load_avg(TG1->cfs_rq, TA->se)
  - account_entity_dequeue(TG1->cfs_rq, TA->se)
    TG1->cfs_rq->load.weight = 0
  - update_cfs_shares(TG1->cfs_rq)
    TG1->se->load.weight is updated with the new share of
    cfs_rq. TG1->se->load.weight = 0.
- dequeue_entity TG1->se
  - update_load_avg(TG1->se) but its weight is now null so the last time
    slot (up to a tick) will be accounted with its new weight (0 in our case)
    instead of its real weight. The last time slot is accounted as an idle one
    whereas it was a running one.
If the running time of TA is short enough that no tick happens when it
runs, all running time of TG1->se will be accounted as idle time.
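
To make the accounting error concrete, here is a minimal sketch in C. It is a toy model, not kernel code: `struct toy_se`, `toy_update_load()` and `toy_reweight()` are hypothetical stand-ins, and the load is a plain weight-times-time sum with none of the decay that the real load tracking applies. It only shows that charging the last time slot after the weight has dropped to 0 loses the whole slot.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for TG1's group entity; not the kernel's sched_entity. */
struct toy_se {
	uint64_t weight;       /* current load weight */
	uint64_t last_update;  /* time of the last accounting, in microseconds */
	uint64_t load_sum;     /* accumulated weight * time, no decay (assumption) */
};

/* Simplified update_load_avg(): charge the elapsed slot at the current weight. */
static void toy_update_load(struct toy_se *se, uint64_t now)
{
	assert(now >= se->last_update);
	se->load_sum += se->weight * (now - se->last_update);
	se->last_update = now;
}

/* Simplified update_cfs_shares(): the group entity takes its new share. */
static void toy_reweight(struct toy_se *se, uint64_t new_weight)
{
	se->weight = new_weight;
}

/* Order before the patch: the weight is zeroed before the slot is charged. */
static uint64_t dequeue_old_order(uint64_t weight, uint64_t ran_us)
{
	struct toy_se tg1 = { .weight = weight, .last_update = 0, .load_sum = 0 };

	toy_reweight(&tg1, 0);          /* TA's dequeue already emptied TG1 */
	toy_update_load(&tg1, ran_us);  /* slot is charged at weight 0 */
	return tg1.load_sum;
}

/* Order after the patch: charge the slot first, then apply the new share. */
static uint64_t dequeue_new_order(uint64_t weight, uint64_t ran_us)
{
	struct toy_se tg1 = { .weight = weight, .last_update = 0, .load_sum = 0 };

	toy_update_load(&tg1, ran_us);  /* slot is charged at the real weight */
	toy_reweight(&tg1, 0);
	return tg1.load_sum;
}
```

With a weight of 1024 and 500us of running time, `dequeue_old_order()` returns 0 while `dequeue_new_order()` returns 512000, which is the "running slot accounted as idle" effect described above.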
Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.
update_cfs_shares() now takes the sched_entity as parameter instead of the
cfs_rq and the weight of the group_entity is updated only once its load_avg
has been synced with current time.

Urgh, brain hurt, also those names don't help; s/TG1/A/ s/TA/a/

So the problem is that in our for_each_sched_entity(se) loop we end up
changing the next se before we get there.

  root
      (cfs_rq)
          \
          (se)
           A
              (cfs_rq)
                  \
                  (se)
                   a

Starting at a's se, we update_cfs_shares() on A's cfs_rq, which then
updates A's se, which is the next se in our iteration and mucks with state
before we get there.

So you change update_cfs_shares() to go downward while we go upward,
ensuring we only update things that we've finished with.

Makes sense..
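
A rough sketch of that iteration-order problem, using A and a as named above. Everything here is a simplified assumption: `struct toy_se`, `nr_queued`, `seen_weight` and the "empty queue means weight 0" share rule are hypothetical and only mimic the shape of the upward walk, not the real share calculation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy entity; field names are illustrative, not the kernel's. */
struct toy_se {
	struct toy_se *parent;  /* next se in the upward walk, NULL at the top */
	long weight;            /* current load weight */
	int nr_queued;          /* entities queued on this se's own group queue */
	int is_group;           /* nonzero if this se represents a group */
	long seen_weight;       /* weight observed when the walk synced this se */
};

/* Toy share rule (assumption): an empty group queue gives weight 0. */
static void toy_update_shares(struct toy_se *group_se)
{
	if (group_se->nr_queued == 0)
		group_se->weight = 0;
}

/* Old order: the step for 'se' reweights its parent, the NEXT se in the walk. */
static void walk_old(struct toy_se *leaf)
{
	for (struct toy_se *se = leaf; se; se = se->parent) {
		se->seen_weight = se->weight;       /* sync load at current weight */
		if (se->parent) {
			assert(se->parent->nr_queued > 0);
			se->parent->nr_queued--;        /* dequeue from parent's queue */
			toy_update_shares(se->parent);  /* mucks with state ahead of us */
		}
	}
}

/* New order: only reweight the se that has just been synced. */
static void walk_new(struct toy_se *leaf)
{
	for (struct toy_se *se = leaf; se; se = se->parent) {
		se->seen_weight = se->weight;       /* sync load at current weight */
		if (se->is_group)
			toy_update_shares(se);          /* looks downward, at its own queue */
		if (se->parent) {
			assert(se->parent->nr_queued > 0);
			se->parent->nr_queued--;
		}
	}
}

/* Build root <- A <- a, dequeue a, and report the weight A was synced with. */
static long group_weight_seen(void (*walk)(struct toy_se *))
{
	struct toy_se A = { .parent = NULL, .weight = 1024, .nr_queued = 1, .is_group = 1 };
	struct toy_se a = { .parent = &A, .weight = 1024 };

	walk(&a);
	return A.seen_weight;
}
```

In this model `group_weight_seen(walk_old)` yields 0, because A was reweighted during a's step, while `group_weight_seen(walk_new)` yields 1024.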
 kernel/sched/fair.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18d9e75..19092fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2689,15 +2689,18 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
-static void update_cfs_shares(struct cfs_rq *cfs_rq)
+static void update_cfs_shares(struct sched_entity *se)
 {
 	struct task_group *tg;
-	struct sched_entity *se;
+	struct cfs_rq *cfs_rq = group_cfs_rq(se);
 	long shares;
please keep them ordered by length.
+ if (entity_is_task(se))
can be: !cfs_rq, which is the same and we've already done that load.
+		return;
+
 	tg = cfs_rq->tg;
This load isn't needed here yet, can be moved down a bit.
-	se = tg->se[cpu_of(rq_of(cfs_rq))];
-	if (!se || throttled_hierarchy(cfs_rq))
+
+	if (throttled_hierarchy(cfs_rq))
 		return;
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
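
On the `!cfs_rq` remark in the hunk above, a small sketch of why the two tests agree. It assumes the group-scheduling definitions in which `entity_is_task(se)` tests `se->my_q` and `group_cfs_rq(se)` returns that same pointer; the `toy_` types and helpers below are hypothetical mirrors of that, not the kernel's own.

```c
#include <assert.h>
#include <stddef.h>

/* Toy mirrors of the kernel types; only the field needed here is modelled. */
struct toy_cfs_rq { int nr_running; };
struct toy_se { struct toy_cfs_rq *my_q; };  /* NULL for a task entity */

/* Assumed shape of entity_is_task(): a task owns no group queue. */
static int toy_entity_is_task(const struct toy_se *se)
{
	assert(se != NULL);
	return !se->my_q;
}

/* Assumed shape of group_cfs_rq(): return the queue owned by a group se. */
static struct toy_cfs_rq *toy_group_cfs_rq(const struct toy_se *se)
{
	assert(se != NULL);
	return se->my_q;
}

/* As posted: load the queue, then test the entity again. */
static int skip_as_posted(const struct toy_se *se)
{
	struct toy_cfs_rq *cfs_rq = toy_group_cfs_rq(se);

	(void)cfs_rq;
	return toy_entity_is_task(se);
}

/* As suggested in review: reuse the pointer that was already loaded. */
static int skip_as_suggested(const struct toy_se *se)
{
	struct toy_cfs_rq *cfs_rq = toy_group_cfs_rq(se);

	return !cfs_rq;
}
```

Both helpers return 1 for a task entity and 0 for a group entity, so under that assumption the shorter test saves a second look at `se` without changing behaviour.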
@@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		se->vruntime += cfs_rq->min_vruntime;
 	update_load_avg(se, UPDATE_TG);
+	update_cfs_shares(se);
 	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
-	update_cfs_shares(cfs_rq);
 	if (flags & ENQUEUE_WAKEUP)
 		place_entity(cfs_rq, se, 0);
So here we need to update_cfs_shares() _before_ enqueue_entity, because the update_cfs_shares() will affect this se's load, right?
@@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
-	update_cfs_shares(cfs_rq);
+	update_cfs_shares(se);
 	/*
 	 * Now advance min_vruntime if @se was the entity holding it back,
But this one hurts my brain.. It must be done after dequeue_entity_load_avg() such that we subtract the load as was seen until now. Could we please add comments explaining this ordering, because I forever need to think about this (both enqueue and dequeue).
@@ -3864,7 +3869,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 * Ensure that runnable average is periodically updated.
 	 */
 	update_load_avg(curr, UPDATE_TG);
-	update_cfs_shares(cfs_rq);
+	update_cfs_shares(curr);
 #ifdef CONFIG_SCHED_HRTICK
 	/*
@@ -4761,7 +4766,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 		update_load_avg(se, UPDATE_TG);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(se);
 	}
 	if (!se)
@@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 		update_load_avg(se, UPDATE_TG);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(se);
 	}
 	if (!se)
This has a distinct pattern to it though; should we think about something like: UPDATE_SHARES for update_load_avg() or does that confuse things?
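
For illustration only, one way such a flag could look. This is a hypothetical sketch, not a proposal for the actual code: `TOY_UPDATE_TG`, `TOY_UPDATE_SHARES`, `struct toy_se` and `toy_update_load_avg()` are invented names, and the counters merely record what would be done.

```c
#include <assert.h>

#define TOY_UPDATE_TG      0x1u  /* stand-in for the existing UPDATE_TG flag */
#define TOY_UPDATE_SHARES  0x2u  /* hypothetical flag floated in this mail */

/* Toy entity that only counts which steps were performed. */
struct toy_se {
	int load_syncs;     /* times the load average was synced with now */
	int tg_updates;     /* times the task group contribution was refreshed */
	int share_updates;  /* times the group share was recomputed */
};

/* Sketch: fold the share update into the load update, always after the sync. */
static void toy_update_load_avg(struct toy_se *se, unsigned int flags)
{
	se->load_syncs++;                   /* load is synced with now first */

	if (flags & TOY_UPDATE_TG)
		se->tg_updates++;

	if (flags & TOY_UPDATE_SHARES) {
		/* shares are only touched once the load has been synced */
		assert(se->load_syncs > se->share_updates);
		se->share_updates++;
	}
}
```

A caller in the tick or task enqueue/dequeue loops would then pass `TOY_UPDATE_TG | TOY_UPDATE_SHARES`. The enqueue_entity() and dequeue_entity() sites place the share update at different points relative to the load_avg enqueue/dequeue, so a single flag may not fit them as-is.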
@@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		/* Possible calls to update_curr() need rq clock */
 		update_rq_clock(rq);
 		for_each_sched_entity(se)
-			update_cfs_shares(group_cfs_rq(se));
+			update_cfs_shares(se);
Should we not also catch up with our load before we frob the shares?
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
 	}