Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
From: Vincent Guittot <vincent.guittot@linaro.org>
Date: 2020-06-11 12:01:26
Also in:
linux-fsdevel, lkml
On Thu, 11 Jun 2020 at 12:24, Qais Yousef [off-list ref] wrote:
> On 06/09/20 19:10, Vincent Guittot wrote:
> > On Mon, 8 Jun 2020 at 14:31, Qais Yousef [off-list ref] wrote:
> > > On 06/04/20 14:14, Vincent Guittot wrote:
> > > [...]
> > > > I have tried your patch and I don't see any difference compared to
> > > > previous tests. Let me give you more details of my setup:
> > > > I create 3 levels of cgroups and usually run the tests in the 4 levels
> > > > (which includes root). The results above are for the root level.
> > > >
> > > > But I see a difference at other levels:
> > > >
> > > >                            root           level 1        level 2        level 3
> > > > /w patch uclamp disable    50097          46615          43806          41078
> > > > tip uclamp enable          48706(-2.78%)  45583(-2.21%)  42851(-2.18%)  40313(-1.86%)
> > > > /w patch uclamp enable     48882(-2.43%)  45774(-1.80%)  43108(-1.59%)  40667(-1.00%)
> > > >
> > > > Whereas tip with uclamp stays around 2% behind tip without uclamp, the
> > > > diff of uclamp with your patch tends to decrease when we increase the
> > > > number of levels.
> > >
> > > So I did try to dig more into this, but I think it's either not a good
> > > reproducer or what we're observing here is uArch level latencies caused by the
> > > new code that seem to produce a bigger knock-on effect than what they really
> > > are.
> > >
> > > First, CONFIG_FAIR_GROUP_SCHED is 'expensive', for some definition of
> > > expensive..
> >
> > yes, enabling CONFIG_FAIR_GROUP_SCHED adds an overhead
> > > *** uclamp disabled/fair group enabled ***
> > >
> > > # Executed 50000 pipe operations between two threads
> > > Total time: 0.958 [sec]
> > >
> > > 19.177100 usecs/op
> > > 52145 ops/sec
> > >
> > > *** uclamp disabled/fair group disabled ***
> > >
> > > # Executed 50000 pipe operations between two threads
> > > Total time: 0.808 [sec]
> > >
> > > 16.176200 usecs/op
> > > 61819 ops/sec
> > >
> > > So there's a 15.6% drop in ops/sec when enabling this option. I think it's good
> > > to look at the absolute number of usecs/op: fair group adds around
> > > 3 usecs/op.
> > >
> > > I dropped FAIR_GROUP_SCHED from my config to eliminate this overhead and focus
> > > solely on uclamp overhead.
> >
> > Have you checked that both tests run at the root level ?
>
> I haven't actively moved tasks to cgroups. As I said that snippet was
> particularly bad and I didn't see that level of nesting in every call.
> > Your function-graph log below shows several calls to
> > update_cfs_group(), which means that your trace below has not been
> > made at root level but most probably at the 3rd level, and I wonder if
> > you used the same setup for running the benchmark above. This could
> > explain such a huge difference, because I don't have such a difference on
> > my platform but more around 2%
>
> What prompted me to look at this is when you reported that even without uclamp
> the nested cgroup showed a drop at each level. I was just trying to understand
> how both affect the hot path in hope to understand the root cause of uclamp
> overhead.
> > For uclamp disable/fair group enable/ function graph enable : 47994ops/sec
> > For uclamp disable/fair group disable/ function graph enable : 49107ops/sec
> > > With uclamp enabled but no fair group I get
> > >
> > > *** uclamp enabled/fair group disabled ***
> > >
> > > # Executed 50000 pipe operations between two threads
> > > Total time: 0.856 [sec]
> > >
> > > 17.125740 usecs/op
> > > 58391 ops/sec
> > >
> > > The drop is 5.5% in ops/sec. Or 1 usecs/op.
> > >
> > > I don't know what's the expectation here. 1 us could be a lot, but I don't think
> > > we expect the new code to take more than few 100s of ns anyway. If you add
> > > potential caching effects, reaching 1 us wouldn't be that hard.
> > >
> > > Note that in my runs I chose performance governor and use `taskset 0x2` to
> >
> > You might want to set 2 CPUs in your cpumask instead of 1 in order to
> > have 1 CPU for each thread
>
> I did try that but it didn't seem to change the number. I think the 2 tasks
> interleave so running in 2 CPUs doesn't change the result. But to ease ftrace
> capture, it's easier to monitor a single cpu.
> > > force running on a big core to make sure the runs are repeatable.
> >
> > I also use performance governor but don't pin tasks because I use smp.
>
> Is your arm platform SMP?
Yes, all my tests are done on the Arm64 octo core smp system
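
For reference, the relative drops quoted above can be re-derived from the raw perf bench figures. This is only a minimal sketch: the helper names drop_pct() and delta_us() are made up for illustration, and the baseline is assumed to be the "uclamp disabled/fair group disabled" run (61819 ops/sec, 16.176200 usecs/op).

```c
#include <stdio.h>

/* Relative drop of `value` against `baseline`, in percent. */
double drop_pct(double baseline, double value)
{
	return (baseline - value) / baseline * 100.0;
}

/* Extra time per operation against the baseline, in microseconds. */
double delta_us(double baseline_us, double value_us)
{
	return value_us - baseline_us;
}

/* Print one comparison line; `name` is just a free-form label. */
void report(const char *name, double base_ops, double ops,
	    double base_us, double us)
{
	printf("%-32s %5.2f%% fewer ops/sec, +%.2f usecs/op\n",
	       name, drop_pct(base_ops, ops), delta_us(base_us, us));
}
```

Calling report("fair group enabled", 61819, 52145, 16.176200, 19.177100) and report("uclamp enabled", 61819, 58391, 16.176200, 17.125740) from a small main() gives drops of about 15.6% and 5.5%, i.e. roughly 3 us and 1 us per operation, matching the numbers in the quoted text.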
> > > On Juno-r2 I managed to scrap most of the 1 us with the below patch. It seems
> > > there was weird branching behavior that affects the I$ in my case. It'd be good
> > > to try it out to see if it makes a difference for you.
> >
> > The perf are slightly worse on my setup:
> > For uclamp enable/fair group disable/ function graph enable : 48413ops/sec
> > with patch below : 47804ops/sec
>
> I am not sure if the new code could just introduce worse cache performance
> in a platform dependent way. The evidence I have so far points in this
> direction.
> > > The I$ effect is my best educated guess. Perf doesn't catch this path and I
> > > couldn't convince it to look at cache and branch misses between 2 specific
> > > points.
> > >
> > > Other subtle code shuffling did have weird effect on the result too. One worthy
> > > one is making uclamp_rq_dec() noinline gains back ~400 ns. Making
> > > uclamp_rq_inc() noinline *too* cancels this gain out :-/
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index 0464569f26a7..0835ee20a3c7 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -1071,13 +1071,11 @@ static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
> > >  
> > >  static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> > >  {
> > > -	enum uclamp_id clamp_id;
> > > -
> > >  	if (unlikely(!p->sched_class->uclamp_enabled))
> > >  		return;
> > >  
> > > -	for_each_clamp_id(clamp_id)
> > > -		uclamp_rq_inc_id(rq, p, clamp_id);
> > > +	uclamp_rq_inc_id(rq, p, UCLAMP_MIN);
> > > +	uclamp_rq_inc_id(rq, p, UCLAMP_MAX);
> > >  
> > >  	/* Reset clamp idle holding when there is one RUNNABLE task */
> > >  	if (rq->uclamp_flags & UCLAMP_FLAG_IDLE)
> > > @@ -1086,13 +1084,11 @@ static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
> > >  
> > >  static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
> > >  {
> > > -	enum uclamp_id clamp_id;
> > > -
> > >  	if (unlikely(!p->sched_class->uclamp_enabled))
> > >  		return;
> > >  
> > > -	for_each_clamp_id(clamp_id)
> > > -		uclamp_rq_dec_id(rq, p, clamp_id);
> > > +	uclamp_rq_dec_id(rq, p, UCLAMP_MIN);
> > > +	uclamp_rq_dec_id(rq, p, UCLAMP_MAX);
> > >  }
> > >  
> > >  static inline void
> > >
> > > FWIW I fail to see activate/deactivate_task in perf record. They don't show up
> > > on the list which means this micro benchmark doesn't stress them as Mel's test
> > > does.
> >
> > Strange because I have been able to trace them.
>
> On your arm platform? I can certainly see them on x86.
yes on my arm platform
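
For readers skimming the quoted patch: it replaces the for_each_clamp_id() loop with two explicit calls, one per clamp id. The standalone sketch below only illustrates that the two forms do the same work. struct fake_rq and the fake_rq_*() helpers are hypothetical stand-ins, not kernel code, and the sketch says nothing about the I$ or branch behaviour being discussed.

```c
/* Same enumerators as the kernel's enum uclamp_id. */
enum uclamp_id { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

#define for_each_clamp_id(clamp_id) \
	for ((clamp_id) = 0; (clamp_id) < UCLAMP_CNT; (clamp_id)++)

/* Hypothetical stand-in for struct rq: one task count per clamp id. */
struct fake_rq {
	unsigned int tasks[UCLAMP_CNT];
};

/* Hypothetical stand-in for uclamp_rq_inc_id(). */
void fake_rq_inc_id(struct fake_rq *rq, enum uclamp_id clamp_id)
{
	rq->tasks[clamp_id]++;
}

/* Loop form, as in the code before the patch. */
void fake_rq_inc_loop(struct fake_rq *rq)
{
	enum uclamp_id clamp_id;

	for_each_clamp_id(clamp_id)
		fake_rq_inc_id(rq, clamp_id);
}

/* Unrolled form, as in the quoted patch. */
void fake_rq_inc_unrolled(struct fake_rq *rq)
{
	fake_rq_inc_id(rq, UCLAMP_MIN);
	fake_rq_inc_id(rq, UCLAMP_MAX);
}
```

Both functions leave every counter incremented exactly once, so any difference measured between them comes from the generated code, not from the logic.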
> Thanks
>
> --
> Qais Yousef