Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value
From: Mel Gorman <mgorman@suse.de>
Date: 2020-05-29 10:08:23
Also in:
linux-fsdevel, lkml
On Thu, May 28, 2020 at 06:11:12PM +0200, Peter Zijlstra wrote:
quoted
FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead. Trying to see what goes on in there.Indeed, that one. The fact that regular distros cannot enable this feature due to performance overhead is unfortunate. It means there is a lot less potential for this stuff.
During that talk, I was a vague about the cost, admitted I had not looked too closely at mainline performance and had since deleted the data given that the problem was first spotted in early April. If I heard someone else making statements like I did at the talk, I would consider it a bit vague, potentially FUD, possibly wrong and worth rechecking myself. In terms of distributions "cannot enable this", we could but I was unwilling to pay the cost for a feature no one has asked for yet. If they had, I would endevour to put it behind static branches and disable it by default (like what happened for PSI). I was contacted offlist about my comments at OSPM and gathered new data to respond properly. For the record, here is an editted version of my response; --8<-- (Some context deleted that is not relevant)
Does it need any special admin configuration for system services, cgroups, scripts, etc?
Nothing special -- out of box configuration. Tests were executed via mmtests.
Which mmtests config file did you use?
I used network-netperf-unbound and network-netperf-cstate. network-netperf-unbound is usually the default but for some issues, I use the cstate configuration to limit C-states. For a perf profile, I used network-netperf-cstate-small and network-netperf-unbound-small to limit the amount of profile data that was collected. Just collecting data for 64 byte buffers was enough.
The server that I am going to configure is x86_64 numa, not arm64.
That's fine, I didn't actually test arm64 at all.
I have a 2 socket 24 CPUs X86 server (4 NUMA nodes, AMD Opteron 6174, L2 512KB/cpu, L3 6MB/node, RAM 40GB/node). Which machine did you run it on?
It was a 2-socket Haswell machine (E5-2670 v3) with 2 NUMA nodes. I used
5.7-rc7 with the openSUSE Leap 15.1 kernel configuration as a baseline.
I compared with and without uclamp enabled.
For network-netperf-unbound I see
netperf-udp
5.7.0-rc7 5.7.0-rc7
with-clamp without-clamp
Hmean send-64 238.52 ( 0.00%) 257.28 * 7.87%*
Hmean send-128 477.10 ( 0.00%) 511.57 * 7.23%*
Hmean send-256 945.53 ( 0.00%) 982.50 * 3.91%*
Hmean send-1024 3655.74 ( 0.00%) 3846.98 * 5.23%*
Hmean send-2048 6926.84 ( 0.00%) 7247.04 * 4.62%*
Hmean send-3312 10767.47 ( 0.00%) 10976.73 ( 1.94%)
Hmean send-4096 12821.77 ( 0.00%) 13506.03 * 5.34%*
Hmean send-8192 22037.72 ( 0.00%) 22275.29 ( 1.08%)
Hmean send-16384 35935.31 ( 0.00%) 34737.63 * -3.33%*
Hmean recv-64 238.52 ( 0.00%) 257.28 * 7.87%*
Hmean recv-128 477.10 ( 0.00%) 511.57 * 7.23%*
Hmean recv-256 945.45 ( 0.00%) 982.50 * 3.92%*
Hmean recv-1024 3655.74 ( 0.00%) 3846.98 * 5.23%*
Hmean recv-2048 6926.84 ( 0.00%) 7246.51 * 4.62%*
Hmean recv-3312 10767.47 ( 0.00%) 10975.93 ( 1.94%)
Hmean recv-4096 12821.76 ( 0.00%) 13506.02 * 5.34%*
Hmean recv-8192 22037.71 ( 0.00%) 22274.55 ( 1.07%)
Hmean recv-16384 35934.82 ( 0.00%) 34737.50 * -3.33%*
netperf-tcp
5.7.0-rc7 5.7.0-rc7
with-clamp without-clamp
Min 64 2004.71 ( 0.00%) 2033.23 ( 1.42%)
Min 128 3657.58 ( 0.00%) 3733.35 ( 2.07%)
Min 256 6063.25 ( 0.00%) 6105.67 ( 0.70%)
Min 1024 18152.50 ( 0.00%) 18487.00 ( 1.84%)
Min 2048 28544.54 ( 0.00%) 29218.11 ( 2.36%)
Min 3312 33962.06 ( 0.00%) 36094.97 ( 6.28%)
Min 4096 36234.82 ( 0.00%) 38223.60 ( 5.49%)
Min 8192 42324.06 ( 0.00%) 43328.72 ( 2.37%)
Min 16384 44323.33 ( 0.00%) 45315.21 ( 2.24%)
Hmean 64 2018.36 ( 0.00%) 2038.53 * 1.00%*
Hmean 128 3700.12 ( 0.00%) 3758.20 * 1.57%*
Hmean 256 6236.14 ( 0.00%) 6212.77 ( -0.37%)
Hmean 1024 18214.97 ( 0.00%) 18601.01 * 2.12%*
Hmean 2048 28749.56 ( 0.00%) 29728.26 * 3.40%*
Hmean 3312 34585.50 ( 0.00%) 36345.09 * 5.09%*
Hmean 4096 36777.62 ( 0.00%) 38576.17 * 4.89%*
Hmean 8192 43149.08 ( 0.00%) 43903.77 * 1.75%*
Hmean 16384 45478.27 ( 0.00%) 46372.93 ( 1.97%)
The cstate-limited config had similar results for UDP_STREAM but was
mostly indifferent for TCP_STREAM.
So for UDP_STREAM,. there is a fairly sizable difference for uclamp. There
are caveats, netperf is not 100% stable from a performance perspective on
NUMA machines. That's improved quite a bit with 5.7 but it still should
be treated with care.
When I first saw a problem, I was using ftrace looking for latencies and
uclamp appeared to crop up. As I didn't actually need uclamp and there was
no user request to support it, I simply dropped it from the master config
so it would get propogated to any distro we release with a 5.x kernel.
From a perf profile, it's not particularly obvious that uclamp is
involved so it could be in error but I doubt it. A diff of without vs
with looks like
# Event 'cycles:ppp'
#
# Baseline Delta Abs Shared Object Symbol
# ........ ......... ........................ ..............................................
#
9.59% -2.87% [kernel.vmlinux] [k] poll_idle
0.19% +1.85% [kernel.vmlinux] [k] activate_task
+1.17% [kernel.vmlinux] [k] dequeue_task
+0.89% [kernel.vmlinux] [k] update_rq_clock.part.73
3.88% +0.73% [kernel.vmlinux] [k] try_to_wake_up
3.17% +0.68% [kernel.vmlinux] [k] __schedule
1.16% -0.60% [kernel.vmlinux] [k] __update_load_avg_cfs_rq
2.20% -0.54% [kernel.vmlinux] [k] resched_curr
2.08% -0.29% [kernel.vmlinux] [k] _raw_spin_lock_irqsave
0.44% -0.29% [kernel.vmlinux] [k] cpus_share_cache
1.13% +0.23% [kernel.vmlinux] [k] _raw_spin_lock_bh
A lot of the uclamp functions appear to be inlined so it is not be
particularly obvious from a raw profile but it shows up in the annotated
profile in activate_task and dequeue_task for example. In the case of
dequeue_task, uclamp_rq_dec_id() is extremely expensive according to the
annotated profile.
I'm afraid I did not dig into this deeply once I knew I could just disable
it even within the distribution.
--
Mel Gorman
SUSE Labs