Thread (50 messages) 50 messages, 10 authors, 2020-06-24

Re: [PATCH 1/2] sched/uclamp: Add a new sysctl to control RT default boost value

From: Mel Gorman <mgorman@suse.de>
Date: 2020-05-29 10:08:23
Also in: linux-fsdevel, lkml

On Thu, May 28, 2020 at 06:11:12PM +0200, Peter Zijlstra wrote:
quoted
FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead.
Trying to see what goes on in there.
Indeed, that one. The fact that regular distros cannot enable this
feature due to performance overhead is unfortunate. It means there is a
lot less potential for this stuff.
During that talk, I was a vague about the cost, admitted I had not looked
too closely at mainline performance and had since deleted the data given
that the problem was first spotted in early April. If I heard someone
else making statements like I did at the talk, I would consider it a bit
vague, potentially FUD, possibly wrong and worth rechecking myself. In
terms of distributions "cannot enable this", we could but I was unwilling
to pay the cost for a feature no one has asked for yet. If they had, I
would endevour to put it behind static branches and disable it by default
(like what happened for PSI). I was contacted offlist about my comments
at OSPM and gathered new data to respond properly. For the record, here
is an editted version of my response;

--8<--

(Some context deleted that is not relevant)
Does it need any special admin configuration for system
services, cgroups, scripts, etc?
Nothing special -- out of box configuration. Tests were executed via
mmtests.
Which mmtests config file did you use?
I used network-netperf-unbound and network-netperf-cstate.
network-netperf-unbound is usually the default but for some issues, I
use the cstate configuration to limit C-states.

For a perf profile, I used network-netperf-cstate-small and
network-netperf-unbound-small to limit the amount of profile data that
was collected. Just collecting data for 64 byte buffers was enough.
The server that I am going to configure is x86_64 numa, not arm64.
That's fine, I didn't actually test arm64 at all.
I have a 2 socket 24 CPUs X86 server (4 NUMA nodes, AMD Opteron 6174,
L2 512KB/cpu, L3 6MB/node, RAM 40GB/node).
Which machine did you run it on?
It was a 2-socket Haswell machine (E5-2670 v3) with 2 NUMA nodes. I used
5.7-rc7 with the openSUSE Leap 15.1 kernel configuration as a baseline.
I compared with and without uclamp enabled.

For network-netperf-unbound I see

netperf-udp
                                  5.7.0-rc7              5.7.0-rc7
                                 with-clamp          without-clamp
Hmean     send-64         238.52 (   0.00%)      257.28 *   7.87%*
Hmean     send-128        477.10 (   0.00%)      511.57 *   7.23%*
Hmean     send-256        945.53 (   0.00%)      982.50 *   3.91%*
Hmean     send-1024      3655.74 (   0.00%)     3846.98 *   5.23%*
Hmean     send-2048      6926.84 (   0.00%)     7247.04 *   4.62%*
Hmean     send-3312     10767.47 (   0.00%)    10976.73 (   1.94%)
Hmean     send-4096     12821.77 (   0.00%)    13506.03 *   5.34%*
Hmean     send-8192     22037.72 (   0.00%)    22275.29 (   1.08%)
Hmean     send-16384    35935.31 (   0.00%)    34737.63 *  -3.33%*
Hmean     recv-64         238.52 (   0.00%)      257.28 *   7.87%*
Hmean     recv-128        477.10 (   0.00%)      511.57 *   7.23%*
Hmean     recv-256        945.45 (   0.00%)      982.50 *   3.92%*
Hmean     recv-1024      3655.74 (   0.00%)     3846.98 *   5.23%*
Hmean     recv-2048      6926.84 (   0.00%)     7246.51 *   4.62%*
Hmean     recv-3312     10767.47 (   0.00%)    10975.93 (   1.94%)
Hmean     recv-4096     12821.76 (   0.00%)    13506.02 *   5.34%*
Hmean     recv-8192     22037.71 (   0.00%)    22274.55 (   1.07%)
Hmean     recv-16384    35934.82 (   0.00%)    34737.50 *  -3.33%*

netperf-tcp
                             5.7.0-rc7              5.7.0-rc7
                            with-clamp          without-clamp
Min       64        2004.71 (   0.00%)     2033.23 (   1.42%)
Min       128       3657.58 (   0.00%)     3733.35 (   2.07%)
Min       256       6063.25 (   0.00%)     6105.67 (   0.70%)
Min       1024     18152.50 (   0.00%)    18487.00 (   1.84%)
Min       2048     28544.54 (   0.00%)    29218.11 (   2.36%)
Min       3312     33962.06 (   0.00%)    36094.97 (   6.28%)
Min       4096     36234.82 (   0.00%)    38223.60 (   5.49%)
Min       8192     42324.06 (   0.00%)    43328.72 (   2.37%)
Min       16384    44323.33 (   0.00%)    45315.21 (   2.24%)
Hmean     64        2018.36 (   0.00%)     2038.53 *   1.00%*
Hmean     128       3700.12 (   0.00%)     3758.20 *   1.57%*
Hmean     256       6236.14 (   0.00%)     6212.77 (  -0.37%)
Hmean     1024     18214.97 (   0.00%)    18601.01 *   2.12%*
Hmean     2048     28749.56 (   0.00%)    29728.26 *   3.40%*
Hmean     3312     34585.50 (   0.00%)    36345.09 *   5.09%*
Hmean     4096     36777.62 (   0.00%)    38576.17 *   4.89%*
Hmean     8192     43149.08 (   0.00%)    43903.77 *   1.75%*
Hmean     16384    45478.27 (   0.00%)    46372.93 (   1.97%)

The cstate-limited config had similar results for UDP_STREAM but was
mostly indifferent for TCP_STREAM.

So for UDP_STREAM,. there is a fairly sizable difference for uclamp. There
are caveats, netperf is not 100% stable from a performance perspective on
NUMA machines. That's improved quite a bit with 5.7 but it still should
be treated with care.

When I first saw a problem, I was using ftrace looking for latencies and
uclamp appeared to crop up. As I didn't actually need uclamp and there was
no user request to support it, I simply dropped it from the master config
so it would get propogated to any distro we release with a 5.x kernel.

From a perf profile, it's not particularly obvious that uclamp is
involved so it could be in error but I doubt it. A diff of without vs
with looks like

# Event 'cycles:ppp'
#
# Baseline  Delta Abs  Shared Object             Symbol
# ........  .........  ........................  ..............................................
#
     9.59%     -2.87%  [kernel.vmlinux]          [k] poll_idle
     0.19%     +1.85%  [kernel.vmlinux]          [k] activate_task
               +1.17%  [kernel.vmlinux]          [k] dequeue_task
               +0.89%  [kernel.vmlinux]          [k] update_rq_clock.part.73
     3.88%     +0.73%  [kernel.vmlinux]          [k] try_to_wake_up
     3.17%     +0.68%  [kernel.vmlinux]          [k] __schedule
     1.16%     -0.60%  [kernel.vmlinux]          [k] __update_load_avg_cfs_rq
     2.20%     -0.54%  [kernel.vmlinux]          [k] resched_curr
     2.08%     -0.29%  [kernel.vmlinux]          [k] _raw_spin_lock_irqsave
     0.44%     -0.29%  [kernel.vmlinux]          [k] cpus_share_cache
     1.13%     +0.23%  [kernel.vmlinux]          [k] _raw_spin_lock_bh

A lot of the uclamp functions appear to be inlined so it is not be
particularly obvious from a raw profile but it shows up in the annotated
profile in activate_task and dequeue_task for example. In the case of
dequeue_task, uclamp_rq_dec_id() is extremely expensive according to the
annotated profile.

I'm afraid I did not dig into this deeply once I knew I could just disable
it even within the distribution.

-- 
Mel Gorman
SUSE Labs
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help