--- v5
+++ v10
@@ -2,76 +2,58 @@
Hi,
-This is the fifth iteration of the patchset to add support for
-big-core on POWER9.
+This is the tenth iteration of the patchset to add support for
+big-core on POWER9. This patchset also optimizes the task placement on
+such big-core systems.
The previous versions can be found here:
-
+v9: https://lkml.org/lkml/2018/10/1/608
+v8: https://lkml.org/lkml/2018/9/20/899
+v7: https://lkml.org/lkml/2018/8/20/52
+v6: https://lkml.org/lkml/2018/8/9/119
+v5: https://lkml.org/lkml/2018/8/6/587
v4: https://lkml.org/lkml/2018/7/24/79
v3: https://lkml.org/lkml/2018/7/6/255
v2: https://lkml.org/lkml/2018/7/3/401
v1: https://lkml.org/lkml/2018/5/11/245
Changes :
-
-v4 --> v5:
- - Patch 2 is entirely different: Instead of using CPU_FTR_ASYM_SMT
- feature, use the small core siblings at the SMT level
- sched-domain. This was suggested by Nicholas Piggin and Michael
- Ellerman.
-
- - A more detailed description follows below.
-
-v3 --> v4:
- - Build fix for powerpc-g5 : Enable CPU_FTR_ASYM_SMT only on
- CONFIG_PPC_POWERNV and CONFIG_PPC_PSERIES.
- - Fixed a minor error in the ABI description.
-
-v2 --> v3
- - Set sane values in the tg->property, tg->nr_groups inside
- parse_thread_groups before returning due to an error.
- - Define a helper function to determine whether a CPU device node
- is a big-core or not.
- - Updated the comments around the functions to describe the
- arguments passed to them.
-
-v1 --> v2
- - Added comments explaining the "ibm,thread-groups" device tree property.
- - Uses cleaner device-tree parsing functions to parse the u32 arrays.
- - Adds a sysfs file listing the small-core siblings for every CPU.
- - Enables the scheduler optimization by setting the CPU_FTR_ASYM_SMT bit
- in the cur_cpu_spec->cpu_features on detecting the presence
- of interleaved big-core.
- - Handles the corner case where there is only a single thread-group
- or when there is a single thread in a thread-group.
+v9 --> v10:
+ - Rebased it on v4.19-rc7
+ - Added a patch to report the correct shared_cpu_map for L1-caches
+ on big-core systems.
Description:
~~~~~~~~~~~~~~~~~~~~
-A pair of IBM POWER9 SMT4 cores can be fused together to form a
-big-core with 8 SMT threads. This can be discovered via the
-"ibm,thread-groups" CPU property in the device tree which will
-indicate which group of threads that share the L1 cache, translation
-cache and instruction data flow. If there are multiple such group of
-threads, then the core is a big-core. Furthermore, on POWER9 the thread-ids of
-such a big-core is obtained by interleaving the thread-ids of the
-component SMT4 cores.
-
-Eg: Threads in the pair of component SMT4 cores of an interleaved
-big-core are numbered {0,2,4,6} and {1,3,5,7} respectively.
-
- --------------------------
- | | | | |
- | 0 | 2 | 4 | 6 | Small Core0
- | | | | |
-Big Core --------------------------
- | | | | |
- | 1 | 3 | 5 | 7 | Small Core1
- | | | | |
- --------------------------
+
+An IBM POWER9 SMT8 core (big-core) consists of two small-cores, each
+of which has its own L1 cache, translation cache and instruction-data
+flow. This can be discovered via the "ibm,thread-groups" CPU property
+in the device tree. Furthermore, on POWER9 the thread-ids of such a
+big-core are obtained by interleaving the thread-ids of the two
+small-cores.
+
+Eg: In an SMT8 core with thread-ids {0,1,2,3,4,5,6,7}, the thread-ids
+of the threads in the two small-cores will be {0,2,4,6} and {1,3,5,7}
+respectively.
+
+          -------------------------
+          |        L1 Cache       |
+       ----------------------------
+       |L2|     |     |     |     |
+       |  |  0  |  2  |  4  |  6  | Small Core0
+       |C |     |     |     |     |
+Big    |a -------------------------
+Core   |c |     |     |     |     |
+       |h |  1  |  3  |  5  |  7  | Small Core1
+       |e |     |     |     |     |
+       ----------------------------
+          |        L1 Cache       |
+          -------------------------
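
The interleaving above can be sketched as follows. This is only an
illustration: it assumes the small-core index is the thread-id modulo
the number of small-cores, whereas the kernel obtains the grouping from
the "ibm,thread-groups" property. The name small_core_threads is
hypothetical.

```python
def small_core_threads(threads_per_big_core=8, nr_small_cores=2):
    """Group the interleaved thread-ids of one big-core by small-core.

    Assumption (illustrative only): thread-id modulo nr_small_cores
    selects the small-core.
    """
    groups = [[] for _ in range(nr_small_cores)]
    for tid in range(threads_per_big_core):
        groups[tid % nr_small_cores].append(tid)
    return groups


print(small_core_threads())  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```
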
On such a big-core system, when multiple tasks are scheduled to run on
the big-core, we get the best performance when the tasks are spread
-across the pair of SMT4 cores.
+across the pair of small-cores.
Eg: Suppose 4 tasks {p1, p2, p3, p4} are run on a big core, then
@@ -97,12 +79,27 @@
| | (p3)| | |
--------------------------
-In order to achieve optimal task placement, on big-core systems, we
-define the he SMT level sched-domain to consist of the threads
-belonging to the small cores. With this, the Linux Kernel
-load-balancer will ensure that the tasks are spread across all the
-component small cores in the system, thereby yielding optimum
-performance.
+Currently on the big-core systems, the sched domain hierarchy is:
+
+SMT : group of CPUs in the SMT8 core.
+DIE : groups of CPUs on the same die.
+NUMA : all the CPUs in the system.
+
+Thus the scheduler doesn't distinguish between the CPUs in the core
+that share the L1-cache and the ones that don't, resulting in
+run-to-run variance when multithreaded applications are run on an
+SMT8 core.
+
+In this patch-set, we address this by defining the sched-domain
+hierarchy on the big-core systems to be:
+
+SMT : group of CPUs sharing the L1 cache
+CACHE : group of CPUs in the SMT8 core.
+DIE : groups of CPUs on the same die.
+NUMA : all the CPUs in the system.
+
+With this, the Linux Kernel load-balancer will ensure that the tasks
+are spread across all the component small cores in the system, thereby
+yielding optimum performance.
Furthermore, this solution works correctly across all SMT modes
(8,4,2), as the interleaved thread-ids ensure that when we go to
@@ -110,156 +107,102 @@
thereby leaving an equal number of threads from the component small cores
online as illustrated below.
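
A rough sketch of why the small cores stay balanced in the lower SMT
modes. It assumes that SMT mode N keeps the N lowest-numbered threads
of the big-core online, and that thread-id parity selects the small
core; the name online_per_small_core is hypothetical.

```python
def online_per_small_core(smt_mode, threads_per_big_core=8, nr_small_cores=2):
    """Return the online thread-ids of each small-core for an SMT mode.

    Assumption (illustrative only): SMT mode N leaves thread-ids
    0..N-1 of the big-core online.
    """
    online = range(min(smt_mode, threads_per_big_core))
    return [
        [tid for tid in online if tid % nr_small_cores == core]
        for core in range(nr_small_cores)
    ]


for mode in (8, 4, 2):
    print(mode, online_per_small_core(mode))
```

In each of the three modes both small cores end up with the same
number of online threads (4, 2 and 1 respectively).
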
-With Patches: (ppc64_cpu --smt=on) : SMT domain
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- CPU0 attaching sched-domain(s):
- domain-0: span=0,2,4,6 level=SMT
- groups: 0:{ span=0 cap=294 }, 2:{ span=2 cap=294 },
- 4:{ span=4 cap=294 }, 6:{ span=6 cap=294 }
- CPU1 attaching sched-domain(s):
- domain-0: span=1,3,5,7 level=SMT
- groups: 1:{ span=1 cap=294 }, 3:{ span=3 cap=294 },
- 5:{ span=5 cap=294 }, 7:{ span=7 cap=294 }
-
- Optimal Task placement (SMT 8)
- --------------------------
- | | | | |
- | 0 | 2 | 4 | 6 | Small Core0
- | (p1)| (p2)| | |
-Big Core --------------------------
- | | | | |
- | 1 | 3 | 5 | 7 | Small Core1
- | | (p3)| | (p4) |
- --------------------------
-
-With Patches : (ppc64_cpu --smt=4) : SMT domain
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- CPU0 attaching sched-domain(s):
- domain-0: span=0,2 level=SMT
- groups: 0:{ span=0 cap=589 }, 2:{ span=2 cap=589 }
- CPU1 attaching sched-domain(s):
- domain-0: span=1,3 level=SMT
- groups: 1:{ span=1 cap=589 }, 3:{ span=3 cap=589 }
-
- Optimal Task placement (SMT 4)
- --------------------------
- | | | | |
- | 0 | 2 | 4 | 6 | Small Core0
- | (p1)| (p2)| Off | Off |
-Big Core --------------------------
- | | | | |
- | 1 | 3 | 5 | 7 | Small Core1
- | (p4)| (p3)| Off | Off |
- --------------------------
-
-With Patches : (ppc64_cpu --smt=2) : SMT domain ceases to exist.
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Optimal Task placement (SMT 2)
- --------------------------
- | (p2)| | | |
- | 0 | 2 | 4 | 6 | Small Core0
- | (p1)| Off | Off | Off |
-Big Core --------------------------
- | (p3)| | | |
- | 1 | 3 | 5 | 7 | Small Core1
- | (p4)| Off | Off | Off |
- --------------------------
-
-Thus, as an added advantage in SMT=2 mode, we will only have 2 levels
-in the sched-domain topology (DIE and NUMA).
-
-The SMT levels, without the patches are as follows.
-
-Without Patches: (ppc64_cpu --smt=on) : SMT domain
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- CPU0 attaching sched-domain(s):
- domain-0: span=0-7 level=SMT
- groups: 0:{ span=0 cap=147 }, 1:{ span=1 cap=147 },
- 2:{ span=2 cap=147 }, 3:{ span=3 cap=147 },
- 4:{ span=4 cap=147 }, 5:{ span=5 cap=147 },
- 6:{ span=6 cap=147 }, 7:{ span=7 cap=147 }
- CPU1 attaching sched-domain(s):
- domain-0: span=0-7 level=SMT
- groups: 1:{ span=1 cap=147 }, 2:{ span=2 cap=147 },
- 3:{ span=3 cap=147 }, 4:{ span=4 cap=147 },
- 5:{ span=5 cap=147 }, 6:{ span=6 cap=147 },
- 7:{ span=7 cap=147 }, 0:{ span=0 cap=147 }
-
-Without Patches: (ppc64_cpu --smt=4) : SMT domain
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- CPU0 attaching sched-domain(s):
- domain-0: span=0-3 level=SMT
- groups: 0:{ span=0 cap=294 }, 1:{ span=1 cap=294 },
- 2:{ span=2 cap=294 }, 3:{ span=3 cap=294 },
- CPU1 attaching sched-domain(s):
- domain-0: span=0-3 level=SMT
- groups: 1:{ span=1 cap=294 }, 2:{ span=2 cap=294 },
- 3:{ span=3 cap=294 }, 0:{ span=0 cap=294 }
-
-Without Patches: (ppc64_cpu --smt=2) : SMT domain
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- CPU0 attaching sched-domain(s):
- domain-0: span=0-1 level=SMT
- groups: 0:{ span=0 cap=589 }, 1:{ span=1 cap=589 },
-
- CPU1 attaching sched-domain(s):
- domain-0: span=0-1 level=SMT
- groups: 1:{ span=1 cap=589 }, 0:{ span=0 cap=589 },
-
-This patchset contains two patches which on detecting the presence of
-big-cores, defines the SMT level sched domain to correspond to the
+This patchset contains three patches which, on detecting the presence
+of big-cores, define the SMT level sched domain to correspond to the
threads of the small cores.
Patch 1: adds support to detect the presence of
-big-cores and reports the small-core siblings of each CPU X
-via the sysfs file "/sys/devices/system/cpu/cpuX/small_core_siblings".
+big-cores and parses the "ibm,thread-groups" device-tree property,
+using which it updates a per-cpu mask named cpu_smallcore_mask.
Patch 2: Defines the SMT level sched domain to correspond to the
threads of the small cores.
+Patch 3: Reports the correct shared_cpu_map for L1-caches on big-core
+systems.
+
+ Without patch 3:
+ /sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map : 000000ff
+ /sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map : 000000ff
+ /sys/devices/system/cpu/cpu1/cache/index0/shared_cpu_map : 000000ff
+ /sys/devices/system/cpu/cpu1/cache/index1/shared_cpu_map : 000000ff
+
+ With patch 3:
+ /sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map : 00000055
+ /sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map : 00000055
+ /sys/devices/system/cpu/cpu1/cache/index0/shared_cpu_map : 000000aa
+ /sys/devices/system/cpu/cpu1/cache/index1/shared_cpu_map : 000000aa
+
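
The masks reported with patch 3 can be reproduced with a small sketch.
It assumes 8 threads per big-core with two interleaved small-cores, and
it formats only a single 32-bit word, so it is valid only for the first
32 CPUs. The name shared_cpu_map below is a hypothetical helper, not
the kernel function.

```python
def shared_cpu_map(cpu, threads_per_big_core=8, nr_small_cores=2):
    """Hex mask of the CPUs sharing an L1 cache with `cpu`.

    Assumption (illustrative only): siblings are the threads of the
    same big-core whose position has the same residue modulo
    nr_small_cores.
    """
    first = cpu - cpu % threads_per_big_core  # first thread of the big-core
    mask = 0
    for sibling in range(first, first + threads_per_big_core):
        if (sibling - first) % nr_small_cores == (cpu - first) % nr_small_cores:
            mask |= 1 << sibling
    return format(mask, "08x")


print(shared_cpu_map(0))  # 00000055
print(shared_cpu_map(1))  # 000000aa
```
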
Results:
~~~~~~~~~~~~~~~~~
+1) 2 thread ebizzy
+~~~~~~~~~~~~~~~~~~~~~~
Experimental results for ebizzy with 2 threads, bound to a single big-core
-show a marked improvement with this patchset over the 4.18-rc5 vanilla
+show a marked improvement with this patchset over the 4.19.0-rc7 vanilla
kernel.
-The result of 100 such runs for 4.18-rc7 kernel and the 4.18-rc7 +
-big-core-smt-patches are as follows
-
-4.18.0-rc7 vanilla
+The results of 100 such runs for the 4.19.0-rc7 vanilla kernel and
+for 4.19.0-rc7 + big-core-patches are as follows:
+
+4.19.0-rc7 vanilla
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
records/s : # samples : Histogram
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-[ 0 - 1000000] : 0 : #
-[1000000 - 2000000] : 3 : #
-[2000000 - 3000000] : 7 : ##
-[3000000 - 4000000] : 26 : ######
-[4000000 - 5000000] : 4 : #
-[5000000 - 6000000] : 60 : #############
-
-4.18.0-rc7 + big-core-smt-patches
+[0 - 1000000] : 0 : #
+[1000000 - 2000000] : 2 : #
+[2000000 - 3000000] : 8 : ##
+[3000000 - 4000000] : 19 : ####
+[4000000 - 5000000] : 7 : ##
+[5000000 - 6000000] : 2 : #
+[6000000 - 7000000] : 62 : #############
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+4.19.0-rc7 + big-core-patches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
records/s : # samples : Histogram
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-[ 0 - 1000000] : 0 : #
+[0 - 1000000] : 0 : #
[1000000 - 2000000] : 0 : #
-[2000000 - 3000000] : 11 : ###
-[3000000 - 4000000] : 0 : #
+[2000000 - 3000000] : 4 : #
+[3000000 - 4000000] : 8 : ##
[4000000 - 5000000] : 0 : #
-[5000000 - 6000000] : 89 : ##################
-
-
-Gautham R. Shenoy (2):
+[5000000 - 6000000] : 1 : #
+[6000000 - 7000000] : 87 : ##################
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+2) Hackbench (perf bench sched pipe)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+500 iterations of hackbench were run on both the 4.19.0-rc7 vanilla
+kernel and 4.19.0-rc7 + big-core-patches. There isn't a significant
+difference between the two.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ 4.19.0-rc7 vanilla
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ N Min Max Median Avg Stddev
+ 500 4.658s 6.293s 6.076s 5.846528s 0.45096266
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ 4.19.0-rc7 + big-core-patches
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ N Min Max Median Avg Stddev
+ 500 4.543s 6.3s 5.75s 5.682208s 0.50767805
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+Gautham R. Shenoy (3):
powerpc: Detect the presence of big-cores via "ibm,thread-groups"
powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores
-
- Documentation/ABI/testing/sysfs-devices-system-cpu | 8 ++
- arch/powerpc/include/asm/cputhreads.h | 22 +++
- arch/powerpc/include/asm/smp.h | 6 +
- arch/powerpc/kernel/setup-common.c | 154 +++++++++++++++++++++
- arch/powerpc/kernel/smp.c | 55 +++++++-
- arch/powerpc/kernel/sysfs.c | 35 +++++
- 6 files changed, 276 insertions(+), 4 deletions(-)
+ powerpc/cacheinfo: Report the correct shared_cpu_map on big-cores
+
+ arch/powerpc/include/asm/cputhreads.h | 2 +
+ arch/powerpc/include/asm/smp.h | 11 ++
+ arch/powerpc/kernel/cacheinfo.c | 37 +++++-
+ arch/powerpc/kernel/smp.c | 241 +++++++++++++++++++++++++++++++++-
+ 4 files changed, 288 insertions(+), 3 deletions(-)
--
1.9.4
+