Inter-revision diff: cover letter

Comparing v5 (message) to v10 (message)

--- v5
+++ v10
@@ -2,76 +2,58 @@
 
 Hi,
 
-This is the fifth iteration of the patchset to add support for
-big-core on POWER9.
+This is the tenth iteration of the patchset to add support for
+big-core on POWER9. This patchset also optimizes the task placement on
+such big-core systems.
 
 The previous versions can be found here:
-
+v9: https://lkml.org/lkml/2018/10/1/608
+v8: https://lkml.org/lkml/2018/9/20/899
+v7: https://lkml.org/lkml/2018/8/20/52
+v6: https://lkml.org/lkml/2018/8/9/119
+v5: https://lkml.org/lkml/2018/8/6/587
 v4: https://lkml.org/lkml/2018/7/24/79
 v3: https://lkml.org/lkml/2018/7/6/255
 v2: https://lkml.org/lkml/2018/7/3/401
 v1: https://lkml.org/lkml/2018/5/11/245
 
 Changes :
-
-v4 --> v5:
-   - Patch 2 is entirely different: Instead of using CPU_FTR_ASYM_SMT
-     feature, use the small core siblings at the SMT level
-     sched-domain. This was suggested by Nicholas Piggin and Michael
-     Ellerman.
-
-   - A more detailed description follows below.
-
-v3 --> v4:
-   - Build fix for powerpc-g5 : Enable CPU_FTR_ASYM_SMT only on
-     CONFIG_PPC_POWERNV and CONFIG_PPC_PSERIES.
-   - Fixed a minor error in the ABI description.
-
-v2 --> v3
-    - Set sane values in the tg->property, tg->nr_groups inside
-    parse_thread_groups before returning due to an error.
-    - Define a helper function to determine whether a CPU device node
-      is a big-core or not.
-    - Updated the comments around the functions to describe the
-      arguments passed to them.
-
-v1 --> v2
-    - Added comments explaining the "ibm,thread-groups" device tree property.
-    - Uses cleaner device-tree parsing functions to parse the u32 arrays.
-    - Adds a sysfs file listing the small-core siblings for every CPU.
-    - Enables the scheduler optimization by setting the CPU_FTR_ASYM_SMT bit
-      in the cur_cpu_spec->cpu_features on detecting the presence
-      of interleaved big-core.
-    - Handles the corner case where there is only a single thread-group
-      or when there is a single thread in a thread-group.
+v9 --> v10:
+   - Rebased it on v4.19-rc7
+   - Added a patch to report the correct shared_cpu_map for L1-caches
+   on big-core systems.
 
 Description:
 ~~~~~~~~~~~~~~~~~~~~
-A pair of IBM POWER9 SMT4 cores can be fused together to form a
-big-core with 8 SMT threads. This can be discovered via the
-"ibm,thread-groups" CPU property in the device tree which will
-indicate which groups of threads share the L1 cache, translation
-cache and instruction data flow.  If there are multiple such groups of
-threads, then the core is a big-core. Furthermore, on POWER9 the thread-ids of
-such a big-core are obtained by interleaving the thread-ids of the
-component SMT4 cores.
-
-Eg: Threads in the pair of component SMT4 cores of an interleaved
-big-core are numbered {0,2,4,6} and {1,3,5,7} respectively.
-
-	   --------------------------
-           |     |     |     |      |
-           |  0  |  2  |  4  |  6   |   Small Core0
-           |     |     |     |      |
-Big Core   --------------------------
-           |     |     |     |      |
-           |  1  |  3  |  5  |  7   |   Small Core1
-           |     |     |     |      |
-           --------------------------
+
+IBM POWER9 SMT8 cores consist of two groups of small-cores where each
+group has its own L1 cache, translation cache and instruction-data
+flow. This can be discovered via the "ibm,thread-groups" CPU property
+in the device tree. Furthermore, on POWER9 the thread-ids of such a
+big-core are obtained by interleaving the thread-ids of the two
+small-cores.
+
+Eg: In an SMT8 core with thread ids {0,1,2,3,4,5,6,7}, the thread-ids
+of the threads in the two small-cores will be {0,2,4,6}
+and {1,3,5,7} respectively.
+
+ 	   -------------------------
+	   |  	    L1 Cache       |
+       ----------------------------------
+       |L2|     |     |     |      |
+       |  |  0  |  2  |  4  |  6   |Small Core0
+       |C |     |     |     |      |
+Big    |a --------------------------
+Core   |c |     |     |     |      |
+       |h |  1  |  3  |  5  |  7   | Small Core1
+       |e |     |     |     |      |
+       -----------------------------
+	  |  	    L1 Cache       |
+	  --------------------------
 
 On such a big-core system, when multiple tasks are scheduled to run on
 the big-core, we get the best performance when the tasks are spread
-across the pair of SMT4 cores.
+across the pair of small-cores.
 
 Eg: Suppose 4 tasks {p1, p2, p3, p4} are run on a big core, then
 
@@ -97,12 +79,27 @@
            |     | (p3)|     |      |
            --------------------------
 
-In order to achieve optimal task placement, on big-core systems, we
-define the SMT level sched-domain to consist of the threads
-belonging to the small cores. With this, the Linux Kernel
-load-balancer will ensure that the tasks are spread across all the
-component small cores in the system, thereby yielding optimum
-performance.
+Currently on the big-core systems, the sched domain hierarchy is:
+
+SMT   : group of CPUs in the SMT8 core.
+DIE   : groups of CPUs on the same die.
+NUMA  : all the CPUs in the system.
+
+Thus the scheduler doesn't distinguish between CPUs in the core that
+share the L1-cache vs the ones that don't, resulting in a run-to-run
+variance when multithreaded applications are run on an SMT8 core.
+
+In this patch-set, we address this by defining the sched-domain on the
+big-core systems to be:
+
+SMT   : group of CPUs sharing the L1 cache
+CACHE : group of CPUs in the SMT8 core.
+DIE   : groups of CPUs on the same die.
+NUMA  : all the CPUs in the system.
+
+With this, the Linux Kernel load-balancer will ensure that the tasks
+are spread across all the component small cores in the system, thereby
+yielding optimum performance.
 
 Furthermore, this solution works correctly across all SMT modes
 (8,4,2), as the interleaved thread-ids ensure that when we go to
@@ -110,156 +107,102 @@
 thereby leaving an equal number of threads from the component small cores
 online as illustrated below.
 
-With Patches: (ppc64_cpu --smt=on) : SMT domain
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- CPU0 attaching sched-domain(s):
-  domain-0: span=0,2,4,6 level=SMT
-   groups: 0:{ span=0 cap=294 }, 2:{ span=2 cap=294 },
-           4:{ span=4 cap=294 }, 6:{ span=6 cap=294 }
- CPU1 attaching sched-domain(s):
-  domain-0: span=1,3,5,7 level=SMT
-   groups: 1:{ span=1 cap=294 }, 3:{ span=3 cap=294 },
-           5:{ span=5 cap=294 }, 7:{ span=7 cap=294 }
-
-            Optimal Task placement (SMT 8)
-	   --------------------------
-           |     |     |     |      |
-           |  0  |  2  |  4  |  6   |   Small Core0
-           | (p1)| (p2)|     |      |
-Big Core   --------------------------
-           |     |     |     |      |
-           |  1  |  3  |  5  |  7   |   Small Core1
-           |     | (p3)|     | (p4) |
-           --------------------------
-
-With Patches : (ppc64_cpu --smt=4) : SMT domain
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- CPU0 attaching sched-domain(s):
-  domain-0: span=0,2 level=SMT
-   groups: 0:{ span=0 cap=589 }, 2:{ span=2 cap=589 }
- CPU1 attaching sched-domain(s):
-  domain-0: span=1,3 level=SMT
-   groups: 1:{ span=1 cap=589 }, 3:{ span=3 cap=589 }
-
-            Optimal Task placement (SMT 4)
-	   --------------------------
-           |     |     |     |      |
-           |  0  |  2  |  4  |  6   |   Small Core0
-           | (p1)| (p2)| Off | Off  |
-Big Core   --------------------------
-           |     |     |     |      |
-           |  1  |  3  |  5  |  7   |   Small Core1
-           | (p4)| (p3)| Off | Off  |
-           --------------------------
-
-With Patches : (ppc64_cpu --smt=2) : SMT domain ceases to exist.
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-            Optimal Task placement (SMT 2)
-	   --------------------------
-           | (p2)|     |     |      |
-           |  0  |  2  |  4  |  6   |   Small Core0
-           | (p1)| Off | Off | Off  |
-Big Core   --------------------------
-           | (p3)|     |     |      |
-           |  1  |  3  |  5  |  7   |   Small Core1
-           | (p4)| Off | Off | Off  |
-           --------------------------
-
-Thus, as an added advantage in SMT=2 mode, we will only have 2 levels
-in the sched-domain topology (DIE and NUMA).
-
-The SMT levels, without the patches are as follows.
-
-Without Patches: (ppc64_cpu --smt=on) : SMT domain
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- CPU0 attaching sched-domain(s):
-  domain-0: span=0-7 level=SMT
-   groups: 0:{ span=0 cap=147 }, 1:{ span=1 cap=147 },
-           2:{ span=2 cap=147 }, 3:{ span=3 cap=147 },
-           4:{ span=4 cap=147 }, 5:{ span=5 cap=147 },
-	   6:{ span=6 cap=147 }, 7:{ span=7 cap=147 }
- CPU1 attaching sched-domain(s):
-  domain-0: span=0-7 level=SMT
-   groups: 1:{ span=1 cap=147 }, 2:{ span=2 cap=147 },
-           3:{ span=3 cap=147 }, 4:{ span=4 cap=147 },
-	   5:{ span=5 cap=147 }, 6:{ span=6 cap=147 },
-	   7:{ span=7 cap=147 }, 0:{ span=0 cap=147 }
-
-Without Patches: (ppc64_cpu --smt=4) : SMT domain
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- CPU0 attaching sched-domain(s):
-  domain-0: span=0-3 level=SMT
-   groups: 0:{ span=0 cap=294 }, 1:{ span=1 cap=294 },
-           2:{ span=2 cap=294 }, 3:{ span=3 cap=294 },
- CPU1 attaching sched-domain(s):
-  domain-0: span=0-3 level=SMT
-   groups: 1:{ span=1 cap=294 }, 2:{ span=2 cap=294 },
-           3:{ span=3 cap=294 }, 0:{ span=0 cap=294 }
-
-Without Patches: (ppc64_cpu --smt=2) : SMT domain
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- CPU0 attaching sched-domain(s):
-  domain-0: span=0-1 level=SMT
-   groups: 0:{ span=0 cap=589 }, 1:{ span=1 cap=589 },
-
- CPU1 attaching sched-domain(s):
-  domain-0: span=0-1 level=SMT
-   groups: 1:{ span=1 cap=589 }, 0:{ span=0 cap=589 },
-
-This patchset contains two patches which on detecting the presence of
-big-cores, defines the SMT level sched domain to correspond to the
+This patchset contains three patches which, on detecting the presence
+of big-cores, define the SMT level sched domain to correspond to the
 threads of the small cores.
 
 Patch 1: adds support to detect the presence of
-big-cores and reports the small-core siblings of each CPU X
-via the sysfs file "/sys/devices/system/cpu/cpuX/small_core_siblings".
+big-cores and parses the "ibm,thread-groups" device-tree property,
+using which it updates a per-cpu mask named cpu_smallcore_mask.
 
 Patch 2: Defines the SMT level sched domain to correspond to the
 threads of the small cores.
 
+Patch 3: Reports the correct shared_cpu_map for L1-caches
+on big-core systems.
+
+   Without patch 3:
+       /sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map : 000000ff
+       /sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map : 000000ff
+       /sys/devices/system/cpu/cpu1/cache/index0/shared_cpu_map : 000000ff
+       /sys/devices/system/cpu/cpu1/cache/index1/shared_cpu_map : 000000ff
+
+   With patch 3:
+       /sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_map : 00000055
+       /sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_map : 00000055
+       /sys/devices/system/cpu/cpu1/cache/index0/shared_cpu_map : 000000aa
+       /sys/devices/system/cpu/cpu1/cache/index1/shared_cpu_map : 000000aa
+
 Results:
 ~~~~~~~~~~~~~~~~~
+1) 2 thread ebizzy
+~~~~~~~~~~~~~~~~~~~~~~
 Experimental results for ebizzy with 2 threads, bound to a single big-core
-show a marked improvement with this patchset over the 4.18-rc5 vanilla
+show a marked improvement with this patchset over the 4.19.0-rc7 vanilla
 kernel.
 
-The result of 100 such runs for 4.18-rc7 kernel and the 4.18-rc7 +
-big-core-smt-patches are as follows
-
-4.18.0-rc7 vanilla
+The result of 100 such runs for 4.19.0-rc7 kernel and the
+4.19.0-rc7 + big-core-patches are as follows
+
+4.19.0-rc7 vanilla
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         records/s    :  # samples  : Histogram
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-[      0 - 1000000]  :      0      : #
-[1000000 - 2000000]  :      3      : #
-[2000000 - 3000000]  :      7      : ##
-[3000000 - 4000000]  :      26     : ######
-[4000000 - 5000000]  :      4      : #
-[5000000 - 6000000]  :      60     : #############
-
-4.18.0-rc7 + big-core-smt-patches
+[0       - 1000000]  :      0      : #
+[1000000 - 2000000]  :      2      : #
+[2000000 - 3000000]  :      8      : ##
+[3000000 - 4000000]  :      19     : ####
+[4000000 - 5000000]  :      7      : ##
+[5000000 - 6000000]  :      2      : #
+[6000000 - 7000000]  :      62     : #############
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+4.19.0-rc7 + big-core-patches
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         records/s    :  # samples  : Histogram
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-[      0 - 1000000]  :      0      : #
+[0       - 1000000]  :      0      : #
 [1000000 - 2000000]  :      0      : #
-[2000000 - 3000000]  :      11     : ###
-[3000000 - 4000000]  :      0      : #
+[2000000 - 3000000]  :      4      : #
+[3000000 - 4000000]  :      8      : ##
 [4000000 - 5000000]  :      0      : #
-[5000000 - 6000000]  :      89     : ##################
-
-
-Gautham R. Shenoy (2):
+[5000000 - 6000000]  :      1      : #
+[6000000 - 7000000]  :      87     : ##################
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+2) Hackbench (perf bench sched pipe)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+500 iterations of hackbench were run on both the 4.19.0-rc7 vanilla
+kernel and 4.19.0-rc7 + big-core-patches. There isn't a significant
+difference between the two.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+			4.19.0-rc7 vanilla
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    N           Min           Max        Median           Avg        Stddev
+  500         4.658s         6.293s      6.076s      5.846528s    0.45096266
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+			4.19.0-rc7 + big-core-patches
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+    N           Min           Max        Median           Avg        Stddev
+  500         4.543s          6.3s        5.75s      5.682208s   0.50767805
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+
+Gautham R. Shenoy (3):
   powerpc: Detect the presence of big-cores via "ibm,thread-groups"
   powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores
-
- Documentation/ABI/testing/sysfs-devices-system-cpu |   8 ++
- arch/powerpc/include/asm/cputhreads.h              |  22 +++
- arch/powerpc/include/asm/smp.h                     |   6 +
- arch/powerpc/kernel/setup-common.c                 | 154 +++++++++++++++++++++
- arch/powerpc/kernel/smp.c                          |  55 +++++++-
- arch/powerpc/kernel/sysfs.c                        |  35 +++++
- 6 files changed, 276 insertions(+), 4 deletions(-)
+  powerpc/cacheinfo: Report the correct shared_cpu_map on big-cores
+
+ arch/powerpc/include/asm/cputhreads.h |   2 +
+ arch/powerpc/include/asm/smp.h        |  11 ++
+ arch/powerpc/kernel/cacheinfo.c       |  37 +++++-
+ arch/powerpc/kernel/smp.c             | 241 +++++++++++++++++++++++++++++++++-
+ 4 files changed, 288 insertions(+), 3 deletions(-)
 
 -- 
 1.9.4
+
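
Note on the masks quoted in the cover letter: the shared_cpu_map values
00000055 and 000000aa, and the claim that every SMT mode leaves an equal
number of threads online in each small core, both follow from the
interleaved thread-ids. The Python sketch below is not taken from the
patches; it only reproduces the arithmetic. The helper names
(small_core_siblings, cpu_mask, online_per_small_core) are hypothetical,
and it assumes two small cores per big core and that SMT mode N keeps
thread-ids 0..N-1 of the big core online.

```python
def small_core_siblings(cpu, threads_per_big_core=8, nr_groups=2):
    """Thread ids sharing an L1 cache with `cpu`.

    Assumption: thread ids are interleaved across `nr_groups` small
    cores, as the cover letter describes for POWER9 big-cores.
    """
    base = cpu - cpu % threads_per_big_core   # first thread of the big core
    group = (cpu - base) % nr_groups          # small core that `cpu` sits in
    return [base + t for t in range(threads_per_big_core)
            if t % nr_groups == group]


def cpu_mask(cpus, width=8):
    """Format a collection of CPU ids as a zero-padded hex bitmask."""
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return format(mask, "0{}x".format(width))


def online_per_small_core(smt_mode, nr_groups=2):
    """Online threads of big core 0, split by small core.

    Assumption: SMT mode N keeps thread ids 0..N-1 online.
    """
    return [[t for t in range(smt_mode) if t % nr_groups == g]
            for g in range(nr_groups)]


if __name__ == "__main__":
    for cpu in (0, 1):
        print(cpu, small_core_siblings(cpu), cpu_mask(small_core_siblings(cpu)))
    for mode in (8, 4, 2):
        print(mode, online_per_small_core(mode))
```

With these assumptions, cpu0 maps to threads {0,2,4,6} (mask 00000055) and
cpu1 to {1,3,5,7} (mask 000000aa), and SMT modes 8, 4 and 2 leave 4, 2 and 1
threads online in each small core respectively.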