Thread: 76 messages, 8 authors, 2021-08-23

Re: [PATCH v4 00/35] SLUB: reduce irq disabled scope and make it RT compatible

From: Mike Galbraith <hidden>
Date: 2021-08-06 05:15:36
Also in: lkml

On Thu, 2021-08-05 at 18:42 +0200, Sebastian Andrzej Siewior wrote:
> There was a throughput regression in RT compared to previous releases
> (without this series). The regression was (based on my testing) only
> visible in hackbench and was addressed by adding adaptive spinning to
> RT-mutex. With that we are almost back to what we had before :)

Numbers on my box say a throughput regression remains (silly fork bomb
scenario.. yawn), which can be recouped by either turning on all
SL[AU]B features or converting the list_lock to a raw lock.  They also
seem to be saying that if you turned on PREEMPT_RT because you care
about RT performance first and foremost (gee), you'll do neither of
those, because either will eliminate an RT performance progression.

	-Mike

numbers...

box is old i4790 desktop
perf stat -r10 hackbench -s4096 -l500
full warmup, record, repeat twice for elapsed

SLUB+SLUB_DEBUG only

begin previously reported numbers
5.14.0.g79e92006-tip-rt (5.12-rt based as before, 5.13-rt didn't yet exist)
          7,984.52 msec task-clock                #    7.565 CPUs utilized            ( +-  0.66% )
           353,566      context-switches          #   44.281 K/sec                    ( +-  2.77% )
            37,685      cpu-migrations            #    4.720 K/sec                    ( +-  6.37% )
            12,939      page-faults               #    1.620 K/sec                    ( +-  0.67% )
    29,901,079,227      cycles                    #    3.745 GHz                      ( +-  0.71% )
    14,550,797,818      instructions              #    0.49  insn per cycle           ( +-  0.47% )
     3,056,685,643      branches                  #  382.826 M/sec                    ( +-  0.51% )
         9,598,083      branch-misses             #    0.31% of all branches          ( +-  2.11% )

           1.05542 +- 0.00409 seconds time elapsed  ( +-  0.39% )
           1.05990 +- 0.00244 seconds time elapsed  ( +-  0.23% ) (repeat)
           1.05367 +- 0.00303 seconds time elapsed  ( +-  0.29% ) (repeat)

5.14.0.g79e92006-tip-rt +slub-local-lock-v2r3 -0034-mm-slub-convert-kmem_cpu_slab-protection-to-local_lock.patch
          6,899.35 msec task-clock                #    5.637 CPUs utilized            ( +-  0.53% )
           420,304      context-switches          #   60.919 K/sec                    ( +-  2.83% )
           187,130      cpu-migrations            #   27.123 K/sec                    ( +-  1.81% )
            13,206      page-faults               #    1.914 K/sec                    ( +-  0.96% )
    25,110,362,933      cycles                    #    3.640 GHz                      ( +-  0.49% )
    15,853,643,635      instructions              #    0.63  insn per cycle           ( +-  0.64% )
     3,366,261,524      branches                  #  487.910 M/sec                    ( +-  0.70% )
        14,839,618      branch-misses             #    0.44% of all branches          ( +-  2.01% )

           1.22390 +- 0.00744 seconds time elapsed  ( +-  0.61% )
           1.21813 +- 0.00907 seconds time elapsed  ( +-  0.74% ) (repeat)
           1.22097 +- 0.00952 seconds time elapsed  ( +-  0.78% ) (repeat)

repeat of above with raw list_lock
          8,072.62 msec task-clock                #    7.605 CPUs utilized            ( +-  0.49% )
           359,514      context-switches          #   44.535 K/sec                    ( +-  4.95% )
            35,285      cpu-migrations            #    4.371 K/sec                    ( +-  5.82% )
            13,503      page-faults               #    1.673 K/sec                    ( +-  0.96% )
    30,247,989,681      cycles                    #    3.747 GHz                      ( +-  0.52% )
    14,580,011,391      instructions              #    0.48  insn per cycle           ( +-  0.81% )
     3,063,743,405      branches                  #  379.523 M/sec                    ( +-  0.85% )
         8,907,160      branch-misses             #    0.29% of all branches          ( +-  3.99% )

           1.06150 +- 0.00427 seconds time elapsed  ( +-  0.40% )
           1.05041 +- 0.00176 seconds time elapsed  ( +-  0.17% ) (repeat)
           1.06086 +- 0.00237 seconds time elapsed  ( +-  0.22% ) (repeat)

5.14.0.g79e92006-rt3-tip-rt +slub-local-lock-v2r3 full set
          7,598.44 msec task-clock                #    5.813 CPUs utilized            ( +-  0.85% )
           488,161      context-switches          #   64.245 K/sec                    ( +-  4.29% )
           196,866      cpu-migrations            #   25.909 K/sec                    ( +-  1.49% )
            13,042      page-faults               #    1.716 K/sec                    ( +-  0.73% )
    27,695,116,746      cycles                    #    3.645 GHz                      ( +-  0.79% )
    18,423,934,168      instructions              #    0.67  insn per cycle           ( +-  0.88% )
     3,969,540,695      branches                  #  522.415 M/sec                    ( +-  0.92% )
        15,493,482      branch-misses             #    0.39% of all branches          ( +-  2.15% )

           1.30709 +- 0.00890 seconds time elapsed  ( +-  0.68% )
           1.3205 +- 0.0134 seconds time elapsed  ( +-  1.02% ) (repeat)
           1.3083 +- 0.0132 seconds time elapsed  ( +-  1.01% ) (repeat)
end previously reported numbers

5.14.0.gf6a71a5-rt6-tip-rt (same config, full slub set.. obviously)
          7,707.63 msec task-clock                #    5.880 CPUs utilized            ( +-  1.46% )
           562,533      context-switches          #   72.984 K/sec                    ( +-  7.46% )
           208,475      cpu-migrations            #   27.048 K/sec                    ( +-  2.26% )
            13,022      page-faults               #    1.689 K/sec                    ( +-  0.80% )
    28,025,004,779      cycles                    #    3.636 GHz                      ( +-  1.34% )
    18,487,135,489      instructions              #    0.66  insn per cycle           ( +-  1.58% )
     3,997,110,493      branches                  #  518.591 M/sec                    ( +-  1.65% )
        16,078,322      branch-misses             #    0.40% of all branches          ( +-  4.23% )

            1.3108 +- 0.0135 seconds time elapsed  ( +-  1.03% )
            1.2997 +- 0.0138 seconds time elapsed  ( +-  1.06% ) (repeat)
            1.3009 +- 0.0166 seconds time elapsed  ( +-  1.28% ) (repeat)

5.14.0.gf6a71a5-rt6-tip-rt +list_lock=raw_spinlock_t
          8,252.59 msec task-clock                #    7.584 CPUs utilized            ( +-  0.27% )
           400,991      context-switches          #   48.590 K/sec                    ( +-  6.15% )
            35,979      cpu-migrations            #    4.360 K/sec                    ( +-  5.63% )
            13,261      page-faults               #    1.607 K/sec                    ( +-  0.73% )
    30,910,310,737      cycles                    #    3.746 GHz                      ( +-  0.31% )
    16,522,383,240      instructions              #    0.53  insn per cycle           ( +-  0.92% )
     3,535,219,839      branches                  #  428.377 M/sec                    ( +-  0.96% )
        10,115,967      branch-misses             #    0.29% of all branches          ( +-  4.32% )

           1.08817 +- 0.00238 seconds time elapsed  ( +-  0.22% )
           1.08583 +- 0.00243 seconds time elapsed  ( +-  0.22% ) (repeat)
           1.09003 +- 0.00164 seconds time elapsed  ( +-  0.15% ) (repeat)

5.14.0.g251a152-rt6-master-rt (+SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED)
          8,170.48 msec task-clock                #    7.390 CPUs utilized            ( +-  0.43% )
           449,994      context-switches          #   55.076 K/sec                    ( +-  4.20% )
            55,912      cpu-migrations            #    6.843 K/sec                    ( +-  4.28% )
            13,144      page-faults               #    1.609 K/sec                    ( +-  0.53% )
    30,484,114,812      cycles                    #    3.731 GHz                      ( +-  0.44% )
    17,554,521,787      instructions              #    0.58  insn per cycle           ( +-  0.76% )
     3,751,725,852      branches                  #  459.181 M/sec                    ( +-  0.81% )
        13,421,985      branch-misses             #    0.36% of all branches          ( +-  2.40% )

           1.10563 +- 0.00382 seconds time elapsed  ( +-  0.35% )
           1.1098 +- 0.0147 seconds time elapsed  ( +-  1.32% ) (repeat)
           1.11308 +- 0.00883 seconds time elapsed  ( +-  0.79% ) (repeat)

5.14.0.gf6a71a5-rt6-tip-rt +SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED
          8,026.39 msec task-clock                #    7.320 CPUs utilized            ( +-  0.70% )
           496,579      context-switches          #   61.868 K/sec                    ( +-  6.78% )
            65,022      cpu-migrations            #    8.101 K/sec                    ( +-  8.29% )
            13,161      page-faults               #    1.640 K/sec                    ( +-  0.51% )
    29,870,954,733      cycles                    #    3.722 GHz                      ( +-  0.67% )
    17,617,522,235      instructions              #    0.59  insn per cycle           ( +-  1.36% )
     3,760,346,459      branches                  #  468.498 M/sec                    ( +-  1.45% )
        12,863,520      branch-misses             #    0.34% of all branches          ( +-  4.45% )

            1.0965 +- 0.0103 seconds time elapsed  ( +-  0.94% )
            1.08149 +- 0.00362 seconds time elapsed  ( +-  0.33% ) (repeat)
            1.10027 +- 0.00916 seconds time elapsed  ( +-  0.83% ) (repeat)
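
To put a rough size on the deltas, the elapsed times above can be boiled down to one percentage per configuration. This is only a sketch: it assumes the first block (5.14.0.g79e92006-tip-rt, without the slub series) is the reference, and the dictionary labels and the helper name `relative_slowdown` are made up for illustration. The numbers themselves are copied from the perf stat output above.

```python
from statistics import mean

# Elapsed seconds (first run plus two repeats), copied from the output above.
# The short labels are illustrative, not part of the original report.
elapsed = {
    "g79e92006 no slub series (reference)": [1.05542, 1.05990, 1.05367],
    "gf6a71a5-rt6 full slub set": [1.3108, 1.2997, 1.3009],
    "gf6a71a5-rt6 + raw list_lock": [1.08817, 1.08583, 1.09003],
    "gf6a71a5-rt6 + all SL[AU]B features": [1.0965, 1.08149, 1.10027],
}


def relative_slowdown(runs, baseline_runs):
    """Percent change of the mean elapsed time versus the baseline mean."""
    return (mean(runs) / mean(baseline_runs) - 1.0) * 100.0


baseline = elapsed["g79e92006 no slub series (reference)"]
for name, runs in elapsed.items():
    delta = relative_slowdown(runs, baseline)
    print(f"{name:38s} mean {mean(runs):.4f}s  {delta:+5.1f}%")
```

On these numbers the full slub set lands a bit over 20% behind the reference, while the raw list_lock and the all-features config both come back to within a few percent of it.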

yup, perf delta == config delta, let's have a peek at jitter

cyclictest -Smqp99& perf stat -r100 hackbench -s4096 -l500 && killall cyclictest

5.14.0.gf6a71a5-rt6-tip-rt
SLUB+SLUB_DEBUG
T: 1 ( 5903) P:99 I:1500 C:  92330 Min:      1 Act:    2 Avg:    6 Max:      19
T: 2 ( 5904) P:99 I:2000 C:  69247 Min:      1 Act:    2 Avg:    6 Max:      21
T: 3 ( 5905) P:99 I:2500 C:  55395 Min:      1 Act:    3 Avg:    6 Max:      22
T: 4 ( 5906) P:99 I:3000 C:  46163 Min:      1 Act:    4 Avg:    7 Max:      22
T: 5 ( 5907) P:99 I:3500 C:  39568 Min:      1 Act:    3 Avg:    6 Max:      23
T: 6 ( 5909) P:99 I:4000 C:  34621 Min:      1 Act:    2 Avg:    7 Max:      22
T: 7 ( 5910) P:99 I:4500 C:  30774 Min:      1 Act:    3 Avg:    7 Max:      18

SLUB+SLUB_DEBUG+list_lock=raw_spinlock_t
T: 1 ( 4044) P:99 I:1500 C:  73340 Min:      1 Act:    3 Avg:   10 Max:      28
T: 2 ( 4045) P:99 I:2000 C:  55004 Min:      1 Act:    4 Avg:   10 Max:      33
T: 3 ( 4046) P:99 I:2500 C:  44002 Min:      1 Act:    2 Avg:   10 Max:      26
T: 4 ( 4047) P:99 I:3000 C:  36668 Min:      1 Act:    3 Avg:   10 Max:      24
T: 5 ( 4048) P:99 I:3500 C:  31429 Min:      1 Act:    3 Avg:   10 Max:      27
T: 6 ( 4049) P:99 I:4000 C:  27500 Min:      1 Act:    3 Avg:   11 Max:      30
T: 7 ( 4050) P:99 I:4500 C:  24444 Min:      1 Act:    4 Avg:   11 Max:      25

SLUB+SLUB_DEBUG+SLAB_MERGE_DEFAULT,SLUB_CPU_PARTIAL,SLAB_FREELIST_RANDOM/HARDENED
T: 1 ( 4036) P:99 I:1500 C:  74039 Min:      1 Act:    3 Avg:    9 Max:      31
T: 2 ( 4037) P:99 I:2000 C:  55528 Min:      1 Act:    3 Avg:   10 Max:      29
T: 3 ( 4038) P:99 I:2500 C:  44422 Min:      1 Act:    2 Avg:   10 Max:      31
T: 4 ( 4039) P:99 I:3000 C:  37017 Min:      1 Act:    2 Avg:    9 Max:      23
T: 5 ( 4040) P:99 I:3500 C:  31729 Min:      1 Act:    3 Avg:   10 Max:      29
T: 6 ( 4041) P:99 I:4000 C:  27762 Min:      1 Act:    2 Avg:    8 Max:      26
T: 7 ( 4042) P:99 I:4500 C:  24677 Min:      1 Act:    3 Avg:    9 Max:      27
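
The three cyclictest tables can be condensed the same way. A small sketch, assuming the default cyclictest units (microseconds); the regex and the helper name `summarize` are illustrative, and the input lines are pasted verbatim from the tables above.

```python
import re
from statistics import mean

# Matches the Avg and Max columns of one cyclictest thread line.
LINE = re.compile(r"Avg:\s*(\d+)\s+Max:\s*(\d+)")


def summarize(text):
    """Return (mean of the per-thread Avg values, worst Max) for one table."""
    matches = [m for m in map(LINE.search, text.splitlines()) if m]
    avgs = [int(m.group(1)) for m in matches]
    maxes = [int(m.group(2)) for m in matches]
    return mean(avgs), max(maxes)


tables = {
    "SLUB+SLUB_DEBUG": """
T: 1 ( 5903) P:99 I:1500 C:  92330 Min:      1 Act:    2 Avg:    6 Max:      19
T: 2 ( 5904) P:99 I:2000 C:  69247 Min:      1 Act:    2 Avg:    6 Max:      21
T: 3 ( 5905) P:99 I:2500 C:  55395 Min:      1 Act:    3 Avg:    6 Max:      22
T: 4 ( 5906) P:99 I:3000 C:  46163 Min:      1 Act:    4 Avg:    7 Max:      22
T: 5 ( 5907) P:99 I:3500 C:  39568 Min:      1 Act:    3 Avg:    6 Max:      23
T: 6 ( 5909) P:99 I:4000 C:  34621 Min:      1 Act:    2 Avg:    7 Max:      22
T: 7 ( 5910) P:99 I:4500 C:  30774 Min:      1 Act:    3 Avg:    7 Max:      18
""",
    "+list_lock=raw_spinlock_t": """
T: 1 ( 4044) P:99 I:1500 C:  73340 Min:      1 Act:    3 Avg:   10 Max:      28
T: 2 ( 4045) P:99 I:2000 C:  55004 Min:      1 Act:    4 Avg:   10 Max:      33
T: 3 ( 4046) P:99 I:2500 C:  44002 Min:      1 Act:    2 Avg:   10 Max:      26
T: 4 ( 4047) P:99 I:3000 C:  36668 Min:      1 Act:    3 Avg:   10 Max:      24
T: 5 ( 4048) P:99 I:3500 C:  31429 Min:      1 Act:    3 Avg:   10 Max:      27
T: 6 ( 4049) P:99 I:4000 C:  27500 Min:      1 Act:    3 Avg:   11 Max:      30
T: 7 ( 4050) P:99 I:4500 C:  24444 Min:      1 Act:    4 Avg:   11 Max:      25
""",
    "+MERGE/CPU_PARTIAL/FREELIST_*": """
T: 1 ( 4036) P:99 I:1500 C:  74039 Min:      1 Act:    3 Avg:    9 Max:      31
T: 2 ( 4037) P:99 I:2000 C:  55528 Min:      1 Act:    3 Avg:   10 Max:      29
T: 3 ( 4038) P:99 I:2500 C:  44422 Min:      1 Act:    2 Avg:   10 Max:      31
T: 4 ( 4039) P:99 I:3000 C:  37017 Min:      1 Act:    2 Avg:    9 Max:      23
T: 5 ( 4040) P:99 I:3500 C:  31729 Min:      1 Act:    3 Avg:   10 Max:      29
T: 6 ( 4041) P:99 I:4000 C:  27762 Min:      1 Act:    2 Avg:    8 Max:      26
T: 7 ( 4042) P:99 I:4500 C:  24677 Min:      1 Act:    3 Avg:    9 Max:      27
""",
}

for name, text in tables.items():
    avg, worst = summarize(text)
    print(f"{name:32s} avg {avg:5.2f} us  max {worst:3d} us")
```

Both configurations that recover hackbench throughput show a higher average and a higher worst-case latency than plain SLUB+SLUB_DEBUG.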

conclusion: gee, PI both works and ain't free - ditto add more stuff=cycles :)



