Re: [RFC v2 00/34] SLUB: reduce irq disabled scope and make it RT compatible
From: Vlastimil Babka <hidden>
Date: 2021-07-02 20:25:36
On 7/2/21 8:29 PM, Sebastian Andrzej Siewior wrote:
I replaced my slub changes with slub-local-lock-v2r3. I haven't seen any complaints from lockdep or so, which is good. Then I did this with RT enabled (and no debug):
Thanks for testing!
- A "time make -j32" run of allmodconfig on /dev/shm.
  Old:
  | real 20m6,217s
  | user 568m22,553s
  | sys 48m33,126s
  New:
  | real 20m9,049s
  | user 569m32,096s
  | sys 48m47,670s

  These 3 seconds here are probably in the noise range.

- perf_5.10 stat -r 10 hackbench -g200 -s 4096 -l500
  Old:
  | 464.967,20 msec task-clock # 27,220 CPUs utilized ( +- 0,16% )
  | 7.683.944 context-switches # 0,017 M/sec ( +- 0,86% )
  | 931.380 cpu-migrations # 0,002 M/sec ( +- 4,94% )
  | 219.569 page-faults # 0,472 K/sec ( +- 0,39% )
  | 1.104.727.599.918 cycles # 2,376 GHz ( +- 0,18% )
  | 941.428.898.087 stalled-cycles-frontend # 85,22% frontend cycles idle ( +- 0,24% )
  | 729.016.546.572 stalled-cycles-backend # 65,99% backend cycles idle ( +- 0,32% )
  | 340.133.571.519 instructions # 0,31 insn per cycle
  | # 2,77 stalled cycles per insn ( +- 0,12% )
  | 73.746.821.314 branches # 158,607 M/sec ( +- 0,13% )
  | 377.838.006 branch-misses # 0,51% of all branches ( +- 1,01% )
  |
  | 17,0820 +- 0,0202 seconds time elapsed ( +- 0,12% )
  New:
  | 422.865,71 msec task-clock # 4,782 CPUs utilized ( +- 0,34% )
  | 14.594.238 context-switches # 0,035 M/sec ( +- 0,43% )
  | 3.737.926 cpu-migrations # 0,009 M/sec ( +- 0,46% )
  | 218.474 page-faults # 0,517 K/sec ( +- 0,74% )
  | 940.715.812.020 cycles # 2,225 GHz ( +- 0,34% )
  | 716.593.827.820 stalled-cycles-frontend # 76,18% frontend cycles idle ( +- 0,39% )
  | 550.730.862.839 stalled-cycles-backend # 58,54% backend cycles idle ( +- 0,43% )
  | 417.274.588.907 instructions # 0,44 insn per cycle
  | # 1,72 stalled cycles per insn ( +- 0,17% )
  | 92.814.150.290 branches # 219,488 M/sec ( +- 0,17% )
  | 822.102.170 branch-misses # 0,89% of all branches ( +- 0,41% )
  |
  | 88,427 +- 0,618 seconds time elapsed ( +- 0,70% )

  So this is outside of the noise range. I'm not sure where this is coming from. My guess would be higher lock contention within the memory allocator.
The series shouldn't significantly change the memory allocator interaction, though. It seems there are fewer cycles but more time elapsed, thus more sleeping - is it locks becoming mutexes on RT?

My first guess - the last patch, the local_lock one. What would happen if you take that one out? It should still be RT-compatible. If it improves a lot, maybe that conversion to local_lock is not worth it then.

My second guess - list_lock remains a spinlock with my series, thus an RT mutex, but the current RT tree converts it to raw_spinlock. I'd hope leaving that one as a non-raw spinlock would still be much better for RT goals, even if hackbench (which is AFAIK very slab intensive) throughput regresses - hopefully not that much.
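To make the "fewer cycles, more time elapsed" observation concrete, here is a rough sketch in Python that recomputes a few ratios from the perf stat figures quoted above. It is not part of the original measurements; the dictionary keys and helper names are made up for illustration, and the decimal commas from the output are converted to decimal points.

```python
# Figures copied from the hackbench perf stat output quoted in this mail.
# The key names are illustrative, not perf's own field names.
old = {
    "task_clock_ms": 464967.20,
    "elapsed_s": 17.0820,
    "cycles": 1_104_727_599_918,
    "context_switches": 7_683_944,
}
new = {
    "task_clock_ms": 422865.71,
    "elapsed_s": 88.427,
    "cycles": 940_715_812_020,
    "context_switches": 14_594_238,
}


def cpus_utilized(run):
    """Average number of busy CPUs: CPU time divided by wall-clock time."""
    return run["task_clock_ms"] / 1000.0 / run["elapsed_s"]


def ratio(key):
    """New value relative to old value for one counter."""
    return new[key] / old[key]


summary = {
    "elapsed (new/old)": ratio("elapsed_s"),
    "cycles (new/old)": ratio("cycles"),
    "context switches (new/old)": ratio("context_switches"),
    "CPUs utilized, old": cpus_utilized(old),
    "CPUs utilized, new": cpus_utilized(new),
}

for name, value in summary.items():
    print(f"{name:28s} {value:8.3f}")
```

The cycle count goes down while wall-clock time goes up roughly fivefold and the number of busy CPUs drops sharply, which is the pattern one would expect if tasks spend more time blocked rather than running. The numbers alone do not say which lock is responsible.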
The remaining patches to upstream from the RT tree are small ones related to KConfig. The patch that restricts PREEMPT_RT to SLUB (not SLAB or SLOB) makes sense. The patch that disables CONFIG_SLUB_CPU_PARTIAL with PREEMPT_RT could perhaps be re-evaluated as the series also addresses some latency issues with percpu partial slabs.

With that series the PARTIAL slab can be indeed enabled. I have (had) a half done series where I had PARTIAL enabled and noticed a slight increase in latency so made it "default y on !RT". It wasn't dramatic but appeared to be outside of noise.

Sebastian