Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
From: Nicholas Piggin <npiggin@gmail.com>
Date: 2020-07-07 05:57:13
Also in: linux-arch, lkml, virtualization
Excerpts from Waiman Long's message of July 7, 2020 4:39 am:
> On 7/6/20 12:35 AM, Nicholas Piggin wrote:
>> v3 is updated to use __pv_queued_spin_unlock, noticed by Waiman (thank you).
>>
>> Thanks,
>> Nick
>>
>> Nicholas Piggin (6):
>>   powerpc/powernv: must include hvcall.h to get PAPR defines
>>   powerpc/pseries: move some PAPR paravirt functions to their own file
>>   powerpc: move spinlock implementation to simple_spinlock
>>   powerpc/64s: implement queued spinlocks and rwlocks
>>   powerpc/pseries: implement paravirt qspinlocks for SPLPAR
>>   powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the lock hint
>>
>>  arch/powerpc/Kconfig                          |  13 +
>>  arch/powerpc/include/asm/Kbuild               |   2 +
>>  arch/powerpc/include/asm/atomic.h             |  28 ++
>>  arch/powerpc/include/asm/paravirt.h           |  89 +++++
>>  arch/powerpc/include/asm/qspinlock.h          |  91 ++++++
>>  arch/powerpc/include/asm/qspinlock_paravirt.h |   7 +
>>  arch/powerpc/include/asm/simple_spinlock.h    | 292 +++++++++++++++++
>>  .../include/asm/simple_spinlock_types.h       |  21 ++
>>  arch/powerpc/include/asm/spinlock.h           | 308 +----------------
>>  arch/powerpc/include/asm/spinlock_types.h     |  17 +-
>>  arch/powerpc/lib/Makefile                     |   3 +
>>  arch/powerpc/lib/locks.c                      |  12 +-
>>  arch/powerpc/platforms/powernv/pci-ioda-tce.c |   1 +
>>  arch/powerpc/platforms/pseries/Kconfig        |   5 +
>>  arch/powerpc/platforms/pseries/setup.c        |   6 +-
>>  include/asm-generic/qspinlock.h               |   4 +
>>  16 files changed, 577 insertions(+), 322 deletions(-)
>>  create mode 100644 arch/powerpc/include/asm/paravirt.h
>>  create mode 100644 arch/powerpc/include/asm/qspinlock.h
>>  create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
>>  create mode 100644 arch/powerpc/include/asm/simple_spinlock.h
>>  create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h
>
> This patch looks OK to me.
Thanks for reviewing and testing.
> I ran some microbenchmarks on a powerpc system with and without the patch.
>
> On a 2-socket 160-thread SMT4 POWER9 system (not virtualized):
>
> 5.8.0-rc4
> =========
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 160, Min/Mean/Max = 77,665/90,153/106,895
> Threads = 160, Total Rate = 1,441,759 op/s; Percpu Rate = 9,011 op/s
>
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 160, Min/Mean/Max = 47,879/53,807/63,689
> Threads = 160, Total Rate = 860,192 op/s; Percpu Rate = 5,376 op/s
>
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 80, Min/Mean/Max = 242,907/319,514/463,161
> Threads = 80, Total Rate = 2,555 kop/s; Percpu Rate = 32 kop/s
>
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 80, Min/Mean/Max = 146,161/187,474/259,270
> Threads = 80, Total Rate = 1,498 kop/s; Percpu Rate = 19 kop/s
>
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 40, Min/Mean/Max = 646,639/1,000,817/1,455,205
> Threads = 40, Total Rate = 4,001 kop/s; Percpu Rate = 100 kop/s
>
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 40, Min/Mean/Max = 402,165/597,132/814,555
> Threads = 40, Total Rate = 2,388 kop/s; Percpu Rate = 60 kop/s
>
> 5.8.0-rc4-qlock+
> ================
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 160, Min/Mean/Max = 123,835/124,580/124,587
> Threads = 160, Total Rate = 1,992 kop/s; Percpu Rate = 12 kop/s
>
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 160, Min/Mean/Max = 254,210/264,714/276,784
> Threads = 160, Total Rate = 4,231 kop/s; Percpu Rate = 26 kop/s
>
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 80, Min/Mean/Max = 599,715/603,397/603,450
> Threads = 80, Total Rate = 4,825 kop/s; Percpu Rate = 60 kop/s
>
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 80, Min/Mean/Max = 492,687/525,224/567,456
> Threads = 80, Total Rate = 4,199 kop/s; Percpu Rate = 52 kop/s
>
> Running locktest with spinlock [runtime = 10s, load = 1]
> Threads = 40, Min/Mean/Max = 1,325,623/1,325,628/1,325,636
> Threads = 40, Total Rate = 5,299 kop/s; Percpu Rate = 132 kop/s
>
> Running locktest with rwlock [runtime = 10s, r% = 50%, load = 1]
> Threads = 40, Min/Mean/Max = 1,249,731/1,292,977/1,342,815
> Threads = 40, Total Rate = 5,168 kop/s; Percpu Rate = 129 kop/s
>
> On systems with a large number of CPUs, qspinlock is faster and more
> fair. With some tuning, we may be able to squeeze out more performance.
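To put the quoted numbers side by side, here is a small Python sketch that computes the throughput ratio (qlock over baseline) for each run, plus the max/min spread of per-thread spinlock counts as a rough fairness indicator. The dictionary names and the two helper functions are illustrative only and are not part of locktest. All figures are copied from the output above, with the two 160-thread 5.8.0-rc4 rates converted from op/s to kop/s so the units match.

```python
# Total rates in kop/s, copied from the locktest output quoted above.
# The two 160-thread 5.8.0-rc4 figures were reported in op/s and are
# converted here (1,441,759 op/s -> 1441.759 kop/s).
BASELINE_KOPS = {
    ("spinlock", 160): 1441.759, ("rwlock", 160): 860.192,
    ("spinlock", 80): 2555.0,    ("rwlock", 80): 1498.0,
    ("spinlock", 40): 4001.0,    ("rwlock", 40): 2388.0,
}
QLOCK_KOPS = {
    ("spinlock", 160): 1992.0, ("rwlock", 160): 4231.0,
    ("spinlock", 80): 4825.0,  ("rwlock", 80): 4199.0,
    ("spinlock", 40): 5299.0,  ("rwlock", 40): 5168.0,
}

# Per-thread (min, max) operation counts for the spinlock runs only.
BASELINE_SPIN_MIN_MAX = {160: (77665, 106895), 80: (242907, 463161),
                         40: (646639, 1455205)}
QLOCK_SPIN_MIN_MAX = {160: (123835, 124587), 80: (599715, 603450),
                      40: (1325623, 1325636)}


def speedups(before, after):
    """Return after/before for every (lock type, thread count) key."""
    return {key: after[key] / before[key] for key in before}


def spread(min_max):
    """Return max/min per thread count; 1.0 means perfectly even progress."""
    return {threads: hi / lo for threads, (lo, hi) in min_max.items()}


if __name__ == "__main__":
    for (lock, threads), ratio in sorted(speedups(BASELINE_KOPS, QLOCK_KOPS).items()):
        print(f"{lock:8s} {threads:3d} threads: {ratio:.2f}x")
    base, qlock = spread(BASELINE_SPIN_MIN_MAX), spread(QLOCK_SPIN_MIN_MAX)
    for threads in sorted(base):
        print(f"spinlock {threads:3d} threads: max/min {base[threads]:.3f} -> {qlock[threads]:.3f}")
```

On these figures every ratio comes out above 1, and the spinlock max/min spread is smaller with the patch at all three thread counts, which is consistent with the "faster and more fair" summary.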
Yes, powerpc could certainly get more performance out of the slow paths,
and then there are a few parameters to tune.

We don't have a good alternate patching for function calls yet, but that
would be something to do for native vs pv.

And then there seem to be one or two tunable parameters we could
experiment with.

The paravirt locks may need a bit more tuning. Some simple testing under
KVM shows we might be a bit slower in some cases. Whether this is fairness
or something else I'm not sure. The current simple pv spinlock code can do
a directed yield to the lock holder CPU, whereas the pv qspl here just does
a general yield. I think we might actually be able to change that to also
support directed yield. Though I'm not sure if this is actually the cause
of the slowdown yet.

Thanks,
Nick
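As an illustration of the directed versus general yield distinction mentioned above, here is a deliberately simplified Python toy model. It is an assumption-laden sketch, not how KVM or PowerVM actually schedule vCPUs and not code from this series: it assumes all vCPUs share one physical CPU in strict round-robin order, that every vCPU other than the lock holder is a spinning waiter, and that a waiter yields the moment it is scheduled. The function name and parameters are hypothetical. It only shows why a general yield may take several switches to reach a preempted lock holder, and it does not establish that this is the cause of the slowdown seen under KVM.

```python
from collections import deque


def switches_until_holder_runs(n_vcpus, holder, directed):
    """Count context switches before the preempted lock holder gets the CPU.

    Toy model (all assumptions): vCPU 0 is running and spinning on the lock,
    vCPUs 1..n_vcpus-1 wait in a round-robin run queue, and every vCPU except
    `holder` is a waiter that yields as soon as it is scheduled.
    """
    if not 0 <= holder < n_vcpus:
        raise ValueError("holder must be a valid vCPU index")

    runqueue = deque(range(1, n_vcpus))  # vCPU 0 is on the CPU now
    running = 0
    switches = 0
    while running != holder:
        if directed:
            # Directed yield: hand the time slice straight to the lock holder.
            runqueue.remove(holder)
            nxt = holder
        else:
            # General yield: whoever is next in the queue runs, holder or not.
            nxt = runqueue.popleft()
        runqueue.append(running)
        running = nxt
        switches += 1
    return switches


if __name__ == "__main__":
    for directed in (False, True):
        kind = "directed" if directed else "general"
        count = switches_until_holder_runs(8, holder=7, directed=directed)
        print(f"{kind} yield: {count} switch(es) before the holder runs")
```

In this model the general yield walks through every other waiter before the holder runs, while the directed yield reaches it in a single switch. Real hypervisors make more nuanced decisions, so the gap in practice could be much smaller or absent.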