Re: [PATCH 0/3] Add NUMA-awareness to qspinlock | linux-arm-kernel

On Jan 31, 2019, at 4:56 AM, Peter Zijlstra [off-list ref] wrote:

On Wed, Jan 30, 2019 at 10:01:32PM -0500, Alex Kogan wrote:
Lock throughput can be increased by handing a lock to a waiter on the
same NUMA socket as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA sockets. This patch introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock.
Since you use NUMA, use the term node, not socket. The two are not
strictly related.
Got it, thanks.

CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are
organized in two queues, a main queue for threads running on the same
socket as the current lock holder, and a secondary queue for threads
running on other sockets. Threads record the ID of the socket on which
they are running in their queue nodes. At the unlock time, the lock
holder scans the main queue looking for a thread running on the same
socket. If found (call it thread T), all threads in the main queue
between the current lock holder and T are moved to the end of the
secondary queue, and the lock is passed to T. If such T is not found, the
lock is passed to the first node in the secondary queue. Finally, if the
secondary queue is empty, the lock is passed to the next thread in the
main queue.
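
The hand-off policy described above can be sketched roughly as follows. This is a single-threaded illustration only, not the code from the patch: the names `struct waiter`, `struct cna_queues` and `cna_pick_successor` are hypothetical, there are no atomics or memory barriers, and the handling of the remaining waiters when the lock goes to the secondary queue (they are chained in front of the main queue) is an assumption that the description above does not spell out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue node; names are illustrative, not taken from the patch. */
struct waiter {
    int node_id;            /* NUMA node recorded when the thread enqueued */
    struct waiter *next;
};

struct cna_queues {
    struct waiter *main_head;   /* first waiter behind the lock holder */
    struct waiter *sec_head;    /* secondary queue: skipped waiters */
    struct waiter *sec_tail;
};

/* Pick the next lock holder at unlock time; returns NULL if nobody waits. */
static struct waiter *cna_pick_successor(struct cna_queues *q, int holder_node)
{
    struct waiter *prev = NULL, *t;

    assert(q != NULL);

    /* Scan the main queue for the first waiter on the holder's node. */
    for (t = q->main_head; t && t->node_id != holder_node; t = t->next)
        prev = t;

    if (t) {
        if (prev) {
            /* Move the skipped waiters to the end of the secondary queue. */
            prev->next = NULL;
            if (q->sec_tail)
                q->sec_tail->next = q->main_head;
            else
                q->sec_head = q->main_head;
            q->sec_tail = prev;
        }
        q->main_head = t->next;
        t->next = NULL;
        return t;
    }

    if (q->sec_head) {
        /*
         * No same-node waiter: pass the lock to the first secondary waiter.
         * Assumption: the rest of the secondary queue, followed by the
         * current main-queue waiters, becomes the new main queue.
         */
        struct waiter *s = q->sec_head;

        q->sec_tail->next = q->main_head;
        q->main_head = s->next;
        q->sec_head = q->sec_tail = NULL;
        s->next = NULL;
        return s;
    }

    /* Secondary queue empty: pass the lock to the next main-queue waiter. */
    t = q->main_head;
    if (t) {
        q->main_head = t->next;
        t->next = NULL;
    }
    return t;
}
```

With a main queue of waiters on nodes 1, 0, 1 and a lock holder on node 0, the first call returns the node-0 waiter and parks the leading node-1 waiter on the secondary queue; a second call from node 0 finds no local waiter and falls back to that parked waiter.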

Full details are available at https://arxiv.org/abs/1810.05600.
Full details really should also be in the Changelog. You can skip much
of the academic bla-bla, but the Changelog should be self contained.

We have done some performance evaluation with the locktorture module
as well as with several benchmarks from the will-it-scale repo.
The following locktorture results are from an Oracle X5-4 server
(four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded
cores each). Each number represents an average (over 5 runs) of the
total number of ops (x10^7) reported at the end of each run. The stock
kernel is v4.20.0-rc4+ compiled in the default configuration.

#thr  stock  patched speedup (patched/stock)
 1   2.710   2.715  1.002
 2   3.108   3.001  0.966
 4   4.194   3.919  0.934
So low contention is actually worse. Funnily, low contention is the
majority of our locks and is _really_ important.
This can most certainly be engineered out, e.g., by caching the ID of the node on which a task is running.
We will look into that.

 8   5.309   6.894  1.299
16   6.722   9.094  1.353
32   7.314   9.885  1.352
36   7.562   9.855  1.303
72   6.696  10.358  1.547
108   6.364  10.181  1.600
142   6.179  10.178  1.647

When the kernel is compiled with lockstat enabled, CNA 
I'll ignore that, lockstat/lockdep enabled runs are not what one would
call performance relevant.
Please note that only one set of results has lockstat enabled.
The rest of the results (will-it-scale included) do not have it.

Regards,
— Alex

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
