Re: [PATCH 0/3] Add NUMA-awareness to qspinlock
From: Alex Kogan <hidden>
Date: 2019-02-01 21:21:39
Also in:
linux-arch, lkml
On Jan 31, 2019, at 4:56 AM, Peter Zijlstra [off-list ref] wrote:
> On Wed, Jan 30, 2019 at 10:01:32PM -0500, Alex Kogan wrote:
>> Lock throughput can be increased by handing a lock to a waiter on the same NUMA socket as the lock holder, provided care is taken to avoid starvation of waiters on other NUMA sockets. This patch introduces CNA (compact NUMA-aware lock) as the slow path for qspinlock.
>
> Since you use NUMA, use the term node, not socket. The two are not strictly related.

Got it, thanks.

>> CNA is a NUMA-aware version of the MCS spin-lock. Spinning threads are organized in two queues, a main queue for threads running on the same socket as the current lock holder, and a secondary queue for threads running on other sockets. Threads record the ID of the socket on which they are running in their queue nodes. At the unlock time, the lock holder scans the main queue looking for a thread running on the same socket. If found (call it thread T), all threads in the main queue between the current lock holder and T are moved to the end of the secondary queue, and the lock is passed to T. If such T is not found, the lock is passed to the first node in the secondary queue. Finally, if the secondary queue is empty, the lock is passed to the next thread in the main queue. Full details are available at https://arxiv.org/abs/1810.05600.
>
> Full details really should also be in the Changelog. You can skip much of the academic bla-bla, but the Changelog should be self contained.

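The hand-off policy quoted above can be sketched in a few lines of C. This is a minimal, single-threaded userspace illustration of the successor selection only, not the code from the patch: the names `struct waiter`, `struct queues` and `pick_successor()` are hypothetical, there are no atomics or spinning, and starvation avoidance is not modeled. One detail is an assumption rather than something stated in the text above: when the lock goes to the head of the secondary queue, the rest of the secondary queue is spliced in front of the remaining main queue.

```c
#include <stddef.h>

/* Hypothetical queue node: one per spinning thread. */
struct waiter {
	int id;              /* illustrative thread ID */
	int numa_node;       /* node the thread was running on when it queued */
	struct waiter *next;
};

/* Hypothetical container for the two queues described in the text. */
struct queues {
	struct waiter *main_head; /* first waiter after the lock holder */
	struct waiter *sec_head;  /* secondary queue: skipped waiters */
	struct waiter *sec_tail;
};

/*
 * Choose the next lock holder at unlock time and update both queues.
 * Returns NULL if nobody is waiting.
 */
struct waiter *pick_successor(struct queues *q, int holder_node)
{
	struct waiter *prev = NULL, *cur;

	/* Scan the main queue for a waiter on the holder's node. */
	for (cur = q->main_head; cur; prev = cur, cur = cur->next)
		if (cur->numa_node == holder_node)
			break;

	if (cur) {
		if (prev) {
			/* Move the skipped waiters to the end of the secondary queue. */
			prev->next = NULL;
			if (q->sec_tail)
				q->sec_tail->next = q->main_head;
			else
				q->sec_head = q->main_head;
			q->sec_tail = prev;
		}
		q->main_head = cur->next;
		cur->next = NULL;
		return cur;
	}

	if (q->sec_head) {
		/*
		 * No same-node waiter: pass the lock to the first secondary
		 * waiter. Assumption: the rest of the secondary queue goes in
		 * front of the remaining main queue.
		 */
		cur = q->sec_head;
		q->sec_tail->next = q->main_head;
		q->main_head = cur->next;
		q->sec_head = q->sec_tail = NULL;
		cur->next = NULL;
		return cur;
	}

	/* Secondary queue empty: plain MCS-style hand-off to the next waiter. */
	cur = q->main_head;
	if (cur) {
		q->main_head = cur->next;
		cur->next = NULL;
	}
	return cur;
}
```

For example, with a holder on node 0 and a main queue of waiters on nodes 1, 1, 0, 1 (in that order), the first call returns the third waiter and parks the first two on the secondary queue. A second call from node 0 finds no same-node waiter left and hands the lock to the head of the secondary queue.
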
>> We have done some performance evaluation with the locktorture module as well as with several benchmarks from the will-it-scale repo. The following locktorture results are from an Oracle X5-4 server (four Intel Xeon E7-8895 v3 @ 2.60GHz sockets with 18 hyperthreaded cores each). Each number represents an average (over 5 runs) of the total number of ops (x10^7) reported at the end of each run. The stock kernel is v4.20.0-rc4+ compiled in the default configuration.
>>
>> #thr   stock  patched  speedup (patched/stock)
>>    1   2.710    2.715  1.002
>>    2   3.108    3.001  0.966
>>    4   4.194    3.919  0.934
>
> So low contention is actually worse. Funnily low contention is the majority of our locks and is _really_ important.

This can most certainly be engineered out, e.g., by caching the ID of the node on which a task is running. We will look into that.

>>    8   5.309    6.894  1.299
>>   16   6.722    9.094  1.353
>>   32   7.314    9.885  1.352
>>   36   7.562    9.855  1.303
>>   72   6.696   10.358  1.547
>>  108   6.364   10.181  1.600
>>  142   6.179   10.178  1.647
>>
>> When the kernel is compiled with lockstat enabled, CNA
>
> I'll ignore that, lockstat/lockdep enabled runs are not what one would call performance relevant.

Please note that only one set of results has lockstat enabled. The rest of the results (will-it-scale included) do not have it.

Regards,
-- Alex