Re: [PATCH v9 0/5] Add NUMA-awareness to qspinlock | linux-arm-kernel

quoted

Hi, Lihao.

On Jan 22, 2020, at 6:45 AM, Lihao Liang [off-list ref] wrote:

Hi Alex,

On Wed, Jan 22, 2020 at 10:28 AM Alex Kogan [off-list ref] wrote:
Summary
-------

Lock throughput can be increased by handing a lock to a waiter on the
same NUMA node as the lock holder, provided care is taken to avoid
starvation of waiters on other NUMA nodes. This patch introduces CNA
(compact NUMA-aware lock) as the slow path for qspinlock. It is
enabled through a configuration option (NUMA_AWARE_SPINLOCKS).
Thanks for your patches. The experimental results look promising!

I understand that the new CNA qspinlock uses randomization to achieve
long-term fairness, and provides the numa_spinlock_threshold parameter
for users to tune.
This has been the case in the first versions of the series, but is not true anymore.
That is, the long-term fairness is achieved deterministically (and you are correct 
that it is done through the numa_spinlock_threshold parameter).

As Linux runs extremely diverse workloads, it is not
clear how randomization affects its fairness, and how users with
different requirements are supposed to tune this parameter.

To this end, Will and I consider it beneficial to be able to answer the
following question:

With different values of numa_spinlock_threshold and
SHUFFLE_REDUCTION_PROB_ARG, how long do threads running on different
sockets have to wait to acquire the lock?
The SHUFFLE_REDUCTION_PROB_ARG parameter is intended for performance
optimization only, and *does not* affect the long-term fairness (or, at the 
very least, does not make it any worse). As Longman correctly pointed out in 
his response to this email, the shuffle reduction optimization is relevant only
when the secondary queue is empty. In that case, CNA hands-off the lock
exactly as MCS does, i.e., in the FIFO order. Note that when the secondary
queue is not empty, we do not call probably().

This is particularly relevant
in high contention situations when new threads keep arriving on the same
socket as the lock holder.
In this case, the lock will stay on the same NUMA node/socket for 
2^numa_spinlock_threshold times, which is the worst case scenario if we 
consider the long-term fairness. And if we have multiple nodes, it will take 
up to 2^numa_spinlock_threshold X (nr_nodes - 1) + nr_cpus_per_node
lock transitions until any given thread will acquire the lock
(assuming 2^numa_spinlock_threshold > nr_cpus_per_node).

Hopefully, it addresses your concern. Let me know if you have any further 
questions.

Best regards,
— Alex

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help