Re: [PATCH v2 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock | linux-arm-kernel

quoted

On 6/12/19 12:38 AM, Alex Kogan wrote:
Hi, Wei.

On Jun 11, 2019, at 12:22 AM, liwei (GF) [off-list ref] wrote:

Hi Alex,

On 2019/3/29 23:20, Alex Kogan wrote:
In CNA, spinning threads are organized in two queues, a main queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. At the unlock time,
the lock holder scans the main queue looking for a thread running on
the same node. If found (call it thread T), all threads in the main queue
between the current lock holder and T are moved to the end of the
secondary queue, and the lock is passed to T. If such T is not found, the
lock is passed to the first node in the secondary queue. Finally, if the
secondary queue is empty, the lock is passed to the next thread in the
main queue. For more details, see https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1810.05600&d=DwICbg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Hvhk3F4omdCk-GE1PTOm3Kn0A7ApWOZ2aZLTuVxFK4k&m=U7mfTbYj1r2Te2BBUUNbVrRPuTa_ujlpR4GZfUsrGTM&s=Dw4O1EniF-nde4fp6RA9ISlSMOjWuqeR9OS1G0iauj0&e=.

Note that this variant of CNA may introduce starvation by continuously
passing the lock to threads running on the same node. This issue
will be addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS), which is enabled by default if NUMA is enabled.

Signed-off-by: Alex Kogan <redacted>
Reviewed-by: Steve Sistare <redacted>
---
arch/x86/Kconfig                      |  14 +++
include/asm-generic/qspinlock_types.h |  13 +++
kernel/locking/mcs_spinlock.h         |  10 ++
kernel/locking/qspinlock.c            |  29 +++++-
kernel/locking/qspinlock_cna.h        | 173 ++++++++++++++++++++++++++++++++++
5 files changed, 236 insertions(+), 3 deletions(-)
create mode 100644 kernel/locking/qspinlock_cna.h
(SNIP)
+
+static __always_inline int get_node_index(struct mcs_spinlock *node)
+{
+	return decode_count(node->node_and_count++);
When nesting level is > 4, it won't return a index >= 4 here and the numa node number
is changed by mistake. It will go into a wrong way instead of the following branch.

	/*
	 * 4 nodes are allocated based on the assumption that there will
	 * not be nested NMIs taking spinlocks. That may not be true in
	 * some architectures even though the chance of needing more than
	 * 4 nodes will still be extremely unlikely. When that happens,
	 * we fall back to spinning on the lock directly without using
	 * any MCS node. This is not the most elegant solution, but is
	 * simple enough.
	 */
	if (unlikely(idx >= MAX_NODES)) {
		while (!queued_spin_trylock(lock))
			cpu_relax();
		goto release;
	}
Good point.
This patch does not handle count overflows gracefully.
It can be easily fixed by allocating more bits for the count — we don’t really need 30 bits for #NUMA nodes.
Actually, the default setting uses 2 bits for 4-level nesting and 14
bits for cpu numbers. That means it can support up to 16k-1 cpus. It is
a limit that is likely to be exceeded in the foreseeable future.
qspinlock also supports an additional mode with 21 bits used for cpu
numbers. That can support up to 2M-1 cpus. However, this mode will be a
little bit slower. That is why we don't want to use more than 2 bits for
nesting as I have never see more than 2 level of nesting used in my
testing. So it is highly unlikely we will ever hit more than 4 levels. I
am not saying that it is impossible, though.

Cheers,
Longman

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help