Re: [PATCH v3 3/5] locking/qspinlock: Introduce CNA into the slow path of qspinlock | linux-arm-kernel

quoted

On 7/16/19 10:26 AM, Alex Kogan wrote:
On Jul 15, 2019, at 5:30 PM, Waiman Long [off-list ref] wrote:

On 7/15/19 3:25 PM, Alex Kogan wrote:

/*
- * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
- * size and four of them will fit nicely in one 64-byte cacheline. For
- * pvqspinlock, however, we need more space for extra data. To accommodate
- * that, we insert two more long words to pad it up to 32 bytes. IOW, only
- * two of them can fit in a cacheline in this case. That is OK as it is rare
- * to have more than 2 levels of slowpath nesting in actual use. We don't
- * want to penalize pvqspinlocks to optimize for a rare case in native
- * qspinlocks.
+ * On 64-bit architectures, the mcs_spinlock structure will be 20 bytes in
+ * size. For pvqspinlock or the NUMA-aware variant, however, we need more
+ * space for extra data. To accommodate that, we insert two more long words
+ * to pad it up to 36 bytes.
 */
The 20 bytes figure is wrong. It is actually 24 bytes for 64-bit as the
mcs_spinlock structure is 8-byte aligned. For better cacheline
alignment, I will like to keep mcs_spinlock to 16 bytes as before.
Instead, you can use encode_tail() to store the CNA node pointer in
"locked". For instance, use (encode_tail() << 1) in locked to
distinguish it from the regular locked=1 value.
I think this can work.
decode_tail() will get the actual node pointer from the encoded value.
And that would keep the size of mcs_spinlock intact.
Good idea, thanks!

BTW, maybe better change those function names to encode_node() / decode_node() then?
The names look good to me.

s
+
+static void cna_init_node(struct mcs_spinlock *node)
+{
+	struct cna_node *cn = CNA_NODE(node);
+	struct mcs_spinlock *base_node;
+	int cpuid;
+
+	BUILD_BUG_ON(sizeof(struct cna_node) > sizeof(struct qnode));
+	/* we store a pointer in the node's @locked field */
+	BUILD_BUG_ON(sizeof(uintptr_t) > sizeof_field(struct mcs_spinlock, locked));
+
+	cpuid = smp_processor_id();
+	cn->numa_node = cpu_to_node(cpuid);
+
+	base_node = this_cpu_ptr(&qnodes[0].mcs);
+	cn->encoded_tail = encode_tail(cpuid, base_node->count - 1);
+}
I think you can use an early_init call to initialize the numa_node and
encoded_tail values for all the per-cpu CNA nodes instead of doing it
every time a node is used. If it turns out that pv_qspinlock is used,
the pv_node_init() will properly re-initialize it.
Yes, this should work. Thanks.

BTW, should not we initialize `cpu` in pv_init_node() that same way?
We would also initialize cpu this way in pv_init_node. The
smp_processor_id() call is relatively cheap, but the initialization done
here is more expensive.

The only thing left
to do here is perhaps setting tail to NULL.
There is no need to initialize cna_node.tail — we never access it unless
the node is at the head of the secondary queue, and in that case we 
initialize it before placing the node at the head of that queue 
(see find_successor()).
OK.

-Longman

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help