Re: Re: [PATCH RFC 1/2] arch: Introduce ARCH_HAS_HW_XCHG_SMALL
From: Peter Zijlstra <peterz@infradead.org>
Date: 2021-07-27 11:03:24
On Tue, Jul 27, 2021 at 09:52:26AM +0800, Wang Rui wrote:
I think forward progress is guaranteed as long as all operations are
atomic (ll/sc or amo). If ll/sc runs on a fast cpu, there will be
random delays, is that okay? Otherwise, for such hardware, we can't even
implement a generic spinlock with ll/sc.
And I also think that the hardware supports normal store for
unlocking. (e.g. arch_spin_unlock)
In qspinlock, when _Q_PENDING_BITS == 1, it's available for all
hardware, because the clear_pending/clear_pending_set_locked are all
atomic operations. Isn't it?
Q: Why does live lock happen while _Q_PENDING_BITS == 8?
A: One case I found is:
* CPU A updates a sub-word of qspinlock at high frequency with a normal store.
* CPU B does xchg_tail with load + cmpxchg, and the loaded value never equals the value read by the ll inside cmpxchg.
qspinlock:
0: locked
1: pending
2: tail
CPU A                      CPU B
1:                         1:                        <-----------+
   sh  $newval, &locked       lw  $v1, &qspinlock                |
   add $newval, 1             and $t1, $v1, ~mask                |
   b   1b                     or  $t1, $t1, newval               | (live lock path)
                              ll  $v2, &qspinlock                |
                              bne $v1, $v2, 1b       ------------+
                              sc  $t1, &qspinlock
                              beq $t1, 0, 1b
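
For reference, the CPU B side of the diagram above is roughly what a
load + cmpxchg xchg_tail looks like in C. This is only a user-space model
using C11 atomics, not the kernel code: the name xchg_tail_cmpxchg and the
TAIL_MASK layout (tail in the upper 16 bits) are assumptions made for
illustration.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout for illustration: tail lives in the upper 16 bits. */
#define TAIL_MASK 0xffff0000u

/*
 * Model of xchg_tail built from a plain load plus a compare-exchange loop.
 * Every time another CPU changes any other part of the word between the
 * load and the compare-exchange, the loop has to start over.
 */
static uint32_t xchg_tail_cmpxchg(_Atomic uint32_t *lock, uint32_t tail)
{
    uint32_t val = atomic_load_explicit(lock, memory_order_relaxed);
    uint32_t new;

    do {
        /* Keep locked/pending bits, replace only the tail. */
        new = (val & ~TAIL_MASK) | tail;
        /* On failure, val is refreshed with the current lock word. */
    } while (!atomic_compare_exchange_weak_explicit(lock, &val, new,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    return val; /* previous value of the whole word */
}
```

The retry condition depends on the whole word staying unchanged, which is
what a frequent sub-word store from CPU A can defeat.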
If xchg_tail is implemented like this, at least there is no live lock on Loongson:
xchg_tail:
1:
ll $v1, &qspinlock
and $t1, $v1, ~mask
or $t1, $t1, newval
sc $t1, &qspinlock
beq $t1, 0, 1b
For hardware where ll/sc is based on cache coherency, I think sc is
likely to succeed. The ll makes the cache line exclusive to CPU B, and
the store from CPU A needs to acquire exclusivity again; the sc may be
completed before this.

This! I've been saying this for ages. All those xchg16() implementations
are broken for using cmpxchg() on LL/SC. Not because xchg16() is
fundamentally flawed.

Perhaps we should introduce:

	atomic_nand_or() and atomic_fetch_nand_or()

and implement short xchg() using those, then we can have the whole masks
setup shared.

It just means you get to implement those primitives for *all* archs :-)

Also, the _Q_PENDING_BITS==1 case can use that primitive.
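
To make the proposal concrete, here is a rough sketch of the semantics such
a primitive might have, and of a 16-bit xchg layered on top of it. The mail
names the primitives but not their arguments, so the signature
fetch_nand_or(p, mask, val) and the helper xchg16_model() are hypothetical.
Portable C has no ll/sc, so the sketch falls back to a compare-exchange
loop purely to model the result; a real LL/SC architecture would presumably
place the and/or between the ll and the sc, as in the xchg_tail loop quoted
above.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical primitive: atomically do *p = (*p & ~mask) | val and
 * return the old value. The compare-exchange loop only models the
 * semantics; it is not how an LL/SC architecture should implement it.
 */
static uint32_t fetch_nand_or(_Atomic uint32_t *p, uint32_t mask, uint32_t val)
{
    uint32_t old = atomic_load(p);

    /* On failure, old is reloaded and the new value is recomputed. */
    while (!atomic_compare_exchange_weak(p, &old, (old & ~mask) | val))
        ;
    return old;
}

/*
 * Hypothetical 16-bit exchange on a halfword inside an aligned 32-bit
 * word. shift is the bit offset of the halfword (0 or 16).
 */
static uint16_t xchg16_model(_Atomic uint32_t *word, unsigned shift,
                             uint16_t newval)
{
    uint32_t mask = (uint32_t)0xffff << shift;
    uint32_t old = fetch_nand_or(word, mask, (uint32_t)newval << shift);

    return (uint16_t)((old & mask) >> shift);
}
```

With this shape the mask and shift setup lives in one shared place, and
each architecture only has to supply the read-modify-write itself.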