On Thu, Feb 14, 2019 at 10:09:44AM -0800, Linus Torvalds wrote:
On Thu, Feb 14, 2019 at 9:51 AM Linus Torvalds
[off-list ref] wrote:
The arm64 numbers scaled horribly even before, and that's because
there is too much ping-pong, and it's probably because there is no
"stickiness" to the cacheline to the core, and thus adding the extra
loop can make the ping-pong issue even worse because now there is more
of it.
Actually, if it's using the ll/sc, then I don't see why arm64 should
even change. It doesn't really even change the pattern: the initial
load of the value is just replaced with a "ll" that gets a non-zero
value, and then we re-try without even doing the "sc" part.
So our cmpxchg() has a prefetch-with-intent-to-modify instruction before the
'll' part, which will attempt to grab the line unique the first time round.
The 'll' also has acquire semantics, so there's the chance for the
micro-architecture to handle that badly too.
I think that the problem with the proposed change is that whenever a
reader tries to acquire an rwsem that is already held for read, it will
always fail the first cmpxchg(), so in this situation the read path is
considerably slower than before.
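
To make that cost concrete, here is a minimal user-space sketch in C11
atomics. It is not the kernel's rwsem code: the names READER_BIAS,
read_lock_load_first() and read_lock_optimistic() are hypothetical, writer
and waiter bits are ignored, and each function simply returns how many
compare-and-swap attempts it needed. The "optimistic" variant assumes the
proposed change guesses a count of zero for its first cmpxchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical: each reader adds one unit to the count. */
#define READER_BIAS 1L

/*
 * Load the current count first, then cmpxchg from the observed value.
 * Returns the number of compare-and-swap attempts made.
 */
static int read_lock_load_first(atomic_long *count)
{
    int attempts = 0;
    long old;

    assert(count != NULL);
    old = atomic_load_explicit(count, memory_order_relaxed);
    do {
        attempts++;
        /* On failure, 'old' is refreshed with the current value. */
    } while (!atomic_compare_exchange_strong_explicit(
                 count, &old, old + READER_BIAS,
                 memory_order_acquire, memory_order_relaxed));
    return attempts;
}

/*
 * Skip the initial load and guess that the lock is free (count == 0).
 * If readers already hold the lock, the first attempt must fail.
 */
static int read_lock_optimistic(atomic_long *count)
{
    int attempts = 0;
    long old = 0; /* assumed "unlocked" value */

    assert(count != NULL);
    do {
        attempts++;
    } while (!atomic_compare_exchange_strong_explicit(
                 count, &old, old + READER_BIAS,
                 memory_order_acquire, memory_order_relaxed));
    return attempts;
}
```

Without contention from other threads, and with one reader already holding
the lock, read_lock_load_first() succeeds on its first attempt while
read_lock_optimistic() always needs two: one failed compare-and-swap to
learn the real count, then one that succeeds.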
End result: exact same "load once, then do ll/sc to update". Just
using a slightly different instruction pattern.
But maybe "ll" does something different to the cacheline than a regular "ld"?
Alternatively, the machine you used is using LSE, and the "swp" thing
has some horrid behavior when it fails.
Depending on where the data is, the LSE instructions may execute outside of
the CPU (e.g. in a cache controller) and so could add latency to a failing
CAS.
Will