Thread: 116 messages, 11 authors, 2017-07-10

Re: [PATCH v2 0/9] Remove spin_unlock_wait()

From: Paul E. McKenney <hidden>
Date: 2017-07-08 14:46:01
Also in: linux-arch, lkml, netfilter-devel

On Sat, Jul 08, 2017 at 02:30:19PM +0200, Ingo Molnar wrote:
* Paul E. McKenney [off-list ref] wrote:
On Sat, Jul 08, 2017 at 10:35:43AM +0200, Ingo Molnar wrote:
* Manfred Spraul [off-list ref] wrote:
Hi Ingo,

On 07/07/2017 10:31 AM, Ingo Molnar wrote:
There's another, probably just as significant advantage: queued_spin_unlock_wait()
is 'read-only', while spin_lock()+spin_unlock() dirties the lock cache line. On
any bigger system this should make a very measurable difference - if
spin_unlock_wait() is ever used in a performance critical code path.
At least for ipc/sem:
Dirtying the cacheline (in the slow path) allows removing an smp_mb() in the
hot path.
So for sem_lock(), I either need a primitive that dirties the cacheline or
sem_lock() must continue to use spin_lock()/spin_unlock().
Technically you could use spin_trylock()+spin_unlock() and avoid the lock acquire 
spinning on spin_unlock() and get very close to the slow path performance of a 
pure cacheline-dirtying behavior.

But adding something like spin_barrier(), which purely dirties the lock cacheline, 
would be even faster, right?
Interestingly enough, the arm64 and powerpc implementations of
spin_unlock_wait() were very close to what it sounds like you are
describing.
So could we perhaps solve all our problems by defining the generic version thusly:

void spin_unlock_wait(spinlock_t *lock)
{
	if (spin_trylock(lock))
		spin_unlock(lock);
}

... and perhaps rename it to spin_barrier() [or whatever proper name there would 
be]?
As lockdep, 0day Test Robot, Linus Torvalds, and several others let me
know in response to my original (thankfully RFC!) patch series, this needs
to disable irqs to work in the general case.  For example, if the lock
in question is an irq-disabling lock, you take an interrupt just after
a successful spin_trylock(), and that interrupt acquires the same lock,
the actuarial statistics of your kernel degrade sharply and suddenly.

What I get for sending out untested patches!  :-/
Architectures can still optimize it, to remove the small window where the lock is 
held locally - as long as the ordering is at least as strong as the generic 
version.

This would have various advantages:

 - semantics are well-defined

 - the generic implementation is already pretty well optimized (no spinning)

 - it would make it usable for the IPC performance optimization

 - architectures could still optimize it to eliminate the window where the lock is
   held locally - if such instructions are available.

Was this proposed before, or am I missing something?
It was sort of proposed...

https://marc.info/?l=linux-arch&m=149912878628355&w=2

But do we have a situation where normal usage of spin_lock() and
spin_unlock() is causing performance or scalability trouble?

(We do have at least one situation in fnic that appears to be buggy use of
spin_is_locked(), and proposing a patch for that case is on my todo list.)

							Thanx, Paul