Question on mutex code

From: dave@stgolabs.net (Davidlohr Bueso)
Date: 2015-03-15 01:03:31
Also in: lkml

On Sun, 2015-03-15 at 01:05 +0200, Matthias Bonne wrote:

On 03/10/15 15:03, Yann Droneaud wrote:


Hi,

On Wednesday, 04 March 2015 at 02:13 +0200, Matthias Bonne wrote:


I am trying to understand how mutexes work in the kernel, and I think
there might be a race between mutex_trylock() and mutex_unlock(). More
specifically, the race is between the functions
__mutex_trylock_slowpath and __mutex_unlock_common_slowpath (both
defined in kernel/locking/mutex.c).

Consider the following sequence of events:

[...]


The end result is that the mutex count is 0 (locked), although the
owner has just released it, and nobody else is holding the mutex. So it
can no longer be acquired by anyone.

Am I missing something that prevents the above scenario from happening?
If not, should I post a patch that fixes it to LKML? Or is it
considered too "theoretical" and cannot happen in practice?

I haven't looked at your explanations; you should have come with a
reproducible test case to demonstrate the issue (involving slowing
down one CPU?).

Anyway, such deep knowledge on the mutex implementation has to be found
on lkml.

Regards.

Thank you for your suggestions, and sorry for the long delay.

I see now that my explanation was unnecessarily complex. The problem is
this code from __mutex_trylock_slowpath():

         spin_lock_mutex(&lock->wait_lock, flags);

         prev = atomic_xchg(&lock->count, -1);
         if (likely(prev == 1)) {
                 mutex_set_owner(lock);
                 mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
         }

         /* Set it back to 0 if there are no waiters: */
         if (likely(list_empty(&lock->wait_list)))
                 atomic_set(&lock->count, 0);

         spin_unlock_mutex(&lock->wait_lock, flags);

         return prev == 1;

The above code assumes that the mutex cannot be unlocked while the
spinlock is held. However, mutex_unlock() sets the mutex count to 1
before taking the spinlock (even in the slowpath). If this happens
between the atomic_xchg() and the atomic_set() above, and the mutex has
no waiters, then the atomic_set() will set the mutex count back to 0
after it has been unlocked by mutex_unlock(), but mutex_trylock() will
still return failure. So the mutex will remain locked forever.
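
To make the interleaving concrete, here is a minimal single-threaded
replay of it, written with C11 atomics rather than kernel primitives.
The struct toy_mutex type, its waiters flag, and replay_race() are
hypothetical stand-ins, not kernel code; the sketch only shows what the
counter would end up as if the slowpath were actually reached in this
order:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for struct mutex.
 * count: 1 = unlocked, 0 = locked, -1 = locked, possible waiters. */
struct toy_mutex {
        atomic_int count;
        int waiters;    /* stand-in for !list_empty(&lock->wait_list) */
};

/* Replays the described interleaving in a single thread.
 * Stores the trylock result in *acquired, returns the final count. */
static int replay_race(int *acquired)
{
        struct toy_mutex m;
        int prev;

        atomic_init(&m.count, 0);       /* the owner holds the mutex */
        m.waiters = 0;                  /* nobody is waiting */

        /* CPU A, trylock slowpath: prev = atomic_xchg(&lock->count, -1) */
        prev = atomic_exchange(&m.count, -1);
        assert(prev == 0);              /* mutex was locked at this point */

        /* CPU B, unlock: count is set to 1 before wait_lock is taken */
        atomic_store(&m.count, 1);

        /* CPU A: "Set it back to 0 if there are no waiters" */
        if (!m.waiters)
                atomic_store(&m.count, 0);

        *acquired = (prev == 1);
        return atomic_load(&m.count);
}
```

In this replay, replay_race() reports a failed trylock and leaves the
count at 0, which is the stuck state described above.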

Good analysis, but not quite accurate, for one simple reason: mutex
trylocks _only_ use fastpaths (they just depend on a cmpxchg of the
counter to 0), so you never fall back to the slowpath you are mentioning,
and thus the race is nonexistent. Please see the arch code.
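
As a rough illustration of what a cmpxchg-only trylock looks like, here
is a minimal sketch using C11 atomics. The names toy_trylock(),
toy_unlock() and toy_demo() are hypothetical, and the sketch assumes the
fastpath is a single compare-and-exchange of the counter from 1 to 0; the
real per-architecture implementations live in the arch code and may
differ:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical convention: *count is 1 when unlocked, 0 when locked. */
static bool toy_trylock(atomic_int *count)
{
        int expected = 1;

        /* Succeeds only if the counter was 1. A failed attempt does not
         * write to the counter, so a concurrent unlock is never
         * overwritten. */
        return atomic_compare_exchange_strong(count, &expected, 0);
}

static void toy_unlock(atomic_int *count)
{
        atomic_store(count, 1);
}

/* Small usage example. */
static void toy_demo(void)
{
        atomic_int count;

        atomic_init(&count, 1);
        assert(toy_trylock(&count));    /* unlocked: acquired */
        assert(!toy_trylock(&count));   /* already held: fails */
        toy_unlock(&count);
        assert(toy_trylock(&count));    /* available again */
}
```

Because a failed compare-and-exchange leaves the counter untouched, this
form has no window between "mark as contended" and "set back to 0" in
which an unlock could be lost.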
