Thread (92 messages) 92 messages, 7 authors, 2021-11-03

Re: [PATCH v8 09/12] sysfs: fix deadlock race with module removal

From: Luis Chamberlain <mcgrof@kernel.org>
Date: 2021-10-12 21:18:44
Also in: linux-block, linux-fsdevel, linux-kselftest, lkml

On Tue, Oct 12, 2021 at 08:20:46AM +0800, Ming Lei wrote:
On Mon, Oct 11, 2021 at 02:25:46PM -0700, Luis Chamberlain wrote:
quoted
On Tue, Oct 05, 2021 at 05:24:18PM +0800, Ming Lei wrote:
quoted
On Mon, Sep 27, 2021 at 09:38:02AM -0700, Luis Chamberlain wrote:
quoted
When driver sysfs attributes use a lock also used on module removal we
can race to deadlock. This happens when for instance a sysfs file on
a driver is used, then at the same time we have module removal call
trigger. The module removal call code holds a lock, and then the
driver's sysfs file entry waits for the same lock. While holding the
lock the module removal tries to remove the sysfs entries, but these
cannot be removed yet as one is waiting for a lock. This won't complete
as the lock is already held. Likewise module removal cannot complete,
and so we deadlock.

This can now be easily reproducible with our sysfs selftest as follows:

./tools/testing/selftests/sysfs/sysfs.sh -t 0027

This uses a local driver lock. Test 0028 can also be used, that uses
the rtnl_lock():

./tools/testing/selftests/sysfs/sysfs.sh -t 0028

To fix this we extend the struct kernfs_node with a module reference
and use the try_module_get() after kernfs_get_active() is called. As
documented in the prior patch, we now know that once kernfs_get_active()
is called the module is implicitly guarded to exist and cannot be removed.
This is because the module is the one in charge of removing the same
sysfs file it created, and removal of sysfs files on module exit will wait
until they don't have any active references. By using a try_module_get()
after kernfs_get_active() we yield to let module removal trump calls to
process a sysfs operation, while also preventing module removal if a sysfs
operation is in already progress. This prevents the deadlock.

This deadlock was first reported with the zram driver, however the live
Looks not see the lock pattern you mentioned in zram driver, can you
share the related zram code?
I recommend to not look at the zram driver, instead look at the
test_sysfs driver as that abstracts the issue more clearly and uses
Looks test_sysfs isn't in linus tree, where can I find it?
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20210927-sysfs-generic-deadlock-fix
Also please
update your commit log about this wrong info if it can't be applied on
zram.
It does apply to zram, it is just that I have other fixes for zram in
my pipeline which will change the zram driver further, and so what makes
more sense is to abstract the issue into a selftest driver to
demonstrate the issue more clearly.

To reproduce the deadlock revert the patch in this thread and then run
either of these two tests as root:

./tools/testing/selftests/sysfs/sysfs.sh -w 0027
./tools/testing/selftests/sysfs/sysfs.sh -w 0028

You will need to enable the test_sysfs driver.
quoted
two different locks as an example. The point is that if on module
removal *any* lock is used which is *also* used on the sysfs file
created by the module, you can deadlock.
quoted
quoted
And this can lead to this condition:

CPU A                              CPU B
                                   foo_store()
foo_exit()
  mutex_lock(&foo)
                                   mutex_lock(&foo)
   del_gendisk(some_struct->disk);
     device_del()
       device_remove_groups()
I guess the deadlock exists if foo_exit() is called anywhere. If yes,
look the issue may not be related with removing module directly, right?
No, the reason this can deadlock is that the module exit routine will
patiently wait for the sysfs / kernfs files to be stop being used,
Can you share the code which waits for the sysfs / kernfs files to be
stop being used?
How about a call trace of the two tasks which deadlock, here is one of
running test 0027:

kdevops login: [  363.875459] INFO: task sysfs.sh:1271 blocked for more
than 120 seconds.
[  363.878341]       Tainted: G            E
5.15.0-rc3-next-20210927+ #83
[  363.881218] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  363.882255] task:sysfs.sh        state:D stack:    0 pid: 1271 ppid:
1 flags:0x00000004
[  363.882894] Call Trace:
[  363.883091]  <TASK>
[  363.883259]  __schedule+0x2fd/0x990
[  363.883551]  schedule+0x43/0xe0
[  363.883800]  schedule_preempt_disabled+0x14/0x20
[  363.884160]  __mutex_lock.constprop.0+0x249/0x470
[  363.884524]  test_dev_x_store+0xa5/0xc0 [test_sysfs]
[  363.884915]  kernfs_fop_write_iter+0x177/0x220
[  363.885257]  new_sync_write+0x11c/0x1b0
[  363.885556]  vfs_write+0x20d/0x2a0
[  363.885821]  ksys_write+0x5f/0xe0
[  363.886081]  do_syscall_64+0x38/0xc0
[  363.886359]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  363.886748] RIP: 0033:0x7fee00f8bf33
[  363.887029] RSP: 002b:00007ffd372c5d18 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  363.887633] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fee00f8bf33
[  363.888217] RDX: 0000000000000003 RSI: 000055a4d14a0db0 RDI: 0000000000000001
[  363.888761] RBP: 000055a4d14a0db0 R08: 000000000000000a R09: 0000000000000002
[  363.889267] R10: 000055a4d1554ac0 R11: 0000000000000246 R12: 0000000000000003
[  363.889983] R13: 00007fee0105c6a0 R14: 0000000000000003 R15: 00007fee0105c8a0
[  363.890513]  </TASK>
[  363.890709] INFO: task modprobe:1276 blocked for more than 120 seconds.
[  363.891185]       Tainted: G            E 5.15.0-rc3-next-20210927+ #83
[  363.891781] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.892353] task:modprobe        state:D stack:    0 pid: 1276 ppid: 1 flags:0x00004000
[  363.892955] Call Trace:
[  363.893141]  <TASK>
[  363.893457]  __schedule+0x2fd/0x990
[  363.893865]  schedule+0x43/0xe0
[  363.894246]  __kernfs_remove.part.0+0x21e/0x2a0
[  363.894704]  ? do_wait_intr_irq+0xa0/0xa0
[  363.895142]  kernfs_remove_by_name_ns+0x50/0x90
[  363.895632]  remove_files+0x2b/0x60
[  363.896035]  sysfs_remove_group+0x38/0x80
[  363.896470]  sysfs_remove_groups+0x29/0x40
[  363.896912]  device_remove_attrs+0x5b/0x90
[  363.897352]  device_del+0x183/0x400
[  363.897758]  unregister_test_dev_sysfs+0x5b/0xaa [test_sysfs]
[  363.898317]  test_sysfs_exit+0x45/0xfb0 [test_sysfs]
[  363.898833]  __do_sys_delete_module+0x18d/0x2a0
[  363.899329]  ? fpregs_assert_state_consistent+0x1e/0x40
[  363.899868]  ? exit_to_user_mode_prepare+0x3a/0x180
[  363.900390]  do_syscall_64+0x38/0xc0
[  363.900810]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  363.901330] RIP: 0033:0x7f21915c57d7
[  363.901747] RSP: 002b:00007ffd90869fe8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  363.902442] RAX: ffffffffffffffda RBX: 000055ce676ffc30 RCX: 00007f21915c57d7
[  363.903104] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055ce676ffc98
[  363.903782] RBP: 000055ce676ffc30 R08: 0000000000000000 R09: 0000000000000000
[  363.904462] R10: 00007f2191638ac0 R11: 0000000000000206 R12: 000055ce676ffc98
[  363.905128] R13: 0000000000000000 R14: 0000000000000000 R15: 000055ce676ffdf0
[  363.905797]  </TASK>


And gdb:

(gdb) l *(__kernfs_remove+0x21e)
0xffffffff8139288e is in __kernfs_remove (fs/kernfs/dir.c:476).
471                     if (atomic_read(&kn->active) != KN_DEACTIVATED_BIAS)
472                             lock_contended(&kn->dep_map, _RET_IP_);
473             }
474
475             /* but everyone should wait for draining */
476             wait_event(root->deactivate_waitq,
477                        atomic_read(&kn->active) == KN_DEACTIVATED_BIAS);
478
479             if (kernfs_lockdep(kn)) {
480                     lock_acquired(&kn->dep_map, _RET_IP_);

(gdb) l *(kernfs_remove_by_name_ns+0x50)
0xffffffff813938d0 is in kernfs_remove_by_name_ns (fs/kernfs/dir.c:1534).
1529
1530            kn = kernfs_find_ns(parent, name, ns);
1531            if (kn)
1532                    __kernfs_remove(kn);
1533
1534            up_write(&kernfs_rwsem);
1535
1536            if (kn)
1537                    return 0;
1538            else

The same happens for test 0028 except instead of a mutex
lock an rtnl_lock() is used.

Would this be better for the commit log?
And why does it make a difference in case of being
called from module_exit()?
Well because that is where we remove the sysfs files. *If*
a developer happens to use a lock on a sysfs op but it is
also used on module exit, this deadlock is bound to happen.

  Luis
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help