Re: [PATCH v3 2/3] zram: fix deadlock with sysfs attribute usage and driver removal

From: Luis Chamberlain <mcgrof@kernel.org>
Date: 2021-06-22 16:32:19
Also in: lkml

On Tue, Jun 22, 2021 at 09:45:39AM +0200, Greg KH wrote:

On Mon, Jun 21, 2021 at 04:36:34PM -0700, Luis Chamberlain wrote:

quoted

When sysfs attributes use a lock also used on driver removal we can
potentially deadlock. This happens when for instance a sysfs file on
a driver is used, then at the same time we have driver removal trigger.
The driver removal code holds a lock, and then the sysfs file entry waits
for the same lock. While holding the lock the driver removal tries to
remove the sysfs entries, but these cannot be removed yet as one is
waiting for a lock. This won't complete as the lock is already held.
Likewise module removal cannot complete, and so we deadlock.

This is all about removing modules, not about "driver removal" from a
device.  Please make the subject line here more explicit.

Sure.

quoted

To fix this we just *try* to get a refcount to the module when a shared
lock is used, prior to mucking with a sysfs attribute. If this fails we
just give up right away.

We use a try method as a full lock means we'd then make our sysfs attributes
busy us out from possible module removal, and so userspace could force denying
module removal, a silly form of "DOS" against module removal. A try lock on
the module removal ensures we give priority to module removal and interacting
with sysfs attributes only comes second. Using a full lock could mean for
instance that if you don't stop poking at sysfs files you cannot remove a
module.

This deadlock was first reported with the zram driver, a sketch of how
this can happen follows:

CPU A                              CPU B
                                   whatever_store()
module_unload
  mutex_lock(foo)
                                   mutex_lock(foo)
   del_gendisk(zram->disk);
     device_del()
       device_remove_groups()

Can you duplicate this in a real-world situation?

What tools remove the zram module from the system on the fly?

A customer did run into it through a series of automated tests. I was
able to finally reproduce with the instructions given below. I
simplified it given that the series of test the customer was running
was much more complex.

quoted

In this situation whatever_store() is waiting for the mutex foo to
become unlocked, but that won't happen until module removal is complete.
But module removal won't complete until the syfs file being poked completes
which is waiting for a lock already held.

This is a generic kernel issue with sysfs files which use any lock also
used on module removal. Different generic solutions have been proposed.
One approach proposed is by directly by augmenting attributes with module
information [0]. This patch implements a solution by adding macros with
the prefix MODULE_DEVICE_ATTR_*() which accomplish the same. Until we
don't have a generic agreed upon solution for this shared between drivers,
we must implement a fix for this on each driver.

We make zram use the new MODULE_DEVICE_ATTR_*() helpers, and completely
open code the solution for class attributes as there are only a few of
those.

This issue can be reproduced easily on the zram driver as follows:

Loop 1 on one terminal:

while true;
	do modprobe zram;
	modprobe -r zram;
done

Loop 2 on a second terminal:
while true; do
	echo 1024 >  /sys/block/zram0/disksize;
	echo 1 > /sys/block/zram0/reset;
done

As fun as this is, it's not a real workload, please do not pretend that
it is.

Whoever said that it was? This is just a way to reproduce an issue which
was reported.

And your code is still racy, see below.  You just made the window even
smaller, which you still should be objecting to as you somehow feel this
is a valid usecase :)

quoted

@@ -2048,13 +2048,19 @@ static ssize_t hot_add_show(struct class *class,
 {
 	int ret;
 
+	if (!try_module_get(THIS_MODULE))
+		return -ENODEV;
+

You can not increment/decrement your own module's reference count and
expect it to work properly, as it is still a race.

The goal here is to prevent an rmmod call if this succeeds. If it
succeeds then any subsequent rmmod will fail. Can you explain how
this is still racy?

  Luis

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help