Re: [PATCH 02/11] block: Fix race of bdev open with gendisk shutdown

From: Hou Tao <hidden>
Date: 2017-11-24 03:13:31

Hi Jan,

On 2017/11/21 0:43, Jan Kara wrote:

Hi Tao!

On Fri 17-11-17 14:51:18, Hou Tao wrote:

quoted

On 2017/3/13 23:14, Jan Kara wrote:

quoted

blkdev_open() may race with gendisk shutdown in two different ways.
Either del_gendisk() has already unhashed block device inode (and thus
bd_acquire() will end up creating new block device inode) however
gen_gendisk() will still return the gendisk that is being destroyed.
Or bdev returned by bd_acquire() will get unhashed and gendisk destroyed
before we get to get_gendisk() and get_gendisk() will return new gendisk
that got allocated for device that reused the device number.

In both cases this will result in possible inconsistencies between
bdev->bd_disk and bdev->bd_bdi (in the first case after gendisk gets
destroyed and device number reused, in the second case immediately).

quoted

Fix the problem by checking whether the gendisk is still alive and inode
hashed when associating bdev inode with it and its bdi. That way we are
sure that we will not associate bdev inode with disk that got past
blk_unregister_region() in del_gendisk() (and thus device number can get
reused). Similarly, we will not associate bdev that was once associated
with gendisk that is going away (and thus the corresponding bdev inode
will get unhashed in del_gendisk()) with a new gendisk that is just
reusing the device number.

I have run some extended tests based on Omar Sandoval's script [1] these days,
and found two new problems related with the race between bdev open and gendisk
shutdown.

The first problem lies in __blkdev_get(), and will cause use-after-free, or
worse oops. It happens because there may be two instances of gendisk related
with one bdev. When the process which owns the newer gendisk completes the
invocation of __blkdev_get() firstly, the other process which owns the older
gendisk will put the last reference of the older gendisk and cause use-after-free.

The following call sequences illustrate the problem:

unhash the bdev_inode of /dev/sda

process A (blkdev_open()):
	bd_acquire() returns new bdev_inode of /dev/sda
	get_gendisk() returns gendisk (v1 gendisk)

remove gendisk from bdev_map
device number is reused

^^ this is through blk_unregister_region() in del_gendisk(), isn't it?

Yes, both the unhash of inode and the removal of gendisk from bdev_map
happen in del_gendisk().

quoted

process B (blkdev_open()):
	bd_acquire() returns new bdev_inode of /dev/sda
	get_gendisk() returns a new gendisk (v2 gendisk)
	increase bdev->bd_openers

So this needs to be a bit more complex - bd_acquire() must happen before
bdev_unhash_inode() in del_gendisk() above happened. And get_gendisk() must
happen after blk_unregister_region() from del_gendisk() happened. But yes,
this seems possible.

quoted

process A (blkdev_open()):
	find bdev->bd_openers != 0
	put_disk(disk) // this is the last reference to v1 gendisk
	disk_unblock_events(disk) // use-after-free occurs

The problem can be fixed by an extra check showed in the following snippet:

static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
	if (!bdev->bd_openers) {
	} else {
		if (bdev->bd_disk != disk) {
			ret = -ENXIO;
			goto out_unlock_bdev;
		}
	}
}

And we also can simplify the check in your patch by moving
blk_unregister_region() before bdev_unhash_inode(), so whenever we get a
new bdev_inode, the gendisk returned by get_gendisk() will either be NULL
or a new gendisk.

I don't think we can do that. If we call blk_unregister_region() before
bdev_unhash_inode(), the device number can get reused while bdev inode is
still visible. So you can get that bdev inode v1 associated with gendisk
v2. And open following unhashing of bdev inode v1 will get bdev inode v2
associated with gendisk v2. So you have two different bdev inodes
associated with the same gendisk and that will cause data integrity issues.

Even we do not move blk_unregister_region(), the data integrity issue still
exists: bd_acquire() is invoked before bdev_unhash_inode() and get_gendisk()
is invoked after blk_unregister_region(). The move is used to simplify the
consistency check or the version check between bdev_inode and gendisk, so
the added check above will be unnecessary, because whenever bd_acquire()
returns a new bdev_inode, the gendisk returned by get_gendisk() will also
a new gendisk or NULL. Anyway, it's just another workaround, so it's OK
we do not do that.

But what you suggest above is probably a reasonable fix. Will you submit a
proper patch please?

Uh, it's also an incomplete fix to the race problem. Anyhow i will do it.

quoted

And the only check we need to do in __blkdev_get() is
checking the hash status of the bdev_inode after we get a gendisk from
get_gendisk(). The check can be done like the following snippet:

static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
{
	/* ...... */
	ret = -ENXIO;
	disk = get_gendisk(bdev->bd_dev, &partno);
	if (!disk)
		goto out;
	owner = disk->fops->owner;

	if (inode_unhashed(bdev->bd_inode))
		goto out_unmatched;
}

The second problem lies in bd_start_claiming(), and will trigger the
BUG_ON assert in blkdev_get: BUG_ON(!bd_may_claim(bdev, whole, holder)).
It occurs when testing the exclusive open and close of the disk and its
partitions.

The cause of the problem is that a bdev_inode of a disk partition
corresponds with two instances of bdev_inode of the whole disk
simultaneously. When one pair of the partition inode and disk inode
completes the claiming, the other pair will be stumbled by the BUG_ON
assert in blkdev_get() because bdev->bd_holder is no longer NULL.

The following sequences illustrate the problem:

unhash the bdev_inode of /dev/sda2

process A (blkdev_open()):
	bd_acquire() returns a new bdev_inode for /dev/sda2
	bd_start_claiming() returns bdev_inode of /dev/sda

process B (blkdev_open()):
	bd_acquire() returns the new bdev_inode for /dev/sda2

unhash the bdev_inode of /dev/sda
remove gendisk from bdev_map
device number is reused

process B (blkdev_open()):
	bd_start_claming() returns a new bdev_inode for /dev/sda

process A (blkdev_open()):
	__blkdev_get() returns successfully
	finish claiming  // bdev->bd_holder = holder

process B (blkdev_open()):
	__blkdev_get() returns successfully
	trigger BUG_ON(!bd_may_claim(bdev, whole, holder))

Ah, good spotting. Essentially the whole->bd_claiming protection for
partition bdev does not work because we've got two different 'whole' bdev
pointers.

Yes, perfect summary.

quoted

And the problem can be fixed by adding more checks in bd_start_claiming():

static struct block_device *bd_start_claiming(struct block_device *bdev,
                                              void *holder)
{
	/* ...... */
        disk = get_gendisk(bdev->bd_dev, &partno);
        if (!disk)
                return ERR_PTR(-ENXIO);

        if (inode_unhashed(bdev->bd_inode)) {
                module_put(disk->fops->owner);
                put_disk(disk);
                return ERR_PTR(-ENXIO);
        }

	/* ...... */
        if (!whole)
                return ERR_PTR(-ENOMEM);

        if (inode_unhashed(bdev->bd_inode) ||
                inode_unhashed(whole->bd_inode)) {
                bdput(whole);
                return ERR_PTR(-ENXIO);
        }
}

Except adding more and more checks, are there better solutions to the
race problem ?  I have thought about managing the device number by
ref-counter, so the device number will not be reused until the last
reference of bdev_inode and gendisk are released, and i am trying to
figure out a way to get/put a reference of the device number when get/put
the block_device.

So any suggests and thoughts ?

Yeah, these races are nasty. I will think whether there's some reasonably
easy way to fixing them or whether we'll have to go via some other route
like blocking the device number as you suggest. In either case thanks for
the report and the analysis!

Look forward to review them.

Regards,
Tao

								Honza

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help