Re: [PATCH] Btrfs: let super_stripesize match with sectorsize
From: Chandan Rajendra <hidden>
Date: 2016-06-16 08:24:11
On Wednesday, June 15, 2016 05:09:55 PM Liu Bo wrote:
On Wed, Jun 15, 2016 at 03:50:17PM +0530, Chandan Rajendra wrote:quoted
On Wednesday, June 15, 2016 09:12:28 AM Chandan Rajendra wrote:quoted
Hello Liu Bo, We have to fix the following check in check_super() as well, if (btrfs_super_stripesize(sb) != 4096) { error("invalid stripesize %u", btrfs_super_stripesize(sb)); goto error_out; } i.e. btrfs_super_stripesize(sb) must be equal to btrfs_super_sectorsize(sb). However in btrfs-progs (mkfs.c to be precise) since we had stripesize hardcoded to 4096, setting stripesize to the value of sectorsize in mkfs.c will cause the following to occur when mkfs.btrfs is invoked for devices with existing Btrfs filesystem instances, NOTE: Assume we have changed the stripesize validation in btrfs-progs' check_super() to, if (btrfs_super_stripesize(sb) != btrfs_super_sectorsize(sb)) { error("invalid stripesize %u", btrfs_super_stripesize(sb)); goto error_out; } main() for each device file passed as an argument, test_dev_for_mkfs() check_mounted check_mounted_where btrfs_scan_one_device btrfs_read_dev_super check_super() call will fail for existing filesystems which have stripesize set to 4k. All existing filesystem instances will fall into this category. This error value is pushed up the call stack and this causes the device to not get added to the fs_devices_mnt list in check_mounted_where(). Hence we would fail to correctly check the mount status of the multi-device btrfs filesystems.We can end up in the following scenario, - /dev/loop0, /dev/loop1 and /dev/loop2 are mounted as a single filesystem. The filesystem was created by an older version of mkfs.btrfs which set stripesize to 4k. - losetup -a /dev/loop0: [0030]:19477 (/root/disk-imgs/file-0.img) /dev/loop1: [0030]:16577 (/root/disk-imgs/file-1.img) /dev/loop2: [64770]:3423229 (/root/disk-imgs/file-2.img) - /etc/mtab lists only /dev/loop0 - "losetup /dev/loop4 /root/disk-imgs/file-1.img" The new mkfs.btrfs invoked as 'mkfs.btrfs -f /dev/loop4' succeeds even though /dev/loop1 has already been mounted and has /root/disk-imgs/file-1.img as its backing file. So IMHO the only solution is to have the stripesize check in check_super() to allow both '4k' and 'sectorsize' as valid values i.e. if ((btrfs_super_stripesize(sb) != 4096) && (btrfs_super_stripesize(sb) != btrfs_super_sectorsize(sb))) { error("invalid stripesize %u", btrfs_super_stripesize(sb)); goto error_out; }That's a good one. But if we go back to the original point, in the kernel side, 1. in open_ctree(), root->stripesize = btrfs_super_stripesize(); 2. in find_free_extent(), ... search_start = ALIGN(offset, root->stripesize); ... 3. in btrfs_alloc_tree_block(), ... ret = btrfs_reserve_extent(..., &ins, ...); ... buf = btrfs_init_new_buffer(trans, root, ins.objectid, level); 4. in btrfs_init_new_buffer(), ... buf = btrfs_find_create_tree_block(root, bytenr); ... Because 'num_bytes' we pass to find_free_extent() is aligned to sectorsize, the free space we can find is aligned to sectorsize, which means 'offset' in '1. find_free_extent()' is aligned to sectorsize. If our stripesize is larger than sectorsize, say 4 * sectorsize, for data allocation it's fine while for metadata block allocations it's not. It is possible that when we allocate a new metadata block, we can end up with an existing eb returned by '4. in btrfs_init_new_buffer()'.
Liu, I am sorry ... I am unable to visualize a scenario where the above described scenario could happen. Can you please provide an example?
PS: There is something wrong around '2. in find_free_extent()', we only do alignment on offset, but for num_bytes, we don't do that, so we may end up with a overlap with other data extents or metadata blocks. So I think we can just replace this ALING with a warning since the offset returned by searching free space tree is aligned to block_group->full_stripe_len, which is either sectorsize or BTRFS_STRIPE_LEN * nr_stripes (for raid56), then we can just drop the check for stripesize everywhere.
-- chandan