Re: [PATCH] Btrfs: let super_stripesize match with sectorsize

From: Chandan Rajendra <hidden>
Date: 2016-06-16 08:24:11

On Wednesday, June 15, 2016 05:09:55 PM Liu Bo wrote:

On Wed, Jun 15, 2016 at 03:50:17PM +0530, Chandan Rajendra wrote:

quoted

On Wednesday, June 15, 2016 09:12:28 AM Chandan Rajendra wrote:

quoted

Hello Liu Bo,

We have to fix the following check in check_super() as well,

       if (btrfs_super_stripesize(sb) != 4096) {
       
                error("invalid stripesize %u",
                btrfs_super_stripesize(sb));
                goto error_out;
        
        }

i.e. btrfs_super_stripesize(sb) must be equal to
btrfs_super_sectorsize(sb).

However in btrfs-progs (mkfs.c to be precise) since we had stripesize
hardcoded to 4096, setting stripesize to the value of sectorsize in
mkfs.c will cause the following to occur when mkfs.btrfs is invoked for
devices with existing Btrfs filesystem instances,

NOTE: Assume we have changed the stripesize validation in btrfs-progs'
check_super() to,

        if (btrfs_super_stripesize(sb) != btrfs_super_sectorsize(sb)) {
        
                error("invalid stripesize %u",
                btrfs_super_stripesize(sb));
                goto error_out;
        
        }

main()

 for each device file passed as an argument,
 
   test_dev_for_mkfs()
   
     check_mounted
     
       check_mounted_where
       
         btrfs_scan_one_device
         
           btrfs_read_dev_super
           
             check_super() call will fail for existing filesystems which

have stripesize set to 4k. All existing filesystem instances will fall
into
this category.

This error value is pushed up the call stack and this causes the device
to
not get added to the fs_devices_mnt list in check_mounted_where(). Hence
we
would fail to correctly check the mount status of the multi-device btrfs
filesystems.

We can end up in the following scenario,
- /dev/loop0, /dev/loop1 and /dev/loop2 are mounted as a single

  filesystem. The filesystem was created by an older version of mkfs.btrfs
  which set stripesize to 4k.

- losetup -a

   /dev/loop0: [0030]:19477 (/root/disk-imgs/file-0.img)
   /dev/loop1: [0030]:16577 (/root/disk-imgs/file-1.img)
   /dev/loop2: [64770]:3423229 (/root/disk-imgs/file-2.img)

- /etc/mtab lists only /dev/loop0
- "losetup /dev/loop4 /root/disk-imgs/file-1.img"

   The new mkfs.btrfs invoked as 'mkfs.btrfs -f /dev/loop4' succeeds even
   though /dev/loop1 has already been mounted and has
   /root/disk-imgs/file-1.img as its backing file.

So IMHO the only solution is to have the stripesize check in check_super()
to allow both '4k' and 'sectorsize' as valid values i.e.

        if ((btrfs_super_stripesize(sb) != 4096)
	    
	    && (btrfs_super_stripesize(sb) != btrfs_super_sectorsize(sb))) {
	    
                error("invalid stripesize %u",
                btrfs_super_stripesize(sb));
                goto error_out;
        
        }

That's a good one.

But if we go back to the original point, in the kernel side,
1. in open_ctree(), root->stripesize = btrfs_super_stripesize();

2. in find_free_extent(),
	...
	search_start = ALIGN(offset, root->stripesize);
	...
3. in btrfs_alloc_tree_block(),
	...
	ret = btrfs_reserve_extent(..., &ins, ...);
	...
	buf = btrfs_init_new_buffer(trans, root, ins.objectid, level);

4. in btrfs_init_new_buffer(),
	...
	buf = btrfs_find_create_tree_block(root, bytenr);
	...

Because 'num_bytes' we pass to find_free_extent() is aligned to
sectorsize, the free space we can find is aligned to sectorsize,
which means 'offset' in '1. find_free_extent()' is aligned to sectorsize.

If our stripesize is larger than sectorsize, say 4 * sectorsize,
for data allocation it's fine while for metadata block allocations it's
not.  It is possible that when we allocate a new metadata block, we can
end up with an existing eb returned by '4. in btrfs_init_new_buffer()'.

Liu, I am sorry ... I am unable to visualize a scenario where the above
described scenario could happen. Can you please provide an example?

PS: There is something wrong around '2. in find_free_extent()',
we only do alignment on offset, but for num_bytes, we don't do that,
so we may end up with a overlap with other data extents or metadata
blocks.

So I think we can just replace this ALING with a warning since the offset
returned by searching free space tree is aligned to
block_group->full_stripe_len, which is either sectorsize or
BTRFS_STRIPE_LEN * nr_stripes (for
raid56), then we can just drop the check for stripesize everywhere.

-- 
chandan

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help