Re: [PATCH v2 0/4] btrfs: sysfs: set / query btrfs stripe size
From: Stefan Roesch <hidden>
Date: 2021-10-29 03:11:55
On 10/28/21 8:00 AM, Johannes Thumshirn wrote:
> On 28/10/2021 16:27, Josef Bacik wrote:
>> On Thu, Oct 28, 2021 at 09:43:51AM -0400, Josef Bacik wrote:
>>> On Wed, Oct 27, 2021 at 01:14:37PM -0700, Stefan Roesch wrote:
>>>> Motivation:
>>>> The btrfs allocator is currently not ideal for all workloads. It
>>>> tends to suffer from overallocating data block groups and
>>>> underallocating metadata block groups. This results in filesystems
>>>> becoming read-only even though there is plenty of "free" space.
>>>> This is naturally confusing and distressing to users.
>>>>
>>>> Patches:
>>>> 1) Store the stripe and chunk size in the btrfs_space_info structure
>>>> 2) Add a sysfs entry to expose the above information
>>>> 3) Add a sysfs entry to force a space allocation
>>>> 4) Increase the default size of the metadata chunk allocation to 5GB
>>>>    for volumes greater than 50GB.
>>>>
>>>> Testing:
>>>> A new test is being added to the xfstest suite. For reference the
>>>> corresponding patch has the title:
>>>>     [PATCH] btrfs: Test chunk allocation with different sizes
>>>>
>>>> In addition also manual testing has been performed.
>>>> - Run xfstests with the changes and the new test. It does not
>>>>   show new diffs.
>>>> - Test with storage devices 10G, 20G, 30G, 50G, 60G
>>>>   - Default allocation
>>>>   - Increase of chunk size
>>>>   - If the stripe size is > the free space, it allocates
>>>>     free space - 1MB. The 1MB is left as free space.
>>>>   - If the device has a storage size > 50G, it uses a 5GB
>>>>     chunk size for new allocations.
>>>>
>>>> Stefan Roesch (4):
>>>>   btrfs: store stripe size and chunk size in space-info struct.
>>>>   btrfs: expose stripe and chunk size in sysfs.
>>>>   btrfs: add force_chunk_alloc sysfs entry to force allocation
>>>>   btrfs: increase metadata alloc size to 5GB for volumes > 50GB
>>>
>>> Sorry, I had this thought previously but it got lost when I started
>>> doing the actual code review.
>>>
>>> We have conflated stripe size and chunk size here, and unfortunately
>>> "stripe size" means different things to different people. What you
>>> are actually trying to do here is to allow us to allocate a larger
>>> logical chunk size.
>>>
>>> In terms of how this works out in the code you are changing the
>>> correct thing, generally the stripe_size is what dictates the actual
>>> block group chunk size we end up with at the end.
>>>
>>> But this is sort of confusing when it comes to the interface, because
>>> people are going to think it means something different. Instead we
>>> should name the sysfs file chunk_size, and then keep the code you
>>> have the way it is, just with the new name. That way it's clear to
>>> the user that they're changing how large of a chunk we're allocating
>>> at any given time.
>>>
>>> Make that change, and I have a few other code comments, and then that
>>> should be good. Thanks,
>>
>> In fact I talked about this with Johannes just now. We sort of
>> conflate the two things, max_chunk_size and max_stripe_size, to get
>> the answer we want. But these aren't well named and don't really
>> behave in a way you'd expect.
>>
>> Currently, we set max_stripe_size to make sure we clamp down on any
>> dev extents we find. So if the whole disk is free we clearly don't
>> want to allocate the whole thing, so we clamp it down to
>> max_stripe_size. This, in effect, ends up being our actual chunk_size.
>> We have this max_chunk_size thing but it doesn't really do anything in
>> practice because our stripe_size is already clamped down so it'll be
>> <= max_chunk_size.
>
> We should also add an ASSERT() to verify we're really never ever going
> beyond max_chunk_size.
Do you want an ASSERT() against BTRFS_MAX_DATA_CHUNK_SIZE?

>> All this is to say we should simply set max_stripe_size =
>> max_chunk_size, but call max_chunk_size default_chunk_size, because
>> that's really what it is. So you should
>>
>> 1) Change the sysfs file to be chunk_size or something similar.
>> 2) Don't expose stripe_size via sysfs, it's just a function of
>>    chunk_size.
>> 3) Set stripe_size == chunk_size.
>> 4) Get rid of the max_chunk_size logic, it's unneeded.
>>
>> I think that's the proper way to deal with everything, if there are
>> any corners I'm missing then feel free to point them out, but I'm
>> pretty sure 1-3 are correct. Thanks,
>>
>> Josef
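
For illustration only, here is a minimal userspace sketch of the simplification proposed in the quoted text: a single chunk_size value clamps the free device extent, and an assertion documents that the result never exceeds it. This is not the kernel code. The struct space_info_sketch, the helper pick_stripe_size() and the GIB/MIB macros are hypothetical names, and the sketch assumes a single data stripe, so stripe size and chunk size coincide.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
#define GIB (1024ULL * MIB)

/* Hypothetical stand-in for the sizing field kept per space info. */
struct space_info_sketch {
	uint64_t chunk_size;	/* the one tunable that would be exposed via sysfs */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Clamp a free device extent to the configured chunk size.
 * Assumption: one data stripe, so stripe_size == chunk_size and no
 * separate max_chunk_size check is needed. The assert plays the role
 * of the ASSERT() suggested in the thread.
 */
static uint64_t pick_stripe_size(const struct space_info_sketch *si,
				 uint64_t free_extent)
{
	uint64_t stripe_size = min_u64(free_extent, si->chunk_size);

	assert(stripe_size <= si->chunk_size);
	return stripe_size;
}
```

With chunk_size set to 1 GiB, a 60 GiB free extent is clamped to 1 GiB, while a 512 MiB free extent is used as is.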