Thread (12 messages), 4 authors, 2021-10-29

Re: [PATCH v2 0/4] btrfs: sysfs: set / query btrfs stripe size

From: Stefan Roesch <hidden>
Date: 2021-10-29 03:11:55


On 10/28/21 8:00 AM, Johannes Thumshirn wrote:
On 28/10/2021 16:27, Josef Bacik wrote:
On Thu, Oct 28, 2021 at 09:43:51AM -0400, Josef Bacik wrote:
On Wed, Oct 27, 2021 at 01:14:37PM -0700, Stefan Roesch wrote:
Motivation:
The btrfs allocator is currently not ideal for all workloads. It tends
to suffer from overallocating data block groups and underallocating
metadata block groups. This results in filesystems becoming read-only
even though there is plenty of "free" space.

This is naturally confusing and distressing to users.

Patches:
1) Store the stripe and chunk size in the btrfs_space_info structure
2) Add a sysfs entry to expose the above information
3) Add a sysfs entry to force a space allocation
4) Increase the default size of the metadata chunk allocation to 5GB
   for volumes greater than 50GB.

Testing:
  A new test is being added to the xfstests suite. For reference, the
  corresponding patch has the title:
    [PATCH] btrfs: Test chunk allocation with different sizes

  In addition, manual testing has been performed.
    - Run xfstests with the changes and the new test. It does not
      show new diffs.
    - Test with storage devices 10G, 20G, 30G, 50G, 60G
      - Default allocation
      - Increase of chunk size
      - If the stripe size is > the free space, it allocates
        free space - 1MB. The 1MB is left as free space.
      - If the device has a storage size > 50G, it uses a 5GB
        chunk size for new allocations.
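
The sizing rules in the testing notes above can be sketched as a small
standalone model. This is not the patch's code: the helper names, the
small_default parameter, and the handling of free space at or below 1MB
are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
#define GIB (1024ULL * MIB)

/*
 * Hypothetical helper: default metadata chunk size as described for
 * patch 4. Devices larger than 50G get 5G; smaller devices keep
 * whatever default the caller passes in (an assumption, since the
 * cover letter does not state it).
 */
static uint64_t default_metadata_chunk_size(uint64_t device_bytes,
                                            uint64_t small_default)
{
    return device_bytes > 50 * GIB ? 5 * GIB : small_default;
}

/*
 * Hypothetical helper: size actually allocated for a request. If the
 * request exceeds the free space, allocate free space minus 1MB and
 * leave that 1MB unallocated. Returning 0 when 1MB or less is free is
 * an assumption; the cover letter does not cover that case.
 */
static uint64_t effective_alloc_size(uint64_t requested, uint64_t free_bytes)
{
    uint64_t size = requested;

    if (requested > free_bytes)
        size = free_bytes > MIB ? free_bytes - MIB : 0;

    /* The result never exceeds what is actually free. */
    assert(size <= free_bytes);
    return size;
}
```

For example, under this model a 2G request against 1G of free space
yields 1G - 1MB, while a 1G request against 2G of free space is
satisfied in full.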

Stefan Roesch (4):
  btrfs: store stripe size and chunk size in space-info struct.
  btrfs: expose stripe and chunk size in sysfs.
  btrfs: add force_chunk_alloc sysfs entry to force allocation
  btrfs: increase metadata alloc size to 5GB for volumes > 50GB
Sorry, I had this thought previously but it got lost when I started doing the
actual code review.

We have conflated stripe size and chunk size here, and unfortunately "stripe
size" means different things to different people.  What you are actually trying
to do here is to allow us to allocate a larger logical chunk size.

In terms of how this works out in the code, you are changing the correct thing:
generally the stripe_size is what dictates the actual block group chunk size we
end up with.

But this is sort of confusing when it comes to the interface, because people are
going to think it means something different.

Instead we should name the sysfs file chunk_size, and then keep the code you
have the way it is, just with the new name.  That way it's clear to the user
that they're changing how large of a chunk we're allocating at any given time.

Make that change, and I have a few other code comments, and then that should be
good.  Thanks,
In fact I talked about this with Johannes just now.  We sort of conflate the two
things, max_chunk_size and max_stripe_size, to get the answer we want.  But
these aren't well named and don't really behave in a way you'd expect.

Currently, we set max_stripe_size to make sure we clamp down on any dev extents
we find.  So if the whole disk is free we clearly don't want to allocate the
whole thing, so we clamp it down to max_stripe_size.  This, in effect, ends up
being our actual chunk_size.  We have this max_chunk_size thing but it doesn't
really do anything in practice because our stripe_size is already clamped down
so it'll be <= max_chunk_size.
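
The clamping described above can be illustrated with a simplified
userspace model. It is a sketch, not the kernel allocator: it assumes a
single data stripe (so chunk size equals stripe size), and the struct
and function names are hypothetical, chosen only to mirror the terms
used in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the two limits discussed in the thread. */
struct alloc_limits {
    uint64_t max_stripe_size;
    uint64_t max_chunk_size;
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/*
 * Simplified model, assuming one data stripe: the free dev extent is
 * clamped to max_stripe_size first, then the result is clamped to
 * max_chunk_size.
 */
static uint64_t current_chunk_size(const struct alloc_limits *lim,
                                   uint64_t free_extent)
{
    uint64_t stripe_size = min_u64(free_extent, lim->max_stripe_size);
    uint64_t chunk_size = min_u64(stripe_size, lim->max_chunk_size);

    /*
     * Whenever max_stripe_size <= max_chunk_size, the second clamp is
     * a no-op: the stripe clamp already decided the chunk size.
     */
    assert(lim->max_stripe_size > lim->max_chunk_size ||
           chunk_size == stripe_size);
    return chunk_size;
}
```

In this model, with max_stripe_size below max_chunk_size, a fully free
disk always produces a chunk of exactly max_stripe_size, which is the
sense in which max_stripe_size acts as the real chunk size.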
We should also add an ASSERT() to verify we're really never ever going
beyond max_chunk_size.
 
Do you want an ASSERT() against BTRFS_MAX_DATA_CHUNK_SIZE?
quoted
All this is to say we should simply set max_stripe_size = max_chunk_size, but
call max_chunk_size default_chunk_size, because that's really what it is.  So
you should

1) Change the sysfs file to be chunk_size or something similar.
2) Don't expose stripe_size via sysfs, it's just a function of chunk_size.
3) Set stripe_size == chunk_size.
4) Get rid of the max_chunk_size logic, it's unneeded.
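
A minimal sketch of what steps 1-4 could look like once only one
tunable remains. Everything here is hypothetical: the struct, the
setter standing in for a write to the proposed chunk_size sysfs file,
and the use of assert() in place of the kernel's ASSERT() are
illustrative assumptions, not the eventual patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: one field replaces max_stripe_size and max_chunk_size. */
struct space_info_sketch {
    uint64_t chunk_size;
};

/* Stand-in for a write to the proposed chunk_size sysfs file. */
static void store_chunk_size(struct space_info_sketch *si, uint64_t bytes)
{
    si->chunk_size = bytes;
}

/*
 * The stripe size is just the free extent clamped to chunk_size;
 * there is no separate max_chunk_size check left to perform.
 */
static uint64_t pick_stripe_size(const struct space_info_sketch *si,
                                 uint64_t free_extent)
{
    uint64_t stripe_size = free_extent < si->chunk_size ?
                           free_extent : si->chunk_size;

    /* Userspace analogue of the ASSERT() suggested earlier in the thread. */
    assert(stripe_size <= si->chunk_size);
    return stripe_size;
}
```

With this shape, changing chunk_size is the only way to change how
large an allocation gets, which matches the intent of exposing a single
chunk_size file to the user.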

I think that's the proper way to deal with everything, if there are any corners
I'm missing then feel free to point them out, but I'm pretty sure 1-3 are
correct.  Thanks,

Josef