Re: [RFC][V8][PATCH 0/5] btrfs: allocation_hint mode

From: Zygo Blaxell <hidden>
Date: 2021-12-14 20:04:34

On Tue, Dec 14, 2021 at 08:03:45PM +0100, Goffredo Baroncelli wrote:

On 12/13/21 23:49, Zygo Blaxell wrote:

On Mon, Dec 13, 2021 at 04:15:14PM -0500, Josef Bacik wrote:

On Mon, Dec 13, 2021 at 08:54:24PM +0100, Goffredo Baroncelli wrote:

Gentle ping :-)

Are any of the main developers interested in supporting this patch?

I am open to improving it if required.

Sorry I missed this go by.  I like the interface, we don't have a use for
device->type yet, so this fits nicely.

I don't see the btrfs-progs patches in my inbox, and these don't apply, so
you'll definitely need to refresh for a proper review, but looking at these
patches they seem sane enough, and I like the interface.  I'd like to hear
Zygo's opinion as well.

I've been running earlier versions with modifications since summer 2020,
and this version mostly unmodified (rebase changes only) since it was
posted.  It seems to work, even in corner cases like converting balances,
replacing drives, and running out of space.  The "running out of space"
experience on btrfs is weird at the best of times, and these patches
add some more new special cases, but it doesn't behave in ways that
would surprise a sysadmin familiar with how btrfs chunk allocation works.

One major piece that's missing is adjusting the statvfs (aka df)
available blocks field so that it doesn't include unallocated space on
any metadata-only devices.  Right now all the unallocated space on
metadata-only devices is counted as free even though it's impossible to
put a data block there, so anything that is triggered automatically
on "f_bavail < some_threshold" will be confused.

I don't think that piece has to block the rest of the patch series--if
you're not using the feature, df gives the right number (or at least the
same number it gave before), and if you are using the feature, you can
subtract the unavailable data space until a later patch comes along to
fix it.
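The manual subtraction described above could be sketched roughly as follows. This is only an illustration: the function names are made up for this example, the per-device unallocated byte counts must be obtained separately by the caller, and the plain subtraction assumes a single-copy data profile (RAID profiles would need a different correction).

```python
import os


def usable_data_bytes(f_bavail, f_frsize, metadata_only_unallocated):
    """Estimate the bytes really available for data.

    f_bavail and f_frsize are the statvfs fields of the same names.
    metadata_only_unallocated is a list of unallocated byte counts, one
    per metadata-only device (hypothetical input, supplied by the caller).
    """
    reported = f_bavail * f_frsize
    unusable = sum(metadata_only_unallocated)
    # Never report a negative amount of free space.
    return max(reported - unusable, 0)


def usable_data_bytes_for_path(path, metadata_only_unallocated):
    """Same estimate, reading f_bavail and f_frsize from a mounted path."""
    st = os.statvfs(path)
    return usable_data_bytes(st.f_bavail, st.f_frsize,
                             metadata_only_unallocated)
```

A monitoring script that triggers on "f_bavail < some_threshold" would compare its threshold against this adjusted value instead of the raw statvfs number.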

I like

	echo data_only > /sys/fs/btrfs/$uuid/devinfo/3/type

Just to be clear, for now you can only pass a numeric value to "type", not a
text string like in your example.

However, I want to put another option on the table: not to expose the whole
"type" field, but only the "allocation policy"; we can add a new sysfs field
called "allocation policy" that internally changes the dev_item->type field.

It is not only a "cosmetic" change. If we want to change the allocation
policy, the correct way for now is:
- read the type field
- change the "allocation policy" bits
- write the type field

which is race-prone.

For now it is not a problem, because type contains only the allocation bits.
But in the future, when the type field contains further properties, this could
be a problem.
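To illustrate why the read/modify/write sequence is fragile, here is a small user-space simulation of a lost update. The bit layout is purely hypothetical (two low bits for the allocation policy, one invented "future" flag); it is not taken from the patches.

```python
# Hypothetical layout, for illustration only.
ALLOCATION_POLICY_MASK = 0b011   # assumed: low two bits hold the policy
OTHER_FLAG = 0b100               # assumed: some future, unrelated property


def set_policy_bits(type_value, policy):
    """Return type_value with only the allocation policy bits replaced."""
    if policy & ~ALLOCATION_POLICY_MASK:
        raise ValueError("policy does not fit in the policy bits")
    return (type_value & ~ALLOCATION_POLICY_MASK) | policy


def lost_update_demo():
    """Two writers interleave their read/modify/write of the same field."""
    stored = 0b001                            # initial value of "type"
    a_read = stored                           # writer A reads
    b_read = stored                           # writer B reads the same value
    stored = set_policy_bits(a_read, 0b010)   # A writes a new policy
    stored = b_read | OTHER_FLAG              # B writes from its stale read
    return stored                             # A's policy change is gone
```

With a dedicated "allocation policy" attribute, the kernel would do the masking itself under its own locking, so user space would never need the stale read.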

Yeah, keep the interface very narrow, don't hand out access to random bits.

If the kernel supports additional bits, it should support additional
sysfs filenames to go with them.  Or it could put all the supported
options in the sysfs field, like block IO schedulers do, so you could
find this in the file by reading it:

	[prefer_data] prefer_metadata metadata_only data_only
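A tool reading such a file could split it into the active value and the list of supported values. A minimal sketch, assuming the bracket convention shown above (the function name is made up for this example):

```python
def parse_bracketed_options(text):
    """Parse 'a [b] c' style text into (active, all_options).

    The bracketed token is the currently selected option; active is None
    if no token is bracketed.
    """
    options = []
    active = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]   # strip the brackets
            active = token
        options.append(token)
    return active, options
```

This also lets a script discover which values the running kernel accepts before attempting a write.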

more than patching btrfs-progs so I can use

	btrfs prop set /dev/... allocation_hint data_only

but I admit that might be because I'm weird.

I prefer the echo approach too; however, it is not very ergonomic in
conjunction with sudo...

For /proc/sys/* we have the 'sysctl' tool, so you can write 'sysctl
vm.drop_caches=1' or 'sudo sysctl vm.drop_caches=1'.  For some reason
we don't have this for sysfs (or maybe it's just Debian...?) so we have
to write things like 'echo foo | sudo tee /sys/fs/...'.

Of course btrfs-progs could always open the
/sys/fs/btrfs/.../allocation_policy file and write to it.  But if we're
modifying btrfs-progs then we could use the ioctl interface anyway.
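For comparison, writing a sysfs attribute from a program is only a few lines. The sketch below is not btrfs-progs code; the path is passed in by the caller because the attribute name and location are still under discussion, and the helper name is invented for this example.

```python
import os


def write_sysfs_attribute(path, value):
    """Write value plus a newline to an existing attribute file.

    The data goes out in a single write() call, which is what
    'echo value > path' does as well.
    """
    data = (value + "\n").encode("ascii")
    fd = os.open(path, os.O_WRONLY | os.O_TRUNC)
    try:
        os.write(fd, data)
    finally:
        os.close(fd)
```

Run under sudo, a helper like this avoids the 'echo foo | sudo tee' dance, since the open itself happens with elevated privileges.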

I don't have a strong preference for either sysfs or ioctl, nor am I
opposed to simply implementing both.  I'll let someone who does have
such a preference make their case.

If we're going to use device->type for this, and since we don't have a user of
device->type, I'd also like you to go ahead and re-name ->type to
->allocation_policy, that way it's clear what we're using it for now.

I'd also like some xfstests to validate the behavior so we're sure we're testing
this.  I'd want 1 test to just test the mechanics, like mkfs with different
policies and validate they're set right, change policies, add/remove disks with
different policies.

Then a second test to do something like fsstress with each set of allocation
policies to validate that we did actually allocate from the correct disks.  For
this please also test with compression on to make sure the test validation works
for both normal allocation and compression (i.e. it doesn't assume writing 5 GiB
of data == 5 GiB of data usage, as compression could give you a different value).
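One way to keep such a validation independent of compression is to check where chunks landed rather than how many bytes were used. The sketch below assumes that a metadata_only device must hold no data and a data_only device must hold no metadata; the tuple layout and function name are invented for this example, and the prefer_* policies are deliberately not checked because they are only preferences.

```python
def policy_violations(devices):
    """Return (devid, reason) pairs for devices that break an *_only policy.

    devices is an iterable of (devid, policy, data_bytes, metadata_bytes),
    e.g. gathered from per-device usage output after an fsstress run.
    """
    violations = []
    for devid, policy, data_bytes, metadata_bytes in devices:
        if policy == "metadata_only" and data_bytes > 0:
            violations.append((devid, "data on metadata_only device"))
        elif policy == "data_only" and metadata_bytes > 0:
            violations.append((devid, "metadata on data_only device"))
    return violations
```

Because the check only asks whether a byte count is zero, it gives the same answer whether 5 GiB of writes ended up as 5 GiB or 2 GiB on disk.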

With that in place I think this is the correct way to implement this feature.
Thanks,

Josef


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
