Thread (6 messages) 6 messages, 4 authors, 2021-06-17

Re: Combining nodatasum + compression

From: Qu Wenruo <hidden>
Date: 2021-06-11 11:15:55


On 2021/6/11 下午6:55, Martin Raiber wrote:
On 11.06.2021 02:18 Qu Wenruo wrote:
quoted
On 2021/6/10 下午10:32, Martin Raiber wrote:
quoted
when btrfs is running on a block device that improves integrity (e.g.
Ceph), it's usefull to run it with nodatasum to reduce the amount of
metadata and random IO.

In that case it would also be useful to be able to run it with
compression combined with nodatasum as well.

I actually found that if nodatasum is specified after compress-force,
that it doesn't remove reset the compress/nodatasum bit, while the other
way around it doesn't work.

That combined with
--- linux-5.10.39/fs/btrfs/inode.c.orig 2021-05-31 16:11:03.537017580
+0200
+++ linux-5.10.39/fs/btrfs/inode.c      2021-05-31 16:11:19.461591523
+0200
@@ -408,8 +408,7 @@
   */
  static inline bool inode_can_compress(struct btrfs_inode *inode)
  {
-       if (inode->flags & BTRFS_INODE_NODATACOW ||
-           inode->flags & BTRFS_INODE_NODATASUM)
+       if (inode->flags & BTRFS_INODE_NODATACOW)
                 return false;
         return true;
  }

should do the trick. >
The above code was added with the argument that having no checksums with
compression would damage too much data in case of corruption (
https://lore.kernel.org/linux-btrfs/20180515073622.18732-2-wqu@suse.com/ (local)
).
It doesn't make a difference whether it's a single device fs or not.
If we don't have csum, the corruption is not only affecting the sector
where the corruption is, but the full compressed extent.
Doesn't really matter if it doesn't happen if btrfs is on a e.g. ceph
volume.
But this still means, all other regular end user can hit a case where
they set nodatasum for a directory and later corrupt their data.

This will increase the corruption loss, if user just migrate from older
kernel to newer one.

Yeah, you can blame the users for doing that, but I'm still not that
sure if such behavior change is fine.

Especially when this increase the chance of data loss.
It may be fine for ceph, but btrfs still have a wider user base to consider.
Furthermore it depends on the use-case if corruption affecting
one page, or a whole compressed block is something one can live with.
That's true, but my point is, at least we shouldn't increase the data
loss possibility for the end users.

If in the past, we allowed NODATASUM + compression, then converting to
current situation, I'm more or less fine, since it reduce the
possibility of data loss.

But not in the other direction.
quoted
Furthermore, it's not that simple.

Current code we always expect compressed read to find some csum.
Just check btrfs_submit_compressed_read(), it will call
btrfs_lookup_bio_sums().
Which will fill the csum array with 0 if it can not find any csum.
It seems to make that conditional w.r.t. BTRFS_INODE_NODATASUM

if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
     ret = btrfs_lookup_bio_sums(inode, comp_bio, (u64)-1, sums);
     BUG_ON(ret); /* -ENOMEM */
}and in check_compressed_csum: if (inode->flags & BTRFS_INODE_NODATASUM)
return 0; This seems to have been moved around recently (
https://github.com/torvalds/linux/commit/334c16d82cfe180f7b262a6f8ae2d9379f032b18
).
Forgot that patch, I should look one function deeper to confirm the
behavior.

That indeed reduces my concerns from the code point of view.

Thanks,
Qu
quoted
Then at endio callbacks, we verify the csum against the data we read, if
it's all zero, the csum will definitely mismatch and discard the data no
matter if it's correct or not.

The same thing applies to btrfs_submit_compressed_write(), it will
always generate csum.
Same here:

int skip_sum = inode->flags & BTRFS_INODE_NODATASUM;

...

if (!skip_sum) {
         ret = btrfs_csum_one_bio(inode, bio, start, 1);
         BUG_ON(ret); /* -ENOMEM */
  }
quoted
The diff will just give you a false sense of compression without csum.
It will still generate csum for write and relies on csum check at read
time. >>
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help