Thread: 122 messages, 21 authors, 2021-08-25

Re: [PATCH/RFC 00/11] expose btrfs subvols in mount table correctly

From: Amir Goldstein <amir73il@gmail.com>
Date: 2021-07-30 05:28:19
Also in: linux-fsdevel, linux-nfs

On Fri, Jul 30, 2021 at 5:41 AM NeilBrown [off-list ref] wrote:

I've been pondering all the excellent feedback, and what I have learnt
from examining the code in btrfs, and I have developed a different
perspective.

Maybe "subvol" is a poor choice of name because it conjures up
connections with the Volumes in LVM, and btrfs subvols are very different
things.  Btrfs subvols are really just subtrees that can be treated as a
unit for operations like "clone" or "destroy".

As such, they don't really deserve separate st_dev numbers.

Maybe the different st_dev numbers were introduced as a "cheap" way to
extend the size of the inode-number space.  Like many "cheap" things, it
has hidden costs.

Maybe objects in different subvols should still be given different inode
numbers.  This would be problematic on 32bit systems, but much less so on
64bit systems.

The patch below, which is just a proof-of-concept, changes btrfs to
report a uniform st_dev, and different (64bit) st_ino in different subvols.

It has problems:
 - it will break any 32bit readdir and 32bit stat.  I don't know how big
   a problem that is these days (ino_t in the kernel is "unsigned long",
   not "unsigned long long".  That surprised me).
 - It might break some user-space expectations.  One thing I have learnt
   is not to make any assumption about what other people might expect.
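
The 32bit concern can be sketched in a few lines. The layout below is purely an assumption for illustration (subvolume id in the high bits, per-subvolume inode number in the low bits); the actual encoding used by the proof-of-concept patch is not shown in this message. The names `combined_ino` and `fits_32bit` are hypothetical. Any combined value above 0xFFFFFFFF cannot be represented in a 32bit ino field, which is the case where a 32bit stat() without large-file support is expected to fail with EOVERFLOW.

```python
INO32_MAX = 0xFFFF_FFFF  # largest value a 32-bit inode field can hold


def combined_ino(subvol_id: int, local_ino: int, shift: int = 40) -> int:
    """Hypothetical layout: subvol id in the high bits, local inode number below."""
    if local_ino >> shift or subvol_id >> (64 - shift):
        raise OverflowError("value does not fit in 64 bits with this layout")
    return (subvol_id << shift) | local_ino


def fits_32bit(ino: int) -> bool:
    """True if a 32-bit ino field could represent this inode number."""
    return ino <= INO32_MAX


for subvol_id in (0, 5):
    ino = combined_ino(subvol_id, 257)
    print(f"subvol {subvol_id}: ino={ino:#x} fits in 32 bits: {fits_32bit(ino)}")
```

Under this assumed layout only the subvolume with id 0 keeps inode numbers that a 32bit caller can read.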

However, it would be quite easy to make this opt-in (or opt-out) with a
mount option, so that people who need the current inode numbers and will
accept the current breakage can keep working.

I think this approach would be a net-win for NFS export, whether BTRFS
supports it directly or not.  I might post a patch which modifies NFS to
intuit improved inode numbers for btrfs exports....

So: how would this break your use-case??
The simple cases are find -xdev and du -x which expect the st_dev
change, but that can be excused if opting in to a unified st_dev namespace.
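
For reference, the behaviour these tools rely on can be approximated as follows. This is a minimal sketch, not the actual find or du implementation; `walk_one_dev` is a hypothetical helper that prunes any directory whose st_dev differs from that of the starting point.

```python
import os


def walk_one_dev(root):
    """Yield directories under root, without descending into a different st_dev."""
    root_dev = os.lstat(root).st_dev
    for dirpath, dirnames, _filenames in os.walk(root):
        # Prune in place so os.walk does not descend into other devices.
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        yield dirpath
```

With a uniform st_dev across subvols, the filter above never triggers at a subvol boundary, so such a walk descends into every subvol and snapshot.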

The harder problem is <st_dev;st_ino> collisions, which are not even
that hard to hit with an unlimited number of snapshots.
The 'diff' tool demonstrates what <st_dev;st_ino> collisions between
different objects mean for userspace.
See xfstest overlay/049 for a demonstration.
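
A minimal sketch of why a collision matters, assuming a tool that identifies files by the <st_dev;st_ino> pair alone. `Ident` and `same_object` are hypothetical names, and the numbers are made up for illustration.

```python
from typing import NamedTuple


class Ident(NamedTuple):
    """The identity many tools use for a file: the (st_dev, st_ino) pair."""
    st_dev: int
    st_ino: int


def same_object(a: Ident, b: Ident) -> bool:
    """Treat two stat results as the same file if dev and ino both match."""
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


# Two different files, e.g. in different snapshots, that collide after
# st_dev is unified and the inode numbers happen to overlap.
file_in_snap_a = Ident(st_dev=42, st_ino=257)
file_in_snap_b = Ident(st_dev=42, st_ino=257)
print(same_object(file_in_snap_a, file_in_snap_b))
```

A tool that takes this shortcut would consider the two files identical without reading their contents.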

The overlayfs xino feature made a similar change to overlayfs
<st_dev;st_ino> with one big difference - applications expect that
all objects in an overlayfs mount will have the same st_dev.

Also, overlayfs has prior knowledge of the number of layers
so it is easier to parcel the ino namespace and avoid collisions.
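
To illustrate the parceling idea, here is a simplified sketch and not the overlayfs code itself: when the number of layers is known up front, a fixed number of high bits can be reserved for the layer index. `layer_bits` and `encode_ino` are hypothetical names.

```python
def layer_bits(n_layers: int) -> int:
    """Number of high bits needed to tell n_layers layers apart."""
    return max(1, (n_layers - 1).bit_length())


def encode_ino(layer: int, real_ino: int, n_layers: int, width: int = 64) -> int:
    """Place the layer index in the high bits and the real inode number below."""
    if not 0 <= layer < n_layers:
        raise ValueError("layer index out of range")
    shift = width - layer_bits(n_layers)
    if real_ino >> shift:
        raise OverflowError("real inode number already uses the reserved bits")
    return (layer << shift) | real_ino
```

With an unbounded number of snapshots there is no fixed count to size the reserved bits against, which is why the same trick is harder to apply to subvols.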

Thanks,
Amir.