Re: [PATCH v2] BTRFS/NFSD: provide more unique inode number for btrfs export

From: NeilBrown <hidden>
Date: 2021-09-20 22:11:11
Also in: linux-nfs

On Tue, 14 Sep 2021, Amir Goldstein wrote:

On Tue, Sep 14, 2021 at 1:59 AM NeilBrown [off-list ref] wrote:


On Mon, 13 Sep 2021, Amir Goldstein wrote:


Right, so the right fix IMO would be to provide similar semantics
to the NFS client, like your first patch set tried to do.

Like every other approach, this sounds good and sensible ...  until
you examine the details.

For NFSv3 (RFC1813) this would be a protocol violation.
Section 3.3.3 (LOOKUP) says:
  A server will not allow a LOOKUP operation to cross a mountpoint to
  the root of a different filesystem, even if the filesystem is
  exported.

The filesystem is represented by the fsid, so this implies that the fsid
of an object reported by LOOKUP must be the same as the fsid of the
directory used in the LOOKUP.

Linux NFS does allow this restriction to be bypassed with the "crossmnt"
export option.  Maybe if crossmnt were given it would be defensible to
change the fsid - if crossmnt is not given, we leave the current
behaviour.  Note that this is a hack and while it is extremely useful,
it does not produce a seamless experience.  You can get exactly the same
problems with "find" - just not as uniformly (mounting with "-o noac"
makes them uniform).

I don't understand why we would need to talk about NFSv3.
This btrfs export issue has been with us for a while.
I see no reason to address it for old protocols if we can address
it with a new protocol with better support for the concept of fsid hierarchy.


For NFSv4, we need to provide a "mounted-on" fileid for any mountpoint.
btrfs doesn't have a mounted-on fileid that can be used.  We can fake
something that might work reasonably well - but it would be fake.  (but
then ... btrfs already provided bogus information in getdents when there
is a subvol root in the directory).

That seems easy to solve by passing some flag to ->encode_fh()
or if the behavior is persistent in btrfs by some mkfs/module/mount option
then btrfs_encode_fh() will always encode the subvol root inode as
resident of the parent tree-id, because nfsd anyway does not ->encode_fh()
for export roots, right?

->encode_fh has nothing to do with getting the mounted-on fileid.
With a normal mount point, there are two inodes, one in each vfsmount.
We can call ->getattr to get kstat info including the inode number.
nfsd does that for the underlying vfsmnt/inode to get the mounted-on
fileid.  What should it do for btrfs "subvols"?
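The "two inodes" point can be observed from userspace. The following is a minimal sketch, not anything nfsd does: it compares the inode number stored in each directory entry (d_ino, as returned by readdir) with the one reported by stat on the same path. On Linux, for a conventional mount point, the directory entry typically carries the number of the covered directory while stat reports the root of the mounted filesystem, so the two differ. What a given filesystem reports for a btrfs subvol root is exactly the open question above, so no particular output is assumed here. The function name dirent_vs_stat is hypothetical.

```python
import os


def dirent_vs_stat(directory):
    """Return (name, d_ino, st_ino) for entries whose two inode numbers differ.

    d_ino comes from the directory entry itself; st_ino comes from lstat()
    on the entry's path. A mismatch usually indicates a mount point or a
    filesystem that reports different numbers in the two places.
    """
    mismatches = []
    with os.scandir(directory) as entries:
        for entry in entries:
            d_ino = entry.inode()  # on Unix, taken from the readdir result
            st_ino = os.lstat(entry.path).st_ino
            if d_ino != st_ino:
                mismatches.append((entry.name, d_ino, st_ino))
    return mismatches


if __name__ == "__main__":
    # "/" is only an illustrative starting point; any directory will do.
    for name, d_ino, st_ino in dirent_vs_stat("/"):
        print(f"{name}: d_ino={d_ino} st_ino={st_ino}")
```

An entry listed by this sketch is a place where the parent directory and the object itself disagree about identity, which is the information a "mounted-on" fileid is meant to convey.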


But these are relatively minor.  The bigger problem is /proc/mounts.  If
btrfs maintainers were willing to have every active subvolume appear in
/proc/mounts, then I would be happy to fiddle the NFS fsid and allow
every active NFS/btrfs subvolume to appear in /proc/mounts on the NFS
client.  But they aren't.  So I am not.

I don't understand why you need to tie the two together.

Because they are the same thing.
The most concrete reason is that any name that appears in /proc/mounts is
public.  People understand that when they mount filesystems.  People
don't need to understand that when creating private subvols.
There is anecdotal evidence that people might expect subvol paths to be
private.  If they then access those subvols via NFS, the names suddenly
become public.

I would suggest:
1. Export different fsid's per subvols to NFSv4 based on statx()
exported tree-id
2. NFS client side uses user configuration to determine which subvols
to auto-mount

That is a non-starter.  Subvols currently don't need mounting, they
transparently appear.  Requiring client-side configuration would be a
major cost for some users.

3. [optional] Provide a way to configure btrfs using mkfs/module/mount option
    to behave locally the same as the NFS client, which will allow
    user configuration to determine which subvols to auto-mount locally

I admit that my understanding of the full picture is limited, but I don't
understand why #3 is a strict dependency for #1 and #2.


And I really don't see how an nfs export option would help...  Different
people within an organisation and using the same export might have
different expectations.

That's true.
But if admin decides to export a specific btrfs mount as a non-unified
filesystem, then NFS clients can decide whether or not to auto-mount the
exported subvolumes and different users on the client machine can decide
if they want to rsync or rsync --one-file-system, just as they would with
local btrfs.
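For readers less familiar with how a one-file-system walk is usually decided, here is a minimal sketch. It assumes the common convention that a tool records st_dev at the starting directory and refuses to descend into any directory whose st_dev differs. This is not rsync's code, and walk_one_file_system is a hypothetical name. The point is that the outcome depends entirely on which device or fsid each subvol presents to stat.

```python
import os


def walk_one_file_system(top):
    """Yield directories under top without crossing a change in st_dev."""
    top_dev = os.lstat(top).st_dev
    for dirpath, dirnames, _filenames in os.walk(top):
        # Prune in place so os.walk does not descend into directories
        # that report a different device number than the starting point.
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == top_dev
        ]
        yield dirpath


if __name__ == "__main__":
    # "." is only an illustrative starting point.
    for path in walk_one_file_system("."):
        print(path)
```

If every subvol reports its own st_dev, this walk stops at each subvol root; if they all share one, it walks straight through them.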

And maybe I am wrong, but I don't see how the decision on whether to
export a non-unified btrfs can be made a btrfs option or a nfsd global
option, that's why I ended up with export option.

Just because a btrfs option and global nfsd option are bad, that doesn't
mean an export option must be good.  It needs to be presented and
defended on its own merits.

My current opinion (and I must admit I am feeling rather jaded about the
whole thing), is that while btrfs is a very interesting and valuable
experiment in fs design, it contains core mistakes that cannot be
incrementally fixed.  It should be marked as legacy with all current
behaviour declared as intentional and not subject to change.  This would
make way for a new "betrfs" which was designed based on all that we have
learned.  It would use the same code base, but present a more coherent
interface.  Exactly what that interface would be has yet to be decided,
but we would not be bound to maintain anything just because btrfs
supports it.

There is no need for a new driver name (like ext3=>ext4)
Both ext4 and xfs have features that can be determined in mkfs time.
This user experience change does not involve on-disk format changes,
so it is a much easier case, because at least technically, there is nothing
preventing an administrator from turning the user experience feature
on and off with proper care of the consequences.

Which brings me to another point.
This discussion presents several technical challenges and you have
been very creative in presenting technical solutions, but I think that
the nature of the problem is more on the administrative side.

I see this as an unfortunate flaw in our design process, when
filesystem developers have long discussions about issues where
some of the material stakeholders (i.e. administrators) are not in the loop.
But I do not have very good ideas on how to address this flaw.

I agree this is more than a technical question.  I don't see it as
particularly an admin issue, because non-admin users can create subvols.

I see it as a conceptual problem.  What is a "subvol"? What do we want
it to be?  Does it make sense for the subvol namespace to align with the
filesystem namespace?
Subvols are more than directories, but less than filesystems.  How can
we best characterise them and think about them? Are they directories
with extra features, or filesystems with some limitation (and some extra
features)?  Or are they something completely new?
What sort of identity information do applications *need* about files and
filesystems and how can we best provide that within the context of
existing APIs?

NeilBrown
