Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.

From: NeilBrown <hidden>
Date: 2021-07-20 00:33:52
Also in: linux-nfs

On Tue, 20 Jul 2021, Josef Bacik wrote:

On 7/19/21 4:00 PM, J. Bruce Fields wrote:


On Mon, Jul 19, 2021 at 11:40:28AM -0400, Josef Bacik wrote:


Ok so setting aside btrfs for the moment, how does NFS deal with
exporting a directory that has multiple other file systems under
that tree?  I assume the same sort of problem doesn't occur, but why
is that?  Is it because it's a different vfsmount/sb or is there
some other magic making this work?  Thanks,

There are two main ways an NFS client can look up a file: by name or by
filehandle.  The former's the normal filesystem directory lookup that
we're used to.  If the name refers to a mountpoint, the server can cross
into the mounted filesystem like anyone else.

It's the lookup by filehandle that's interesting.  Typically the
filehandle includes a UUID and an inode number.  The server looks up the
UUID with some help from mountd, and that gives a superblock that nfsd
can use for the inode lookup.

As Neil says, mountd does that basically by searching among mounted
filesystems for one with that uuid.

So if you wanted to be able to handle a uuid for a filesystem that's not
even mounted yet, you'd need some new mechanism to look up such uuids.

That's something we don't currently support but that we'd need to
support if BTRFS subvolumes were automounted.  (And it might have other
uses as well.)

But I'm not entirely sure if that answers your question....

Right, because btrfs handles the filehandles itself with
export_operations, and we encode the subvolume ids into them to make
sure we can always do the proper lookup.

I suppose the real problem is that NFS is exposing the inode->i_ino to the 
client without understanding that it's on a different subvolume.

Our trick of simply allocating an anonymous bdev every time you wander into a 
subvolume to get a unique st_dev doesn't help you guys because you are looking 
for mounted file systems.

I'm not concerned about the FH case, because for that it's already been crafted 
by btrfs and we know what to do with it, so it's always going to be correct.

The actual problem is that we can do

getattr(/file1)
getattr(/snap/file1)

on the client and the NFS server just blindly sends i_ino with the same fsid
because / and /snap are the same fsid.
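
To make the collision concrete, here is a minimal sketch in C.  It is a
model only: struct nfs_ident, server_getattr() and the XOR mixing are
hypothetical names invented for illustration, not nfsd code, and the
inode number 257 and subvolume ids 5 and 260 are example values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of what a client uses to tell objects apart. */
struct nfs_ident {
    uint64_t fsid;    /* filesystem id reported by the server */
    uint64_t fileid;  /* inode number (i_ino) reported by the server */
};

/* The client treats two objects as identical when both fields match. */
bool client_sees_same_object(struct nfs_ident a, struct nfs_ident b)
{
    return a.fsid == b.fsid && a.fileid == b.fileid;
}

/*
 * Build a getattr reply.  With per_subvol_fsid == false the export's
 * fsid is sent unchanged, which models the behaviour described above.
 * The XOR is an illustrative way to derive a distinct fsid per
 * subvolume, not a proposal for the real encoding.
 */
struct nfs_ident server_getattr(uint64_t export_fsid, uint64_t subvol_id,
                                uint64_t i_ino, bool per_subvol_fsid)
{
    struct nfs_ident id = { .fsid = export_fsid, .fileid = i_ino };

    if (per_subvol_fsid)
        id.fsid ^= subvol_id;
    return id;
}

/* /file1 and /snap/file1 share an inode number because /snap is a snapshot. */
void demo_collision(void)
{
    struct nfs_ident f = server_getattr(0x1234, 5, 257, false);
    struct nfs_ident s = server_getattr(0x1234, 260, 257, false);

    assert(client_sees_same_object(f, s));   /* the bug: they collide */
}
```

With per_subvol_fsid set to true the two replies differ in fsid, so the
client can keep the files apart.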

Which brings us back to what HCH is complaining about.  In his view if we had a 
vfsmount for /snap then you would know that it was a different fs.  However that 
would only actually work if we generated a completely different superblock and 
thus gave /snap a unique fsid, right?

No, I don't think it needs to be a different superblock to have a
vfsmount.  (I don't know if it does to keep HCH happy).

If I "mount --bind /snap /snap" then I've created a vfsmnt with the
upper and lower directories identical - same inode, same superblock.
This is an existence-proof that you don't need a separate super-block.

If we did the automount thing, and the NFS server went down and came back up and 
got a getattr(/snap/file1) from a previously generated FH it would still work 
right, because it would come into the export_operations with the format that 
btrfs is expecting and it would be able to do the lookup.  This FH lookup would 
do the automount magic it needs to and then NFS would have the fsid it needs, 
correct?  Thanks,

Not quite.
An NFS filehandle (as generated by linux-nfsd) has two components (plus
a header).  The filesystem-part and the file-part.
The filesystem-part is managed by userspace (/usr/sbin/mountd).  The
code relies on every filesystem appearing in /proc/self/mounts.
The bytes chosen are either based on the uuid reported by 'libblkid', or the
fsid reported by statfs(), based on a black-list of filesystems for
which libblkid is not useful.  This list includes btrfs.
The file-part is managed in the kernel using export_operations.
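
As a rough sketch of the choice mountd makes for the filesystem-part
(this is a model, not the mountd source; the enum, the function name and
the one-entry list are assumptions, and only "btrfs" is an entry I have
stated above):

```c
#include <assert.h>
#include <string.h>

/* Where the bytes of the filesystem-part come from (hypothetical names). */
enum fsid_source {
    FSID_FROM_BLKID_UUID,   /* uuid reported by libblkid */
    FSID_FROM_STATFS        /* fsid reported by statfs() */
};

/* Filesystems for which libblkid is not useful.  The real list may
 * hold more entries; only btrfs is confirmed in this mail. */
static const char *const no_blkid_fs[] = { "btrfs" };

enum fsid_source pick_fsid_source(const char *fstype)
{
    assert(fstype != NULL);

    for (size_t i = 0; i < sizeof(no_blkid_fs) / sizeof(no_blkid_fs[0]); i++) {
        if (strcmp(fstype, no_blkid_fs[i]) == 0)
            return FSID_FROM_STATFS;
    }
    return FSID_FROM_BLKID_UUID;
}
```

So pick_fsid_source("btrfs") selects the statfs() fsid, and any
filesystem type not on the list falls back to the libblkid uuid.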

For any given 'struct path' in the kernel, a filehandle is generated
(conceptually) in two steps: find the closest vfsmnt (close to inode, far
from root) and ask user-space to map that, then pass the inode to the
filesystem and ask it to map that.

So, in your example, if /snap were a mount point, the kernel would ask
mountd to determine the filesystem-part of /snap, and the fact that the
file-part from btrfs contained the objectid for snap would just be redundant
information.  If /snap couldn't be found in /proc/self/mounts after a
server restart, the filehandle would be stale.
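
The check that fails in that case is essentially "is this directory
listed in the mount table?".  A minimal sketch using the glibc
getmntent() interface follows; is_mount_point() is a hypothetical
helper, not a function from mountd, and it takes the table's file name
as a parameter so that it can be pointed at /proc/self/mounts or at a
test file.

```c
#include <assert.h>
#include <mntent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Return true if 'dir' appears as a mount point in 'mounts_file'. */
bool is_mount_point(const char *mounts_file, const char *dir)
{
    assert(mounts_file != NULL && dir != NULL);

    FILE *fp = setmntent(mounts_file, "r");
    if (fp == NULL)
        return false;           /* unreadable table: treat as not mounted */

    bool found = false;
    struct mntent *ent;

    while ((ent = getmntent(fp)) != NULL) {
        if (strcmp(ent->mnt_dir, dir) == 0) {
            found = true;
            break;
        }
    }
    endmntent(fp);
    return found;
}
```

If is_mount_point("/proc/self/mounts", "/snap") is false after the
restart, there is nothing to map the filesystem-part back to, whatever
btrfs put in the file-part.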

If btrfs were to use automounts and create the vfsmnts that one might
normally expect, then nfsd would need there to be two different sorts of
mount points, ideally visible in /proc/mounts (maybe a new flag that
appears in the list of mount options? "internal" ??).

- there needs to be the current mountpoint which is expected to be
  present after a reboot, and is likely to introduce a new filesystem,
  and
- there are these "new" mountpoints which are on-demand and expose
  something that is (in some sense) part of the same filesystem.
  The key property that NFSd would depend on is that these mount points
  do NOT introduce a new name-space for file-handles (in the sense of
  export_operations).

To expand on that last point:
- If a filehandle is requested for an inode above the "new" mountpoint
  and another "below" the new mountpoint, they are guaranteed to be
  different.
- If a filehandle that was "below" the new mountpoint is passed to
  exportfs_decode_fh() together with the vfsmnt that was *above* the
  mountpoint, then it somehow does "the right thing".  Probably
  that would require changing exportfs_decode_fh() to return a
  'struct path' rather than just a 'struct dentry *'.

When nfsd detected one of these "internal" mountpoints during a lookup,
it would *not* call-out to user-space to create a new export, but it
*would* ensure that a new fsid was reported for all inodes in the new
vfsmnt.
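
As a sketch of that decision (a model of the proposal only; none of
these names exist in nfsd, and the behaviour shown for regular
mountpoints assumes the export allows crossing them):

```c
#include <assert.h>
#include <stdbool.h>

/* What nfsd stepped over during a lookup (hypothetical classification). */
enum crossing {
    CROSS_NONE,             /* stayed inside the same vfsmnt */
    CROSS_REGULAR_MOUNT,    /* ordinary mountpoint, likely a new filesystem */
    CROSS_INTERNAL_MOUNT    /* on-demand "internal" mountpoint */
};

struct cross_result {
    bool upcall_to_mountd;  /* ask user-space to create a new export? */
    bool new_fsid;          /* report a different fsid below this point? */
};

struct cross_result on_lookup_crossing(enum crossing kind)
{
    struct cross_result r = { false, false };

    switch (kind) {
    case CROSS_NONE:
        break;
    case CROSS_REGULAR_MOUNT:
        r.upcall_to_mountd = true;   /* new filehandle name-space */
        r.new_fsid = true;
        break;
    case CROSS_INTERNAL_MOUNT:
        r.new_fsid = true;           /* same name-space, so no upcall */
        break;
    default:
        assert(0 && "unknown crossing kind");
    }
    return r;
}
```

The only row that differs from today's behaviour is
CROSS_INTERNAL_MOUNT: a new fsid without a new export.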

NeilBrown
