Re: [PATCH 2/2] xfs: make sure link path does not go away at access

From: Dave Chinner <david@fromorbit.com>
Date: 2021-11-14 23:19:08
Also in: linux-fsdevel, lkml

On Fri, Nov 12, 2021 at 08:23:24AM +0100, Miklos Szeredi wrote:

On Fri, 12 Nov 2021 at 01:32, Dave Chinner [off-list ref] wrote:

quoted

On Thu, Nov 11, 2021 at 11:39:30AM +0800, Ian Kent wrote:

quoted

When following a trailing symlink in rcu-walk mode it's possible to
succeed in getting the ->get_link() method pointer but the link path
string be deallocated while it's being used.

Utilize the rcu mechanism to mitigate this risk.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Ian Kent <raven@themaw.net>
---
 fs/xfs/kmem.h      |    4 ++++
 fs/xfs/xfs_inode.c |    4 ++--
 fs/xfs/xfs_iops.c  |   10 ++++++++--
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 54da6d717a06..c1bd1103b340 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h

@@ -61,6 +61,10 @@ static inline void  kmem_free(const void *ptr)
 {
      kvfree(ptr);
 }
+static inline void  kmem_free_rcu(const void *ptr)
+{
+     kvfree_rcu(ptr);
+}


 static inline void *

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index a4f6f034fb81..aaa1911e61ed 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c

@@ -2650,8 +2650,8 @@ xfs_ifree(
       * already been freed by xfs_attr_inactive.
       */
      if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
-             kmem_free(ip->i_df.if_u1.if_data);
-             ip->i_df.if_u1.if_data = NULL;
+             kmem_free_rcu(ip->i_df.if_u1.if_data);
+             RCU_INIT_POINTER(ip->i_df.if_u1.if_data, NULL);
              ip->i_df.if_bytes = 0;
      }

How do we get here in a way that the VFS will walk into this inode
during a lookup?

I mean, the dentry has to be validated and held during the RCU path
walk, so if we are running a transaction to mark the inode as free
here it has already been unlinked and the dentry turned
negative. So anything that is doing a lockless pathwalk onto that
dentry *should* see that it is a negative dentry at this point and
hence nothing should be walking any further or trying to access the
link that was shared from ->get_link().

AFAICT, that's what the sequence check bug you fixed in the previous
patch guarantees. It makes no difference if the unlinked inode has
been recycled or not, the lookup race condition is the same in that
the inode has gone through ->destroy_inode and is now owned by the
filesystem and not the VFS.

Yes, the concern here is that without locking all the above can
theoretically happen between the sequence number check and  if_data
being dereferenced.

It would be good to describe the race condition in the commit message,
because it's not at all obvious to readers how this race condition
is triggered.

quoted

Otherwise, it might just be best to memset the buffer to zero here
rather than free it, and leave it to be freed when the inode is
freed from the RCU callback in xfs_inode_free_callback() as per
normal.

Just as a FYI ext4_evict_inode() does exactly this with it's inline
symlink data buffer it passes to ->get_link() when it is freeing the
inode when the last reference to an unlinked inode goes away:

        /*
         * Set inode->i_size to 0 before calling ext4_truncate(). We need
         * special handling of symlinks here because i_size is used to
         * determine whether ext4_inode_info->i_data contains symlink data or
         * block mappings. Setting i_size to 0 will remove its fast symlink
         * status. Erase i_data so that it becomes a valid empty block map.
         */
        if (ext4_inode_is_fast_symlink(inode))
                memset(EXT4_I(inode)->i_data, 0, sizeof(EXT4_I(inode)->i_data));
	inode->i_size = 0;

IOWs, if the pointer returned from ->get_link() on an ext4
filesystem is accessed by the VFS after ->evict() is called, then it
sees an empty buffer. IOWs, it's just not safe for the VFS to access
the inode's link buffer pointer the moment the last reference to an
unlinked inode goes away because it's contents are no longer
guaranteed to be valid.

I note that ext4 then immediately sets the VFS inode size to 0
length, indicating the link is no longer valid. It may well be that
XFS is not setting the VFS inode size to zero in this case (I don't
think the generic evict() path does that) so perhaps that's a guard
that could be used at the VFS level to avoid the race condition...

My suggestion was to use .free_inode instead of .destroy_inode, the
former always being called after an RCU grace period.

Which doesn't address the ext4 "zero the buffer in .evict", either.

And for XFS, it means we then likely need to add synchronise_rcu()
calls wherever we need to wait for inodes to be inactivated and/or
reclaimed because they won't even get queued for inactivation until
the current grace period expires. That's a potential locking
nightmare because inode inactivation and reclaim uses rcu protected
lockless algorithms to access XFS inodes without taking reference
counts....

I just can't see how this race condition is XFS specific and why
fixing it requires XFS to sepcifically handle it while we ignore
similar theoretical issues in other filesystems...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help