Re: [PATCH 2/2] xfs: make sure link path does not go away at access

From: Ian Kent <raven@themaw.net>
Date: 2021-11-16 01:07:03
Also in: linux-fsdevel, lkml

On Tue, 2021-11-16 at 09:24 +1100, Dave Chinner wrote:

On Mon, Nov 15, 2021 at 10:21:03AM +0100, Miklos Szeredi wrote:

quoted

On Mon, 15 Nov 2021 at 00:18, Dave Chinner [off-list ref]
wrote:

quoted

I just can't see how this race condition is XFS specific and why
fixing it requires XFS to sepcifically handle it while we ignore
similar theoretical issues in other filesystems...

It is XFS specific, because all other filesystems RCU free the in-
core
inode after eviction.

XFS is the only one that reuses the in-core inode object and that
is
very much different from anything the other filesystems do and what
the VFS expects.

Sure, but I was refering to the xfs_ifree issue that the patch
addressed, not the re-use issue that the *first patch addressed*.

quoted

I don't see how clearing the quick link buffer in
ext4_evict_inode()
could do anything bad.  The contents are irrelevant, the lookup
will
be restarted anyway, the important thing is that the buffer is not
freed and that it's null terminated, and both hold for the ext4,
AFAICS.

You miss the point (which, admittedly, probably wasn't clear).

I suggested just zeroing the buffer in xfs_ifree instead of zeroing
it, which you seemed to suggest wouldn't work and we should move the
XFS functionality to .free_inode. That's what I was refering to as
"not being XFS specific" - if it is safe for ext4 to zero the link
buffer in .evict while lockless lookups can still be accessing the
link buffer, it is safe for XFS to do the same thing in .destroy
context.

I'll need to think about that for a while.

Zeroing the buffer while it's being used seems like a problem to
me and was what this patch was trying to avoid.

I thought all that would be needed for this to happen is for a
dentry drop to occur while the link walk was happening after
->get_link() had returned the pointer.

What have I got wrong in that thinking?

If it isn't safe for ext4 to do that, then we have a general
pathwalk problem, not an XFS issue. But, as you say, it is safe to
do this zeroing, so the fix to xfs_ifree() is to zero the link
buffer instead of freeing it, just like ext4 does.

As a side issue, we really don't want to move what XFS does in
.destroy_inode to .free_inode because that then means we need to add
synchronise_rcu() calls everywhere in XFS that might need to wait on
inodes being inactivated and/or reclaimed. And because inode reclaim
uses lockless rcu lookups, there's substantial danger of adding rcu
callback related deadlocks to XFS here. That's just not a direction
we should be moving in.

Another reason I decided to use the ECHILD return instead is that
I thought synchronise_rcu() might add an unexpected delay.

Since synchronise_rcu() will only wait for processes that currently
have the rcu read lock do you think that could actually be a problem
in this code path?

I'll also point out that this would require XFS inodes to pass
through *two* rcu grace periods before the memory they hold could be
freed because, as I mentioned, xfs inode reclaim uses rcu protected
inode lookups and so relies on inodes to be freed by rcu callback...

quoted

I tend to agree with Brian and Ian at this point: return -ECHILD
from
xfs_vn_get_link_inline() until xfs's inode resue vs. rcu walk
implications are fully dealt with.  No way to fix this from VFS
alone.

I disagree from a fundamental process POV - this is just sweeping
the issue under the table and leaving it for someone else to solve
because the root cause of the inode re-use issue has not been
identified. But to the person who architected the lockless XFS inode
cache 15 years ago, it's pretty obvious, so let's just solve it now.

Sorry, I don't understand what you mean by the root cause not
being identified?

Until lockless path walking was introduced this wasn't a problem
because references were held during walks so there could never be
a final dput() to trigger freeing process during a walk. And a lot
of code probably still makes that assumption. And code that does
make that assumption should return -ECHILD in cases like this so
that the VFS can either legitimize the struct path (by taking
references) or restart the walk in ref-walk mode.

Can you elaborate please?

With the xfs_ifree() problem solved by zeroing rather than freeing,
then the only other problem is inode reuse *within an rcu grace
period*. Immediate inode reuse tends to be rare, (we can actually
trace occurrences to validate this assertion), and implementation
wise reuse is isolated to a single function: xfs_iget_recycle().

xfs_iget_recycle() drops the rcu_read_lock() inode lookup context
that found the inode marks it as being reclaimed (preventing other
lookups from finding it), then re-initialises the inode. This is
what makes .get_link change in the middle of pathwalk - we're
reinitialising the inode without waiting for the RCU grace period to
expire.

Ok, good to know that, there's a lot of icache code to look
through, ;)

At this point I come back to thinking the original patch might
be sufficient. But then that's only for xfs and excludes
potential problems with other file systems so I'll not go
there.

The obvious thing to do here is that after we drop the RCU read
context, we simply call synchronize_rcu() before we start
re-initialising the inode to wait for the current grace period to
expire. This ensures that any pathwalk that may have found that
inode has seen the sequence number change and droppped out of
lockless mode and is no longer trying to access that inode.  Then we
can safely reinitialise the inode as it has passed through a RCU
grace period just like it would have if it was freed and
reallocated.

Sounds right to me, as long as it is ok to call synchronize_rcu()
here.

This completely removes the entire class of "reused inodes race with
VFS level RCU walks" bugs from the XFS inode cache implementation,
hence XFS inodes behave the same as all other filesystems w.r.t RCU
grace period expiry needing to occur before a VFS inode is reused.

So, it looks like three patches to fix this entirely:

1. the pathwalk link sequence check fix
2. zeroing the inline link buffer in xfs_ifree()

I'm sorry but I'm really having trouble understanding how this is
ok. If some process is using the buffer to walk a link path how
can zeroing the contents of the buffer be ok?

3. adding synchronize_rcu() (or some variant) to xfs_iget_recycle()

Cheers,

Dave.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help