Re: [PATCH 2/2] xfs: make sure link path does not go away at access

From: Brian Foster <hidden>
Date: 2021-11-22 19:28:12
Also in: linux-fsdevel, lkml
Subsystem: filesystems (vfs and infrastructure), the rest, xfs filesystem · Maintainers: Alexander Viro, Christian Brauner, Linus Torvalds, Carlos Maiolino

On Mon, Nov 22, 2021 at 11:08:51AM +1100, Dave Chinner wrote:

On Fri, Nov 19, 2021 at 02:44:21PM -0500, Brian Foster wrote:

quoted

On Thu, Nov 18, 2021 at 08:48:52AM +1100, Dave Chinner wrote:

quoted

On Wed, Nov 17, 2021 at 01:56:17PM -0500, Brian Foster wrote:

quoted

On Wed, Nov 17, 2021 at 11:22:51AM +1100, Dave Chinner wrote:
Only a single grace period is required to cover
(from the rcuwalk perspective) the entire set of inodes queued for
inactivation. That leaves at least a few fairly straightforward options:

1. Use queue_rcu_work() to schedule the inactivation task. We'd probably
have to isolate the list to process first from the queueing context
rather than from workqueue context to ensure we don't process recently
added inodes that haven't sat for a grace period.

No, that takes too long. Long queues simply mean deferred
inactivation is working on cold CPU caches and that means we take a
30-50% performance hit on inode eviction overhead for inodes that
need inactivation (e.g. unlinked inodes) just by having to load all
the inode state into CPU caches again.

Numbers I recorded at the time indicate that inactivation that
doesn't block on IO or the log typically takes between 200-500us
of CPU time, so the deferred batch sizes are sized to run about
10-15ms worth of deferred processing at a time. Filling a batch
takes memory reclaim about 200uS to fill when running
dispose_list() to evict inodes.

The problem with using RCU grace periods is that they delay the
start of the work for at least 10ms, sometimes hundreds of ms.
Using queue_rcu_work() means we will need to go back to unbound
depth queues to avoid blocking waiting for grace period expiry to
maintain perfomrance. THis means having tens of thousands of inodes
queued for inactivation before the workqueue starts running. These
are the sorts of numbers that caused all the problems Darrick was
having with performance, and that was all cold cache loading
overhead which is unavoidable with such large queue depths....

Hm, Ok. I recall the large queue depth issues on the earlier versions
but had not caught up with the subsequent changes that limit (percpu)
batch size, etc. The cond sync rcu approach is easy enough to hack in
(i.e., sample gp on destroy, cond_sync_rcu() on inactivate) that I ran a
few experiments on opposing ends of the spectrum wrt to concurrency.

The short of it is that I do see about a 30% hit in the single threaded
sustained removal case with current batch sizes. If the workload scales
out to many (64) cpus, the impact dissipates, I suspect because we've
already distributed the workload across percpu wq tasks and thus drive
the rcu subsystem with context switches and other quiescent states that
progress grace periods. The single threaded perf hit mitigates at about
4x the current throttling threshold.

I doubt that thread count increases are actually mitigating the perf
hit. Performance hits hard limits on concurrent rm -rf threads due
to CIL lock contention at 7-800,000 transactions/s
(hence the scalability patchset) regardless of the concurrency of
the workload. With that bottleneck removed, the system then hits
contention limits on VFS locks during
inode instantiation/reclaim. This typically happens at 1.1-1.2
million transactions/s during unlink.

Essentially, if you have a slower per-thread fs modification
workload, you can can increase the concurrency to more threads and
CPUs but the system will eventually still hit the same throughput
limits. Hence a per-thread performance degradataion will still reach
the same peak throughput levels, it will just take a few more
threads to reach that limit. IOWs, scale doesn't make the
per-thread degradation go away, it just allows more threads to run
at full (but degraded) performance before the scalabilty limit
threshold is hit.

All I'm testing for here is how the prospective rcu tweak behaves at
scale. It introduces a noticeable hit with a single thread and no real
noticeable change at scale. FWIW, I see a similar result with a quick
test on top of the log scalability changes. I.e., I see a noticeable
improvement in concurrent rm -rf from a baseline kernel to baseline +
log scalability, and that improvement persists with the rcu hack added
on top.

quoted

We could, potentially, use a separate lockless queue for unlinked
inodes and defer that to after a grace period, but then rm -rf
workloads will go much, much slower.

I don't quite follow what you mean by a separate lockless queue..?

I was thinking separating unlinked symlinks into their own queue
that can be processed after a grace period....

Ok. I had a similar idea in mind, but rather than handle inode types
specially just split the single queue into multiple queues that can
pipeline and potentially mitigate rcu delays. This essentially takes the
previous optimization (that mitigated the hit by about 50%) a bit
further, but I'm not totally sure it's a win without testing it.

quoted

In
any event, another experiment I ran in light of the above results that
might be similar was to put the inode queueing component of
destroy_inode() behind an rcu callback. This reduces the single threaded
perf hit from the previous approach by about 50%. So not entirely
baseline performance, but it's back closer to baseline if I double the
throttle threshold (and actually faster at 4x). Perhaps my crude
prototype logic could be optimized further to not rely on percpu
threshold changes to match the baseline.

My overall takeaway from these couple hacky experiments is that the
unconditional synchronous rcu wait is indeed probably too heavy weight,
as you point out. The polling or callback (or perhaps your separate
queue) approach seems to be in the ballpark of viability, however,
particularly when we consider the behavior of scaled or mixed workloads
(since inactive queue processing seems to be size driven vs. latency
driven).

So I dunno.. if you consider the various severity and complexity
tradeoffs, this certainly seems worth more consideration to me. I can
think of other potentially interesting ways to experiment with
optimizing the above or perhaps tweak queueing to better facilitate
taking advantage of grace periods, but it's not worth going too far down
that road if you're wedded to the "busy inodes" approach.

I'm not wedded to "busy inodes" but, as your experiments are
indicating, trying to handle rcu grace periods into the deferred
inactivation path isn't completely mitigating the impact of having
to wait for a grace period for recycling of inodes.

What I'm seeing so far is that the impact seems to be limited to the
single threaded workload and largely mitigated by an increase in the
percpu throttle limit. IOW, it's not completely free right out of the
gate, but the impact seems isolated and potentially mitigated by
adjustment of the pipeline.

I realize the throttle is a percpu value, so that is what has me
wondering about some potential for gains in efficiency to try and get
more of that single-threaded performance back in other ways, or perhaps
enhancements that might be more broadly beneficial to deferred
inactivations in general (i.e. some form of adaptive throttling
thresholds to balance percpu thresholds against a global threshold).

I suspect a rethink on the inode recycling mechanism is needed. THe
way it is currently implemented was a brute force solution - it is
simple and effective. However, we need more nuance in the recycling
logic now.  That is, if we are recycling an inode that is clean, has
nlink >=1 and has not been unlinked, it means the VFS evicted it too
soon and we are going to re-instantiate it as the identical inode
that was evicted from cache.

Probably. How advantageous is inode memory reuse supposed to be in the
first place? I've more considered it a necessary side effect of broader
architecture (i.e. deferred reclaim, etc.) as opposed to a primary
optimization. I still see reuse occur with deferred inactivation, we
just end up cycling through the same set of inodes as they fall through
the queue rather than recycling the same one over and over. I'm sort of
wondering what the impact would be if we didn't do this at all (for the
new allocation case at least).

So how much re-initialisation do we actually need for that inode?
Almost everything in the inode is still valid; the problems come
from inode_init_always() resetting the entire internal inode state
and XFS then having to set them up again.  The internal state is
already largely correct when we start recycling, and the identity of
the recycled inode does not change when nlink >= 1. Hence eliding
inode_init_always() would also go a long way to avoiding the need
for a RCU grace period to pass before we can make the inode visible
to the VFS again.

If we can do that, then the only indoes that need a grace period
before they can be recycled are unlinked inodes as they change
identity when being recycled. That identity change absolutely
requires a grace period to expire before the new instantiation can
be made visible.  Given the arbitrary delay that this can introduce
for an inode allocation operation, it seems much better suited to
detecting busy inodes than waiting for a global OS state change to
occur...

Maybe..? The experiments I've been doing are aimed at simplicity and
reducing the scope of the changes. Part of the reason for this is tbh
I'm not totally convinced we really need to do anything more complex
than preserve the inline symlink buffer one way or another (for example,
see the rfc patch below for an alternative to the inline symlink rcuwalk
disable patch). Maybe we should consider something like this anyways.

With busy inodes, we need to alter inode allocation to some degree to
accommodate. We can have (tens of?) thousands of inodes under the grace
period at any time based on current batching behavior, so it's not
totally evident to me that we won't end up with some of the same
fundamental issues to deal with here, just needing to be accommodated in
the inode allocation algorithm rather than the teardown sequence.

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 64b9bf334806..058e3fc69ff7 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c

@@ -2644,7 +2644,7 @@ xfs_ifree(
 	 * already been freed by xfs_attr_inactive.
 	 */
 	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
-		kmem_free(ip->i_df.if_u1.if_data);
+		kfree_rcu(ip->i_df.if_u1.if_data);
 		ip->i_df.if_u1.if_data = NULL;
 		ip->i_df.if_bytes = 0;
 	}

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a607d6aca5c4..e98d7f10ba7d 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c

@@ -511,27 +511,6 @@ xfs_vn_get_link(
 	return ERR_PTR(error);
 }
 
-STATIC const char *
-xfs_vn_get_link_inline(
-	struct dentry		*dentry,
-	struct inode		*inode,
-	struct delayed_call	*done)
-{
-	struct xfs_inode	*ip = XFS_I(inode);
-	char			*link;
-
-	ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
-
-	/*
-	 * The VFS crashes on a NULL pointer, so return -EFSCORRUPTED if
-	 * if_data is junk.
-	 */
-	link = ip->i_df.if_u1.if_data;
-	if (XFS_IS_CORRUPT(ip->i_mount, !link))
-		return ERR_PTR(-EFSCORRUPTED);
-	return link;
-}
-
 static uint32_t
 xfs_stat_blksize(
 	struct xfs_inode	*ip)

@@ -1250,14 +1229,6 @@ static const struct inode_operations xfs_symlink_inode_operations = {
 	.update_time		= xfs_vn_update_time,
 };
 
-static const struct inode_operations xfs_inline_symlink_inode_operations = {
-	.get_link		= xfs_vn_get_link_inline,
-	.getattr		= xfs_vn_getattr,
-	.setattr		= xfs_vn_setattr,
-	.listxattr		= xfs_vn_listxattr,
-	.update_time		= xfs_vn_update_time,
-};
-
 /* Figure out if this file actually supports DAX. */
 static bool
 xfs_inode_supports_dax(

@@ -1409,9 +1380,8 @@ xfs_setup_iops(
 		break;
 	case S_IFLNK:
 		if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL)
-			inode->i_op = &xfs_inline_symlink_inode_operations;
-		else
-			inode->i_op = &xfs_symlink_inode_operations;
+			inode->i_link = ip->i_df.if_u1.if_data;
+		inode->i_op = &xfs_symlink_inode_operations;
 		break;
 	default:
 		inode->i_op = &xfs_inode_operations;

diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index fc2c6a404647..20ec2f450c56 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c

@@ -497,6 +497,7 @@ xfs_inactive_symlink(
 	 * do here in that case.
 	 */
 	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
+		WRITE_ONCE(VFS_I(ip)->i_link, NULL);
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		return 0;
 	}

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help