Re: [PATCH 03/20] xfs: defer inode inactivation to a workqueue

From: Dave Chinner <david@fromorbit.com>
Date: 2021-08-01 21:49:16

On Fri, Jul 30, 2021 at 09:21:12PM -0700, Darrick J. Wong wrote:

On Fri, Jul 30, 2021 at 02:24:00PM +1000, Dave Chinner wrote:

On Thu, Jul 29, 2021 at 11:44:10AM -0700, Darrick J. Wong wrote:

From: Darrick J. Wong <djwong@kernel.org>

Instead of calling xfs_inactive directly from xfs_fs_destroy_inode,
defer the inactivation phase to a separate workqueue.  With this change,
we can speed up directory tree deletions by reducing the duration of
unlink() calls to the directory and unlinked list updates.

By moving the inactivation work to the background, we can reduce the
total cost of deleting a lot of files by performing the file deletions
in disk order instead of directory entry order, which can be arbitrary.

We introduce two new inode flags -- NEEDS_INACTIVE and INACTIVATING.
The first flag helps our worker find inodes needing inactivation, and
the second flag marks inodes that are in the process of being
inactivated.  A concurrent xfs_iget on the inode can still resurrect the
inode by clearing NEEDS_INACTIVE (or bailing if INACTIVATING is set).
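
The flag handshake described above can be sketched as a small userspace
model. This is only an illustration, not the patch's code: the
model_* names and bit values are hypothetical, there is no locking, and
whether the real worker clears NEEDS_INACTIVE at the moment it sets
INACTIVATING is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only the flag names come from the patch text. */
enum {
	MODEL_NEEDS_INACTIVE = 1u << 0,	/* queued for the background worker */
	MODEL_INACTIVATING   = 1u << 1,	/* worker has started on this inode */
};

struct model_inode {
	unsigned int flags;
};

/* Last reference dropped: mark the inode so the worker can find it. */
static void model_mark_needs_inactive(struct model_inode *ip)
{
	assert(!(ip->flags & MODEL_INACTIVATING));
	ip->flags |= MODEL_NEEDS_INACTIVE;
}

/*
 * Worker side: claim a queued inode. Returns false if there is nothing
 * to claim. Clearing NEEDS_INACTIVE here is a modelling assumption.
 */
static bool model_worker_claim(struct model_inode *ip)
{
	if (!(ip->flags & MODEL_NEEDS_INACTIVE))
		return false;
	ip->flags &= ~MODEL_NEEDS_INACTIVE;
	ip->flags |= MODEL_INACTIVATING;
	return true;
}

/*
 * Lookup side: resurrect an inode that is only queued, but bail out if
 * the worker has already started inactivating it.
 */
static bool model_iget(struct model_inode *ip)
{
	if (ip->flags & MODEL_INACTIVATING)
		return false;
	ip->flags &= ~MODEL_NEEDS_INACTIVE;
	return true;
}
```

In the model, a lookup that arrives before the worker gets the inode
back with no flags set, while a lookup that arrives after the claim
fails.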

Unfortunately, deferring the inactivation has one huge downside --
eventual consistency.  Since all the freeing is deferred to a worker
thread, one can rm a file but the space doesn't come back immediately.
This can cause some odd side effects with quota accounting and statfs,
so we flush inactivation work during syncfs in order to maintain the
existing behaviors, at least for callers that unlink() and sync().

For this patch we'll set the delay to zero to mimic the old timing as
much as possible; in the next patch we'll play with different delay
settings.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>

.....

+
+/* Disable the inode inactivation background worker and wait for it to stop. */
+void
+xfs_inodegc_stop(
+	struct xfs_mount	*mp)
+{
+	if (!test_and_clear_bit(XFS_OPFLAG_INODEGC_RUNNING_BIT, &mp->m_opflags))
+		return;
+
+	cancel_delayed_work_sync(&mp->m_inodegc_work);
+	trace_xfs_inodegc_stop(mp, __return_address);
+}

FWIW, this introduces a new mount field that does the same thing as the
m_opstate field I added in my feature flag cleanup series (i.e.
atomic operational state changes).  Personally I much prefer my
opstate stuff because this is state, not flags, and the namespace is
much less verbose...
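
For readers unfamiliar with the pattern under discussion, the "atomic
operational state change" in xfs_inodegc_stop() can be modelled in
userspace with C11 atomics. This is a rough sketch only: the model_*
names, the bit value, and the cancel counter are hypothetical stand-ins,
and neither m_opflags nor m_opstate is reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state bit standing in for the "inodegc running" state. */
#define MODEL_INODEGC_RUNNING	(1ul << 0)

struct model_mount {
	atomic_ulong	opstate;	/* operational state bits */
	int		cancel_calls;	/* how many times work was "cancelled" */
};

/* Atomically clear the bits in mask; report whether any of them were set. */
static bool model_test_and_clear(atomic_ulong *word, unsigned long mask)
{
	unsigned long old = atomic_fetch_and(word, ~mask);

	return (old & mask) != 0;
}

/* Only the caller that actually flips the state does the teardown. */
static void model_inodegc_stop(struct model_mount *mp)
{
	if (!model_test_and_clear(&mp->opstate, MODEL_INODEGC_RUNNING))
		return;

	/* Stand-in for cancelling the delayed work and waiting for it. */
	mp->cancel_calls++;
}
```

Calling model_inodegc_stop() twice in a row performs the teardown once;
the second call sees the bit already clear and returns early.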

Yes, well, is that ready to go?  Like, right /now/?  I already bolted
the quotaoff scrapping patchset on the front, after reworking the ENOSPC
retry loops and reworking quota apis before that...

Should be - that's why it's in my patch stack getting tested. But I
wasn't suggesting that you need to put it in first, just trying to
give you the heads up that there's a substantial conflict between
that and this patchset.

There are also conflicts all over the place because of that. All the
RO checks are busted,

Can we focus on /this/ patchset, then?  What specifically is broken
about the ro checking in it?

Sorry, I wasn't particularly clear about that. What I meant was that
stuff like all the new RO and shutdown checks in this patchset don't
give conflicts but instead cause compilation failures. So the merge
isn't just a case of fixing conflicts, the code doesn't compile
(i.e. it is busted) after fixing all the reported merge conflicts.

And since the shrinkers are always a source of amusement, what /is/ up
with it?  I don't really like having to feed it magic numbers just to
get it to do what I want, which is ... let it free some memory in the
first round, then we'll kick the background workers when the priority
bumps (er, decreases), and hope that's enough not to OOM the box.

Well, the shrinkers are not intended as a one-shot memory pressure
notification as you are trying to use them. They are intended to be
told the amount of work that needs to be done to free memory, and
then they calculate how much of that work should be done based on
their idea of the current level of memory pressure.

One shot shrinker triggers never tend to work well because they
treat all memory pressure the same - very light memory pressure is
dealt with by the same big hammer that deals with OOM levels of
memory pressure.
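
To make the difference concrete, here is a toy userspace model of the
two behaviours. It is a sketch under stated assumptions, not the
kernel's shrinker code: the model_* names are hypothetical, priority is
taken to run from 12 (very light pressure) down to 0 (about to OOM), and
the simple right shift is a stand-in for whatever scaling reclaim
actually applies.

```c
#include <assert.h>

/* Hypothetical priority range: 12 is lightest pressure, 0 is heaviest. */
#define MODEL_DEF_PRIORITY	12

/*
 * Proportional model: the cache reports how much it could free, and the
 * amount of work requested grows as the priority value drops.
 */
static unsigned long model_scan_proportional(unsigned long freeable,
					     int priority)
{
	assert(priority >= 0 && priority <= MODEL_DEF_PRIORITY);
	return freeable >> priority;
}

/*
 * One-shot model: any notification at all triggers the same full flush,
 * regardless of how much pressure there really is.
 */
static unsigned long model_scan_one_shot(unsigned long freeable,
					 int priority)
{
	(void)priority;
	return freeable;
}
```

With 4096 freeable objects, the proportional model asks for 1 object at
priority 12 and all 4096 at priority 0, whereas the one-shot model asks
for all 4096 in both cases.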

As it is, I'm more concerned right now with finding out why there's
such large performance regressions in highly concurrent recursive
chmod/unlink workloads. I spent most of Friday looking at it trying
to work out what behaviour was causing the regression, but I haven't
isolated it yet. I suspect that it is lockstepping between user
processes and background inactivation for the unlink - I'm seeing
the backlink rhashtable show up in the profiles which indicates the
unlinked list lengths are an issue and we're lockstepping the AGI.
It also may simply be that there is too much parallelism hammering
the transaction subsystem now....

IOWs, I'm basically going to have to pull this apart patch by patch
to tease out where the behaviours go wrong and see if there are ways
to avoid and mitigate those behaviours.  Hence I haven't even got to
the shrinker/oom considerations yet; there's a bigger performance
issue that needs to be understood first. It may be that they are
related, but right now we need to know why recursive chmod is
saw-toothing (it's not a lack of log space!) and concurrent unlink
throughput has dropped by half...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
