Re: [PATCH, alternative v2] xfs: per-cpu deferred inode inactivation queues

From: Dave Chinner <david@fromorbit.com>
Date: 2021-08-04 21:35:31

On Wed, Aug 04, 2021 at 08:59:52AM -0700, Darrick J. Wong wrote:

On Wed, Aug 04, 2021 at 09:09:16PM +1000, Dave Chinner wrote:

quoted

On Tue, Aug 03, 2021 at 08:20:30PM -0700, Darrick J. Wong wrote:

quoted

For everyone else following along at home, I've posted the current draft
version of this whole thing in:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=deferred-inactivation-5.15

Overall looks good - fixes to freeze problems I hit are found
in other replies to this.

I omitted the commits:

xfs: queue inodegc worker immediately when memory is tight
xfs: throttle inode inactivation queuing on memory reclaim

in my test kernel because I think they are unnecessary.

I think the first is unnecessary because reclaim of inodes from the
VFS is usually in large batches and so early triggers aren't
desirable when we're getting thousands of inodes being evicted by
the superblock shrinker at a time. If we've only got a handful of
inodes queued, then inactivating them early isn't going to make much
of an impact on free memory. I could be wrong, but so far I have no
evidence that expediting inactivation is necessary.

I think this was a lot more necessary under the old design because I let
the number of tagged inodes grow quite large before triggering gc work,
much less throttling anything.  256 is low enough that it should be
manageable.

But the next patch in the series prevents the shrinkers from
blocking on the hard throttle, yes? So the hard limit throttling
queuing isn't something memory reclaim relies on, either. What will
have an impact is the cond_resched() we place in shrinker execution.
Whenever the VFS inode reclaim eviction list processing hits one of
those and we have a queued deferred work, it will switch away from
the shrinker to run inactivation on that CPU.

So, in reality, we are still throttling and blocking direct reclaim
with the deferred processing. It's just that we are doing it
implicitly at defined reschedule points in the shrinker rather than
doing it directly inline by blocking during a modification
transaction. This also means that if inactivation does block, the
reclaim process can keep running and queuing/reclaiming more ex-VFS
inodes. IOWs, running inactivation like this should help improve
reclaim behaviour and reduce reclaim scan latencies without having
reclaim run out of control...

Does it matter that we no longer inactivate inodes in inode number
order?  I guess it could be nice to be able to dump inode cluster
buffers as soon as practicable, but OTOH I suspect that only matters for
the case of mass deletion, in which case we'll probably catch up soon
enough?

Anyway, I'll try turning both of these off with my silly deltree
exerciser and see what happens.

I haven't seen anything that makes it necessary, so in the interests of
simplifying this as much as possible, I want to remove this stuff.
We can always add it back in (easily) if something turns up and we
find this is the cause.

quoted

The second patch is the custom shrinker. Again, I just don't think
this is necessary because if there is any amount of inactivation of
evicted inodes needed due to reclaim, we'll already be triggering it
to run via the deferred queue flush thresholds. Hence we don't
really need any mechanism to tell us that there is memory pressure;
the deferred work reacts to eviction from reclaim in exactly the
same way it reacts to eviction from unlink....
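To make the argument above concrete, here is a toy userspace model of a per-CPU deferred queue that flushes once a threshold is reached. It is only a sketch: the class and method names are hypothetical, the default threshold of 256 is borrowed from the figure mentioned earlier in this thread, and none of this is the kernel code from the patch. What it illustrates is that the flush trigger depends only on queue depth, so inodes arriving from reclaim and inodes arriving from unlink are handled identically.

```python
from collections import deque

# Assumption: 256 is the queue depth quoted earlier in the thread; the real
# patch may use a different trigger.
FLUSH_THRESHOLD = 256


class PerCpuInodeGC:
    """Hypothetical model of per-CPU deferred inactivation queues."""

    def __init__(self, ncpus, threshold=FLUSH_THRESHOLD):
        self.queues = [deque() for _ in range(ncpus)]
        self.threshold = threshold
        self.inactivated = []  # records (inode number, source) in flush order

    def queue(self, cpu, ino, source):
        """Queue an evicted inode; return True if this triggered a flush."""
        q = self.queues[cpu]
        q.append((ino, source))
        # The trigger looks only at queue depth, never at why the inode
        # was evicted ("reclaim" and "unlink" take the same path).
        if len(q) >= self.threshold:
            self.flush(cpu)
            return True
        return False

    def flush(self, cpu):
        """Drain one CPU's queue, standing in for the deferred worker."""
        q = self.queues[cpu]
        while q:
            self.inactivated.append(q.popleft())
```

With a small threshold, for example `PerCpuInodeGC(ncpus=2, threshold=4)`, the fourth call to `queue()` on a given CPU drains that CPU's queue whether the caller passes `"reclaim"` or `"unlink"` as the source.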

Yep.  I came to the same conclusion last night; it looks like my fast
fstests setup for that passed.

quoted

I've been running the patchset without these two patches on my 512MB
test VM, and the only OOM kill I get from fstests is g/531. This is
the "many open-but-unlinked" test, which creates 50,000 open
unlinked files per CPU. So for this test VM which has 4 CPUs, that's
200,000 open, dirty iunlinked inodes and a lot of pinned inode
cluster buffers. At ~2kB of memory per unlinked inode (ignoring the
cluster buffers) this would consume about 400MB of the 512MB of RAM
the VM has. It OOM kills the test programs that hold the open files
long before it gets to 200,000 files, so this test never passed
before this patchset on this machine...
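The arithmetic behind that estimate can be checked with a few lines. This is a back-of-the-envelope sketch only: the function name is made up, 1kB is taken as 1000 bytes, and the ~2kB per-inode figure is the rough number quoted above, not a measured value.

```python
def unlinked_inode_footprint(files_per_cpu, cpus, bytes_per_inode=2000):
    """Return (total inodes, approximate memory in MB), ignoring cluster buffers.

    bytes_per_inode defaults to the ~2kB estimate from the thread
    (an approximation, with 1kB taken as 1000 bytes).
    """
    total_inodes = files_per_cpu * cpus
    total_mb = total_inodes * bytes_per_inode / 1_000_000
    return total_inodes, total_mb


inodes, mb = unlinked_inode_footprint(files_per_cpu=50_000, cpus=4)
print(f"{inodes} inodes, about {mb:.0f}MB of a 512MB VM")
```

For the 4-CPU VM described above this gives 200,000 inodes and roughly 400MB, which leaves very little of the 512MB for everything else.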

Yeah... I actually tried running fstests on a 512M VM and whooeee did I
see a lot of OOM kills.  Clearly we've all gotten spoiled by cheap DRAM.

fstests does a lot of stuff that requires memory to complete. The
filesystem itself will run in much less RAM, but it's stuff like
requiring caching of hundreds of MB of inodes when you don't have
hundreds of MB of RAM that causes the problems.

I will note g/531 does try to limit the number of open files based
on /proc/sys/fs/file-max, but as we found out last night on #xfs,
systemd now unconditionally sets that to 2^63 - 1 and so it breaks
any attempt to size the fileset from the kernel's default file-max
setting, which is derived from RAM size...
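A small sketch of why that cap stops working. This is not the logic g/531 actually uses; the helper name and the simple `min()` cap are assumptions made purely for illustration. The only facts taken from the thread are the 50,000 files per CPU and the 2^63 - 1 value that systemd writes to file-max.

```python
SYSTEMD_FILE_MAX = 2**63 - 1  # value reported in the thread


def capped_open_files(files_per_cpu, cpus, file_max):
    """Hypothetical cap: never try to open more files than file-max allows."""
    wanted = files_per_cpu * cpus
    return min(wanted, file_max)


# With a small, RAM-derived file-max (the 40,000 here is an invented example
# value) the cap shrinks the fileset...
print(capped_open_files(50_000, 4, file_max=40_000))
# ...but with systemd's setting the cap never applies.
print(capped_open_files(50_000, 4, file_max=SYSTEMD_FILE_MAX))
```

Any cap of this shape becomes a no-op once file-max is effectively unbounded, so the test ends up sized by CPU count alone.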

quoted

I have a couple of extra patches to set up per-cpu hotplug
infrastructure before the deferred inode inactivation patch - I'll
post them after I finish this email. I'm going to leave it running
tests overnight.

Ok, I'll jam those on the front end of the series.

quoted

Darrick, I'm pretty happy with the way the patchset is behaving now.
If you want to fold in the bug fixes I've posted and add in
the hotplug patches, then I think it's ready to be posted in full
again (if it all passes your testing) for review.

It's probably about time for that.  Now that we do percpu thingies, I
think it might also be time for a test that runs fstests while plugging
and unplugging the non-bsp processors.

Yeah, I haven't tested the CPU dead notification much at all. It
should work, but...

[narrator: ...and thus he unleashed another terrifying bug mountain]

... yeah, this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
