Re: [PATCH 2/3] writeback: allow for dirty metadata accounting

From: Dave Chinner <david@fromorbit.com>
Date: 2016-09-12 23:01:52
Also in: linux-btrfs, linux-fsdevel

On Mon, Sep 12, 2016 at 10:24:12AM -0400, Josef Bacik wrote:

Dave, your reply got eaten somewhere along the way for me, so all I
have is this email.  I'm going to respond to your stuff here.

No worries, I'll do a 2-in-1 reply :P

On 09/12/2016 03:34 AM, Jan Kara wrote:

On Mon 12-09-16 10:46:56, Dave Chinner wrote:

On Fri, Sep 09, 2016 at 10:17:43AM +0200, Jan Kara wrote:

On Mon 22-08-16 13:35:01, Josef Bacik wrote:

Provide a mechanism for file systems to indicate how much dirty metadata they
are holding.  This introduces a few things:

1) Zone stats for dirty metadata, which is the same as the NR_FILE_DIRTY.
2) WB stat for dirty metadata.  This way we know if we need to try and call into
the file system to write out metadata.  This could potentially be used in the
future to make balancing of dirty pages smarter.

So I'm curious about one thing: In the previous posting you have mentioned
that the main motivation for this work is to have a simple support for
sub-pagesize dirty metadata blocks that need tracking in btrfs. However you
do the dirty accounting at page granularity. What are your plans to handle
this mismatch?

The thing is you actually shouldn't miscount by too much, as that could
upset some checks in mm that look at how many dirty pages a node has to
direct how reclaim should be done... But it's a question whether
NR_METADATA_DIRTY should actually be used in the checks in node_limits_ok()
or in node_pagecache_reclaimable() at all, because once you start accounting
dirty slab objects, you are really on thin ice...

The other thing I'm concerned about is that it's a btrfs-only thing,
which means having dirty btrfs metadata on a system with different
filesystems (e.g. btrfs root/home, XFS data) is going to affect how
memory balance and throttling is run on other filesystems. i.e. it's
going to make a filesystem specific issue into a problem that
affects global behaviour.

Umm, I don't think it will be different than currently. Currently btrfs
dirty metadata is accounted as dirty page cache because they have this
virtual fs inode which owns all metadata pages.

Yes, so effectively they are treated the same as file data pages
w.r.t. throttling, writeback and reclaim....

It is pretty similar to
e.g. ext2 where you have bdev inode which effectively owns all metadata
pages and these dirty pages account towards the dirty limits. For ext4
things are more complicated due to journaling and thus ext4 hides the fact
that a metadata page is dirty until the corresponding transaction is
committed.  But from that moment on dirty metadata is again just a dirty
pagecache page in the bdev inode.

Yeah, though those filesystems don't suffer from the uncontrolled
explosion of metadata that btrfs is suffering from, so simply
treating them as another dirty inode that needs flushing works just
fine.

So Josef's current patch just splits the counter in which btrfs metadata
pages would be accounted but effectively there should be no change in the
behavior.

Yup, I missed the addition to the node_pagecache_reclaimable() that
ensures reclaim sees the same number of dirty pages...

It is just a question whether this approach is workable in the
future when they'd like to track different objects than just pages in the
counter.

I don't think it can. Because the counters directly influence the
page lru reclaim scanning algorithms, it can only be used to
account for pages that are in the LRUs. Other objects like slab
objects need to be accounted for and reclaimed by the shrinker
infrastructure.

Accounting for metadata writeback is a different issue - it could
track slab objects if we wanted to, but it is often difficult to
determine the amount of IO needed to clean them, so generic
balancing is hard (e.g. the effect of inode write clustering).

+1 to what Jan said.  Btrfs's dirty metadata is always going to
affect any other file systems in the system, no matter how we deal
with it.  In fact it's worse with our btree_inode approach as the
dirtied_when thing will likely screw somebody and make us skip
writing out dirty metadata when we want to.

XFS takes care of metadata flushing with a periodic background work
controlled by /proc/sys/fs/xfs/xfssyncd_centisecs. We trigger both
background async inode reclaim and background dirty metadata
flushing from this (run on workqueues) if the system is idle or
hasn't had some other trigger fire to run these sooner.  It works
well enough that I can't remember the last time someone asked a
question about needing to tune this parameter, or had a problem that
required tuning it to fix....

At least with this
framework in place we can start to make the throttling smarter, so
say make us flush metadata if that is the bigger % of the dirty
pages in the system.  All I do now is move the status quo around, we
are no worse, and arguably better with these patches than we were
without them.

Agreed - I misread part of the patch, so I was a little off-target.

OK, thanks for providing the details about XFS. So my idea was (and Josef's
patches seem to be working towards this), that filesystems that decide to
use the generic throttling, would just account the dirty metadata in some
counter. That counter would be included in the limits checked by
balance_dirty_pages(). Filesystems that don't want to use generic throttling
would have the counter 0 and thus for them there'd be no difference. And in
appropriate points after metadata was dirtied, filesystems that care could
call balance_dirty_pages() to throttle the process creating dirty
metadata.
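
The scheme described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct fs_dirty_stats`, `total_dirty_pages()` and `over_dirty_limit()` are hypothetical names invented for illustration, not kernel interfaces, and the real balance_dirty_pages() logic is considerably more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-filesystem counters; not the kernel's data structures. */
struct fs_dirty_stats {
    unsigned long dirty_data_pages;     /* dirty file data, in pages */
    unsigned long dirty_metadata_pages; /* stays 0 for filesystems that opt out */
};

/* Sum data and metadata over every filesystem, as one global check would. */
unsigned long total_dirty_pages(const struct fs_dirty_stats *fs, size_t n)
{
    unsigned long total = 0;

    assert(fs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += fs[i].dirty_data_pages + fs[i].dirty_metadata_pages;
    return total;
}

/* True when a task that dirties more pages should be throttled. */
bool over_dirty_limit(const struct fs_dirty_stats *fs, size_t n,
                      unsigned long dirty_limit_pages)
{
    return total_dirty_pages(fs, n) > dirty_limit_pages;
}
```

A filesystem that leaves `dirty_metadata_pages` at zero contributes only its data pages, so the check behaves for it exactly as it would without the extra counter.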

I can see how tracking of information such as the global amount of
dirty metadata is useful for diagnostics, but I'm not convinced we
should be using it for globally scoped external control of deeply
integrated and highly specific internal filesystem functionality.

You are right that when journalling comes in, things get more complex. But
btrfs already uses a scheme similar to the above and I believe ext4 could
be made to use a similar scheme as well. If you have something better for
XFS, then that's good for you and we should make sure we don't interfere
with that. Currently I don't think we will.

XFS doesn't have the problem that btrfs has, so I don't expect it to
take advantage of this.  Your writeback throttling is tied to your
journal transactions, which are already limited in size.  Btrfs on
the other hand is only limited by the amount of memory in the
system, which is why I want to leverage the global throttling code.
Btrfs doesn't have the benefit of an arbitrary journal size
constraint on its dirty metadata, so we have to rely on the global
limits to make sure we're kept in check.

Though you do have a transaction reservation system, so I would
expect that to be the place where dirty metadata throttling could be
applied sanely. i.e. if there's too much dirty metadata in the
system, then you can't get a reservation to dirty more until some
of the metadata has been cleaned.
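
A minimal sketch of throttling at reservation time, as suggested above, might look like the following. It is a toy model and not btrfs code: `struct meta_budget` and the three helpers are hypothetical, and a real implementation would block the caller and kick writeback instead of simply returning false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-filesystem metadata budget, in bytes. */
struct meta_budget {
    unsigned long limit;    /* ceiling on dirty + reserved metadata */
    unsigned long dirty;    /* metadata dirtied but not yet written back */
    unsigned long reserved; /* reservations granted but not yet used */
};

/* Grant a reservation only if it keeps the total at or under the ceiling. */
bool meta_reserve(struct meta_budget *b, unsigned long bytes)
{
    unsigned long in_use = b->dirty + b->reserved;

    if (bytes > b->limit || in_use > b->limit - bytes)
        return false; /* caller has to wait until metadata is cleaned */
    b->reserved += bytes;
    return true;
}

/* A transaction dirties metadata it previously reserved space for. */
void meta_dirty(struct meta_budget *b, unsigned long bytes)
{
    assert(bytes <= b->reserved);
    b->reserved -= bytes;
    b->dirty += bytes;
}

/* Writeback completion returns budget to the pool. */
void meta_cleaned(struct meta_budget *b, unsigned long bytes)
{
    assert(bytes <= b->dirty);
    b->dirty -= bytes;
}
```

The point of the design is that the limit is enforced before anything is dirtied, so the amount of dirty metadata is bounded by the budget and not by the amount of memory in the system.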

The only thing my patches
do is allow us to account for this separately and trigger writeback
specifically.  Thanks,

I'm not sure that global metadata accounting makes sense if there is
no mechanism provided for per-sb throttling based purely on metadata
demand. Sure, global dirty metadata might be a way of measuring how
much metadata all the active filesystems have created, but then
you're going to throttle filesystems with no dirty metadata because
some other filesystem is being a massive hog.

I suspect that for a well balanced system we are going to need some
kind of "in-between" solution that shares the dirty metadata limits
across all bdis somewhat fairly, similar to file data. How that ties
into the bdi writeback rate estimates and other IO/dirty page
throttling feedback loops is an interesting question - to me this
isn't as obviously simple as simply separating out metadata
accounting...
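
One way to picture such an "in-between" solution is a global metadata limit divided between bdis in proportion to some per-bdi weight. The sketch below is an assumption-laden illustration only: `bdi_meta_share()` is a hypothetical helper, and the choice of weight (for instance a recent writeout rate per bdi) is exactly the open question raised above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: share of global_limit given to bdi i, proportional
 * to weights[i]. Falls back to an equal split when every weight is zero.
 * Assumes global_limit * weights[i] fits in an unsigned long long.
 */
unsigned long bdi_meta_share(unsigned long global_limit,
                             const unsigned long *weights, size_t n, size_t i)
{
    unsigned long long total = 0;

    assert(n > 0 && i < n);
    for (size_t k = 0; k < n; k++)
        total += weights[k];
    if (total == 0)
        return global_limit / n;
    return (unsigned long)((unsigned long long)global_limit * weights[i] / total);
}
```

With a split like this, a filesystem with no metadata activity keeps a small or zero share but is not throttled because another filesystem has filled its own share.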

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
