Re: [PATCH 0/4] btrfs: shrink delalloc fixes

From: Josef Bacik <josef@toxicpanda.com>
Date: 2021-06-14 18:39:14

On 6/14/21 7:13 AM, Johannes Thumshirn wrote:

On 11/06/2021 15:54, Josef Bacik wrote:

quoted

Hello,

I backported the patch to switch us to using sync_inode() to our kernel inside
Facebook to fix a deadlock when using the async delalloc shrinker threads.  This
uncovered a bunch of problems with how we shrink delalloc, as we use -o
compress-force, and thus everything goes through the async compression threads.

I ripped out the async pages stuff because originally I had switched us to just
writing the whole inode.  This caused a performance regression, and so I
switched us to calling sync_inode() twice to handle the async extent case.  The
problem is that sync_inode() can skip writing inodes sometimes, and thus we
weren't properly waiting on the async helpers.  We really do need to wait for
the async delalloc pages to go down before continuing to shrink delalloc.  There
was also a race in how we woke up the async delalloc pages waiter which could be
problematic.

And then finally there is our use of sync_inode().  It tries to be too clever
for us, when in reality we want to make sure all pages are under writeback
before we come back to the shrinking loop.  I've added a small helper to give us
this flexibilty and have switched us to that helper.

With these patches, and others that will be sent separately, the early ENOSPC
problems we were experiencing have been eliminated.  Thanks,

Josef Bacik (4):
   btrfs: wait on async extents when flushing delalloc
   btrfs: wake up async_delalloc_pages waiters after submit
   fs: add a filemap_fdatawrite_wbc helper
   btrfs: use the filemap_fdatawrite_wbc helper for delalloc shrinking

  fs/btrfs/inode.c      | 16 ++++++----------
  fs/btrfs/space-info.c | 33 +++++++++++++++++++++++++++++++++
  include/linux/fs.h    |  2 ++
  mm/filemap.c          | 29 ++++++++++++++++++++++++-----
  4 files changed, 65 insertions(+), 15 deletions(-)

I've ran this through xfstests on a zoned device and got the following lockdep
splat. I'm not sure if this is your fault or my fault. btrfs_reclaim_bgs_work()
very much sounds like my fault. Writing it here for completeness anyways.
Otherwise no noticeable deviations from the baseline.

[  634.999371] BTRFS info (device nullb1): reclaiming chunk 3489660928 with 22% used
[  634.999445] BTRFS info (device nullb1): relocating block group 3489660928 flags metadata
                                                                                            
[  642.127246] ======================================================
[  642.127876] WARNING: possible circular locking dependency detected
[  642.128526] 5.13.0-rc5-josef-delalloc #1080 Not tainted
[  642.129060] ------------------------------------------------------
[  642.129699] kworker/u4:5/11096 is trying to acquire lock:
[  642.130259] ffff888278839438 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.131321]
                but task is already holding lock:
[  642.131921] ffff8882bdc45cf8 (btrfs-tree-01#2/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.132926]
                which lock already depends on the new lock.
                                                                                                                                                                                       
[  642.133767]
                the existing dependency chain (in reverse order) is:
[  642.134529]
                -> #2 (btrfs-tree-01#2/1){+.+.}-{3:3}:
[  642.135181]        down_write_nested+0x23/0x60
[  642.135647]        __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.136238]        btrfs_init_new_buffer+0x7d/0x2a0 [btrfs]
[  642.136844]        btrfs_alloc_tree_block+0x113/0x320 [btrfs]
[  642.137460]        alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
[  642.138115]        __btrfs_cow_block+0x13b/0x5e0 [btrfs]
[  642.138688]        btrfs_cow_block+0x114/0x220 [btrfs]
[  642.139296]        btrfs_search_slot+0x587/0x950 [btrfs]
[  642.139907]        btrfs_lookup_inode+0x2a/0x90 [btrfs]
[  642.140501]        __btrfs_update_delayed_inode+0x80/0x2d0 [btrfs]
[  642.141177]        btrfs_async_run_delayed_root+0x174/0x240 [btrfs]
[  642.141878]        btrfs_work_helper+0xfe/0x530 [btrfs]
[  642.142454]        process_one_work+0x24f/0x570
[  642.142931]        worker_thread+0x4f/0x3e0
[  642.143357]        kthread+0x12c/0x150
[  642.143748]        ret_from_fork+0x22/0x30
[  642.144181]
                -> #1 (btrfs-tree-01#2){++++}-{3:3}:
[  642.144807]        down_write_nested+0x23/0x60
[  642.145260]        __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.145839]        btrfs_search_slot+0x259/0x950 [btrfs]
[  642.146413]        do_relocation+0xf9/0x5d0 [btrfs]
[  642.146960]        relocate_tree_blocks+0x293/0x610 [btrfs]
[  642.147579]        relocate_block_group+0x1f2/0x560 [btrfs]
[  642.148201]        btrfs_relocate_block_group+0x16c/0x320 [btrfs]
[  642.148869]        btrfs_relocate_chunk+0x38/0x120 [btrfs]
[  642.149479]        btrfs_reclaim_bgs_work.cold+0x5d/0x10f [btrfs]
[  642.150162]        process_one_work+0x24f/0x570
[  642.150628]        worker_thread+0x4f/0x3e0
[  642.151062]        kthread+0x12c/0x150
[  642.151450]        ret_from_fork+0x22/0x30
[  642.151872]
                -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
[  642.152518]        __lock_acquire+0x1235/0x2320
[  642.152985]        lock_acquire+0xab/0x340
[  642.153409]        down_write_nested+0x23/0x60
[  642.153872]        __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.154455]        btrfs_lock_root_node+0x31/0x40 [btrfs]
[  642.155064]        btrfs_search_slot+0x6bc/0x950 [btrfs]
[  642.155639]        replace_path+0x56f/0xa30 [btrfs]
[  642.156181]        merge_reloc_root+0x258/0x600 [btrfs]
[  642.156773]        merge_reloc_roots+0xdd/0x210 [btrfs]
[  642.157368]        relocate_block_group+0x2c9/0x560 [btrfs]
[  642.157986]        btrfs_relocate_block_group+0x16c/0x320 [btrfs]
[  642.158660]        btrfs_relocate_chunk+0x38/0x120 [btrfs]
[  642.159272]        btrfs_reclaim_bgs_work.cold+0x5d/0x10f [btrfs]
[  642.159951]        process_one_work+0x24f/0x570
[  642.160418]        worker_thread+0x4f/0x3e0
[  642.160850]        kthread+0x12c/0x150
[  642.161243]        ret_from_fork+0x22/0x30
[  642.161664]
                other info that might help us debug this:

[  642.162480] Chain exists of:
                  btrfs-treloc-02#2 --> btrfs-tree-01#2 --> btrfs-tree-01#2/1

[  642.163616]  Possible unsafe locking scenario:

[  642.164226]        CPU0                    CPU1
[  642.164699]        ----                    ----
[  642.165164]   lock(btrfs-tree-01#2/1);
[  642.165553]                                lock(btrfs-tree-01#2);
[  642.166183]                                lock(btrfs-tree-01#2/1);
[  642.166820]   lock(btrfs-treloc-02#2);
[  642.167209]
                 *** DEADLOCK ***

[  642.167812] 6 locks held by kworker/u4:5/11096:
[  642.168288]  #0: ffff8881000c4938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c6/0x570
[  642.169317]  #1: ffffc900046e3e78 ((work_completion)(&fs_info->reclaim_bgs_work)){+.+.}-{0:0}, at: process_one_work+0x1c6/0x570
[  642.170507]  #2: ffff888136c6a0a0 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_reclaim_bgs_work+0x5d/0x1b0 [btrfs]
[  642.171687]  #3: ffff888136c68838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x164/0x320 [btrfs]
[  642.172877]  #4: ffff88811d11f620 (sb_internal){.+.+}-{0:0}, at: merge_reloc_root+0x178/0x600 [btrfs]
[  642.173871]  #5: ffff8882bdc45cf8 (btrfs-tree-01#2/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.174906]
[  642.175362] CPU: 0 PID: 11096 Comm: kworker/u4:5 Not tainted 5.13.0-rc5-josef-delalloc #1080
[  642.176215] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
[  642.177366] Workqueue: events_unbound btrfs_reclaim_bgs_work [btrfs]
[  642.178065] Call Trace:
[  642.178323]  dump_stack+0x6d/0x89
[  642.178670]  check_noncircular+0xcf/0xf0
[  642.179085]  __lock_acquire+0x1235/0x2320
[  642.179509]  lock_acquire+0xab/0x340
[  642.179889]  ? __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.180442]  ? find_held_lock+0x2b/0x80
[  642.180843]  ? btrfs_root_node+0x93/0x1d0 [btrfs]
[  642.181359]  down_write_nested+0x23/0x60
[  642.181781]  ? __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.182327]  __btrfs_tree_lock+0x28/0x120 [btrfs]
[  642.182862]  btrfs_lock_root_node+0x31/0x40 [btrfs]
[  642.183411]  btrfs_search_slot+0x6bc/0x950 [btrfs]
[  642.183940]  ? release_extent_buffer+0x111/0x160 [btrfs]
[  642.184526]  replace_path+0x56f/0xa30 [btrfs]
[  642.185019]  merge_reloc_root+0x258/0x600 [btrfs]
[  642.185557]  merge_reloc_roots+0xdd/0x210 [btrfs]
[  642.186087]  relocate_block_group+0x2c9/0x560 [btrfs]
[  642.186658]  btrfs_relocate_block_group+0x16c/0x320 [btrfs]
[  642.187283]  btrfs_relocate_chunk+0x38/0x120 [btrfs]
[  642.187839]  btrfs_reclaim_bgs_work.cold+0x5d/0x10f [btrfs]
[  642.188472]  process_one_work+0x24f/0x570
[  642.188894]  worker_thread+0x4f/0x3e0
[  642.189273]  ? process_one_work+0x570/0x570
[  642.189714]  kthread+0x12c/0x150
[  642.190053]  ? __kthread_bind_mask+0x60/0x60
[  642.190492]  ret_from_fork+0x22/0x30
[  644.827436] BTRFS info (device nullb1): found 3645 extents, stage: move data extents
[  645.132633] BTRFS info (device nullb1): reclaiming chunk 4026531840 with 22% used

This doesn't seem to be either of our faults, unless you changed the locking in 
replace_path.  The locking order is

reloc root -> real root

I'm actually not sure how this happens looking at the code, we clearly come in 
with the path pointing at the reloc root, and then we by hand walk down the real 
root.  At the point that we unlock the path and search down the real root again 
we shouldn't be holding any more reloc root things.  So this requires more 
investigation, because something fishy is happening.  Thanks,

Josef

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help