Thread (23 messages) 23 messages, 7 authors, 2023-08-28

Re: 6.5.0rc5 fs hang - ext4? raid?

From: Dr. David Alan Gilbert <hidden>
Date: 2023-08-16 01:32:32
Also in: linux-ext4, lkml

(Copying in Christoph and Jens)

* Dr. David Alan Gilbert (dave@treblig.org) wrote:
* Dr. David Alan Gilbert (dave@treblig.org) wrote:
quoted
* Theodore Ts'o (tytso@mit.edu) wrote:
quoted
On Mon, Aug 14, 2023 at 09:02:53PM +0000, Dr. David Alan Gilbert wrote:
quoted
dg         29594   29592  0 18:40 pts/0    00:00:00 /usr/bin/ar --plugin /usr/libexec/gcc/x86_64-redhat-linux/13/liblto_plugin.so -csrDT src/intel/perf/libintel_perf.a src/intel/perf/libintel_perf.a.p/meson-generated_.._intel_perf_metrics.c.o src/intel/perf/libintel_perf.a.p/intel_perf.c.o src/intel/perf/libintel_perf.a.p/intel_perf_query.c.o src/intel/perf/libintel_perf.a.p/intel_perf_mdapi.c.o

[root@dalek dg]# cat /proc/29594/stack 
[<0>] md_super_wait+0xa2/0xe0
[<0>] md_bitmap_unplug+0xd2/0x120
[<0>] flush_bio_list+0xf3/0x100 [raid1]
[<0>] raid1_unplug+0x3b/0xb0 [raid1]
[<0>] __blk_flush_plug+0xd7/0x150
[<0>] blk_finish_plug+0x29/0x40
[<0>] ext4_do_writepages+0x401/0xc90
[<0>] ext4_writepages+0xad/0x180
If you want a few seconds and try grabbing cat /proc/29594/stack
again, what does the stack trace stay consistent as above?
I'll get back to that and retry it.
Yeh, the stack is consistent; this time around it's an 'ar' in a kernel
build:

[root@dalek dg]# cat /proc/17970/stack
[<0>] md_super_wait+0xa2/0xe0
[<0>] md_bitmap_unplug+0xad/0x120
[<0>] flush_bio_list+0xf3/0x100 [raid1]
[<0>] raid1_unplug+0x3b/0xb0 [raid1]
[<0>] __blk_flush_plug+0xd7/0x150
[<0>] blk_finish_plug+0x29/0x40
[<0>] ext4_do_writepages+0x401/0xc90
[<0>] ext4_writepages+0xad/0x180
[<0>] do_writepages+0xd2/0x1e0
[<0>] filemap_fdatawrite_wbc+0x63/0x90
[<0>] __filemap_fdatawrite_range+0x5c/0x80
[<0>] ext4_release_file+0x74/0xb0
[<0>] __fput+0xf5/0x2a0
[<0>] task_work_run+0x5d/0x90
[<0>] exit_to_user_mode_prepare+0x1e6/0x1f0
[<0>] syscall_exit_to_user_mode+0x1b/0x40
[<0>] do_syscall_64+0x6c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[root@dalek dg]# cat /proc/17970/stack
[<0>] md_super_wait+0xa2/0xe0
[<0>] md_bitmap_unplug+0xad/0x120
[<0>] flush_bio_list+0xf3/0x100 [raid1]
[<0>] raid1_unplug+0x3b/0xb0 [raid1]
[<0>] __blk_flush_plug+0xd7/0x150
[<0>] blk_finish_plug+0x29/0x40
[<0>] ext4_do_writepages+0x401/0xc90
[<0>] ext4_writepages+0xad/0x180
[<0>] do_writepages+0xd2/0x1e0
[<0>] filemap_fdatawrite_wbc+0x63/0x90
[<0>] __filemap_fdatawrite_range+0x5c/0x80
[<0>] ext4_release_file+0x74/0xb0
[<0>] __fput+0xf5/0x2a0
[<0>] task_work_run+0x5d/0x90
[<0>] exit_to_user_mode_prepare+0x1e6/0x1f0
[<0>] syscall_exit_to_user_mode+0x1b/0x40
[<0>] do_syscall_64+0x6c/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
quoted
quoted
Also, if you have iostat installed (usually part of the sysstat
package), does "iostat 1" show any I/O activity on the md device?
iostat is showing something odd, most devices are at 0,
except for 3 of the dm's that are stuck at 100% utilisation with
apparently nothing going on:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.06    0.00    0.03   53.06    0.00   46.84

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
...
dm-16            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00 100.00
dm-17            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00 100.00
dm-18            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
dm-19            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
dm-2             0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00   0.00
dm-20            0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.00 100.00
....

dm-20 is the /dev/mapper/main-more which is the RAID on which the
fs runs, 16 and 17 are main-more_rmeta_0 and main-more_rimage_0
so something screwy is going on there.
I've just finished a bisect of this hang, and got to:

615939a2ae734e3e68c816d6749d1f5f79c62ab7 is the first bad commit
commit 615939a2ae734e3e68c816d6749d1f5f79c62ab7
Author: Christoph Hellwig [off-list ref]
Date:   Fri May 19 06:40:48 2023 +0200

    blk-mq: defer to the normal submission path for post-flush requests

    Requests with the FUA bit on hardware without FUA support need a post
    flush before returning to the caller, but they can still be sent using
    the normal I/O path after initializing the flush-related fields and
    end I/O handler.

    Signed-off-by: Christoph Hellwig [off-list ref]
    Reviewed-by: Bart Van Assche [off-list ref]
    Link: https://lore.kernel.org/r/20230519044050.107790-6-hch@lst.de (local)
    Signed-off-by: Jens Axboe [off-list ref]

the full set of runs is:
 2 good: 6995e2de6891c724bfeb2db33d7b87775f913ad1 v6.4
 7 good  0aa69d53ac7c30f6184f88f2e310d808b32b35a5 6.4.0dg-00428-g0aa69d53ac7c
12 good  c1075e548ce6e6b5c7b71f2b05d344164ebc52bb 6.4.0-rc2dg-00020-gc1075e548ce6
15 good  be4c427809b0a746aff54dbb8ef663f0184291d0 6.4.0-rc2dg-00022-gbe4c427809b0
17 bad   615939a2ae734e3e68c816d6749d1f5f79c62ab7 6.4.0-rc2dg-00023-g615939a2ae73
16 bad   1e82fadfc6b96ca79f69d0bcf938d31032bb43d2 6.4.0-rc2dg-00024-g1e82fadfc6b9
14 bad   9a67aa52a42b31ad44220cc218df3b75a5cd5d05 6.4.0-rc2dg-00025-g9a67aa52a42b
13 bad   38f2dd34410f9070b60969a07ff7d8743b4fd56c 6.4.0-rc2dg-00030-g38f2dd34410f
11 bad   a450f49708ea2ccabd1c5d2fe8a702ca5ef77941 6.4.0-rc2dg-00071-ga450f49708ea
10 bad   a4f75764d16bed317276b05a9fe2c179ef61680d 6.4.0-rc2dg-00112-ga4f75764d16b
 9 bad   658afed19ceed54a52b9e9e69c0791c8868ff55d 6.4.0-rc2dg-00193-g658afed19cee
 8 bad   7cffdbe3607a6cc2dc02d135e13732ec36bc4e28 6.4.0dg-00752-g7cffdbe3607a
 6 bad   2605e80d3438c77190f55b821c6575048c68268e 6.4.0dg-01094-g2605e80d3438
 5 bad   6e17c6de3ddf3073741d9c91a796ee696914d8a0 6.4.0dg-02143-g6e17c6de3ddf
 4 bad   3a8a670eeeaa40d87bd38a587438952741980c18 6.4.0dg-04247-g3a8a670eeeaa
 3 bad   1b722407a13b7f8658d2e26917791f32805980a2   6.4.0dg-07066-g1b722407a13b
 1 bad: 52a93d39b17dc v6.5-rc5

a sysrq-w from build 16 ( 1e82fadfc6b96ca79f69d0bcf938d31032bb43d2 )
s below.

This is very repeatable for me;
  a) boot into desktop
  b) Start rhythmbox playing
  c) Open 2 terminals
  d) In one start    make clean on kernel tree
  e) in other start  iostat -x 1
  f) Once the make clean is done,  make -j 32
  g) If that succeeds  try  ag  somejunk   in kernel tree (i.e. parallel
  grep)

It's not clear to me if starting rhythmbox is actually necessary;
but with that recipe it often fails in a few seconds.

Let me know if there's anything you'd like me to try.

Dave

[   90.614548] sysrq: Show Blocked State
[   90.615300] task:kworker/u64:3   state:D stack:0     pid:232   ppid:2      flags:0x00004000
[   90.615307] Workqueue: writeback wb_workfn (flush-253:20)
[   90.615315] Call Trace:
[   90.615317]  <TASK>
[   90.615321]  __schedule+0x3eb/0x14c0
[   90.615327]  ? update_sd_lb_stats.constprop.0+0x135/0x8d0
[   90.615334]  schedule+0x5e/0xd0
[   90.615336]  io_schedule+0x46/0x70
[   90.615338]  bit_wait_io+0x11/0x70
[   90.615341]  __wait_on_bit+0x46/0x140
[   90.615343]  ? __pfx_bit_wait_io+0x10/0x10
[   90.615346]  out_of_line_wait_on_bit+0x95/0xc0
[   90.615349]  ? __pfx_wake_bit_function+0x10/0x10
[   90.615353]  do_get_write_access+0x266/0x410
[   90.615358]  jbd2_journal_get_write_access+0x5f/0x80
[   90.615362]  __ext4_journal_get_write_access+0x85/0x180
[   90.615366]  ? ext4_get_group_desc+0x5c/0xe0
[   90.615369]  ext4_mb_mark_diskspace_used+0x10d/0x460
[   90.615374]  ext4_mb_new_blocks+0x18b/0x1000
[   90.615377]  ? __kmalloc+0x4d/0x150
[   90.615381]  ? ext4_find_extent+0x3c6/0x420
[   90.615385]  ext4_ext_map_blocks+0x60d/0x1980
[   90.615389]  ? allocate_slab+0x251/0x4c0
[   90.615393]  ? ext4_alloc_io_end_vec+0x23/0x80
[   90.615397]  ? mpage_prepare_extent_to_map+0x454/0x490
[   90.615401]  ext4_map_blocks+0x251/0x620
[   90.615404]  ext4_do_writepages+0x76a/0xc90
[   90.615408]  ext4_writepages+0xad/0x180
[   90.615414]  do_writepages+0xd2/0x1e0
[   90.615418]  ? __wb_calc_thresh+0x4b/0x130
[   90.615421]  __writeback_single_inode+0x3d/0x360
[   90.615424]  ? wbc_detach_inode+0x101/0x220
[   90.615427]  writeback_sb_inodes+0x1ed/0x4b0
[   90.615432]  __writeback_inodes_wb+0x4c/0xf0
[   90.615435]  wb_writeback+0x172/0x2f0
[   90.615439]  wb_workfn+0x2b5/0x510
[   90.615442]  ? __schedule+0x3f3/0x14c0
[   90.615444]  ? __mod_timer+0x11f/0x370
[   90.615449]  process_one_work+0x1c7/0x3d0
[   90.615453]  worker_thread+0x51/0x390
[   90.615456]  ? __pfx_worker_thread+0x10/0x10
[   90.615458]  kthread+0xe8/0x120
[   90.615462]  ? __pfx_kthread+0x10/0x10
[   90.615466]  ret_from_fork+0x2c/0x50
[   90.615472]  </TASK>
[   90.615613] task:kworker/u64:4   state:D stack:0     pid:290   ppid:2      flags:0x00004000
[   90.615617] Workqueue: writeback wb_workfn (flush-253:20)
[   90.615621] Call Trace:
[   90.615622]  <TASK>
[   90.615623]  __schedule+0x3eb/0x14c0
[   90.615626]  ? __submit_bio+0x88/0x170
[   90.615629]  ? submit_bio_noacct_nocheck+0x159/0x370
[   90.615632]  schedule+0x5e/0xd0
[   90.615635]  md_super_wait+0xa2/0xe0
[   90.615640]  ? __pfx_autoremove_wake_function+0x10/0x10
[   90.615644]  md_bitmap_unplug+0xf3/0x120
[   90.615650]  flush_bio_list+0x1e/0xe0 [raid1]
[   90.615659]  raid1_unplug+0xaf/0xd0 [raid1]
[   90.615665]  __blk_flush_plug+0xdb/0x160
[   90.615669]  blk_finish_plug+0x29/0x40
[   90.615671]  wb_writeback+0x2bc/0x2f0
[   90.615674]  wb_workfn+0x2b5/0x510
[   90.615677]  ? ttwu_queue_wakelist+0xef/0x110
[   90.615681]  process_one_work+0x1c7/0x3d0
[   90.615684]  worker_thread+0x51/0x390
[   90.615687]  ? __pfx_worker_thread+0x10/0x10
[   90.615689]  kthread+0xe8/0x120
[   90.615692]  ? __pfx_kthread+0x10/0x10
[   90.615696]  ret_from_fork+0x2c/0x50
[   90.615701]  </TASK>
[   90.616150] task:mdX_raid1       state:D stack:0     pid:1425  ppid:2      flags:0x00004000
[   90.616153] Call Trace:
[   90.616154]  <TASK>
[   90.616155]  __schedule+0x3eb/0x14c0
[   90.616157]  ? __schedule+0x3f3/0x14c0
[   90.616161]  schedule+0x5e/0xd0
[   90.616163]  md_super_wait+0xa2/0xe0
[   90.616167]  ? __pfx_autoremove_wake_function+0x10/0x10
[   90.616171]  md_bitmap_daemon_work+0x211/0x3a0
[   90.616175]  ? __pfx_md_thread+0x10/0x10
[   90.616179]  md_check_recovery+0x42/0x5a0
[   90.616182]  raid1d+0x87/0x16d0 [raid1]
[   90.616188]  ? __pfx_md_thread+0x10/0x10
[   90.616191]  ? __schedule+0x3f3/0x14c0
[   90.616193]  ? psi_task_switch+0x84/0x230
[   90.616198]  ? lock_timer_base+0x61/0x80
[   90.616202]  ? prepare_to_wait_event+0x60/0x180
[   90.616206]  ? __pfx_md_thread+0x10/0x10
[   90.616209]  md_thread+0xae/0x190
[   90.616212]  ? __pfx_autoremove_wake_function+0x10/0x10
[   90.616217]  kthread+0xe8/0x120
[   90.616220]  ? __pfx_kthread+0x10/0x10
[   90.616223]  ret_from_fork+0x2c/0x50
[   90.616228]  </TASK>
[   90.616310] task:jbd2/dm-20-8    state:D stack:0     pid:1501  ppid:2      flags:0x00004000
[   90.616313] Call Trace:
[   90.616314]  <TASK>
[   90.616315]  __schedule+0x3eb/0x14c0
[   90.616317]  ? mempool_alloc+0x89/0x1b0
[   90.616322]  schedule+0x5e/0xd0
[   90.616324]  md_super_wait+0xa2/0xe0
[   90.616328]  ? __pfx_autoremove_wake_function+0x10/0x10
[   90.616332]  md_bitmap_unplug+0xba/0x120
[   90.616336]  flush_bio_list+0x1e/0xe0 [raid1]
[   90.616342]  raid1_unplug+0xaf/0xd0 [raid1]
[   90.616349]  __blk_flush_plug+0xdb/0x160
[   90.616352]  blk_finish_plug+0x29/0x40
[   90.616354]  jbd2_journal_commit_transaction+0xfdb/0x1a00
[   90.616362]  kjournald2+0xad/0x280
[   90.616367]  ? __pfx_autoremove_wake_function+0x10/0x10
[   90.616372]  ? __pfx_kjournald2+0x10/0x10
[   90.616375]  kthread+0xe8/0x120
[   90.616378]  ? __pfx_kthread+0x10/0x10
[   90.616382]  ret_from_fork+0x2c/0x50
[   90.616387]  </TASK>
[   90.618244] task:oggdemux0:sink  state:D stack:0     pid:3216  ppid:2103   flags:0x00000002
[   90.618248] Call Trace:
[   90.618249]  <TASK>
[   90.618251]  __schedule+0x3eb/0x14c0
[   90.618254]  schedule+0x5e/0xd0
[   90.618256]  io_schedule+0x46/0x70
[   90.618259]  folio_wait_bit_common+0x13d/0x350
[   90.618262]  ? __pfx_wake_page_function+0x10/0x10
[   90.618265]  filemap_get_pages+0x5fe/0x630
[   90.618268]  ? psi_task_switch+0xd6/0x230
[   90.618272]  ? finish_task_switch.isra.0+0x94/0x2f0
[   90.618276]  filemap_read+0xdf/0x350
[   90.618279]  ? dput+0x3a/0x310
[   90.618284]  vfs_read+0x1f6/0x320
[   90.618289]  ksys_read+0x6f/0xf0
[   90.618291]  do_syscall_64+0x60/0x90
[   90.618295]  ? exit_to_user_mode_prepare+0x142/0x1f0
[   90.618298]  ? syscall_exit_to_user_mode+0x1b/0x40
[   90.618302]  ? do_syscall_64+0x6c/0x90
[   90.618305]  ? exit_to_user_mode_prepare+0x142/0x1f0
[   90.618307]  ? syscall_exit_to_user_mode+0x1b/0x40
[   90.618310]  ? do_syscall_64+0x6c/0x90
[   90.618312]  ? do_syscall_64+0x6c/0x90
[   90.618314]  ? do_syscall_64+0x6c/0x90
[   90.618316]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   90.618321] RIP: 0033:0x7f2b3f9083ba
[   90.618346] RSP: 002b:00007f2b179fe180 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   90.618349] RAX: ffffffffffffffda RBX: 0000559e098dfd20 RCX: 00007f2b3f9083ba
[   90.618351] RDX: 00000000000025ae RSI: 00007f2b04038190 RDI: 0000000000000015
[   90.618352] RBP: 00007f2b179fe1a0 R08: 0000000000000000 R09: 00007f2b3f1e16a0
[   90.618354] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000025ae
[   90.618355] R13: 0000000000000000 R14: 00007f2b179ff648 R15: 00000000000025ae
[   90.618358]  </TASK>
[   90.618426] task:gcc             state:D stack:0     pid:6486  ppid:6485   flags:0x00000006
[   90.618429] Call Trace:
[   90.618430]  <TASK>
[   90.618431]  __schedule+0x3eb/0x14c0
[   90.618435]  schedule+0x5e/0xd0
[   90.618437]  io_schedule+0x46/0x70
[   90.618440]  folio_wait_bit_common+0x13d/0x350
[   90.618443]  ? __pfx_wake_page_function+0x10/0x10
[   90.618445]  filemap_get_pages+0x5fe/0x630
[   90.618449]  ? path_openat+0xa2c/0x1140
[   90.618454]  filemap_read+0xdf/0x350
[   90.618457]  ? do_filp_open+0xb3/0x160
[   90.618461]  vfs_read+0x1f6/0x320
[   90.618465]  ksys_read+0x6f/0xf0
[   90.618467]  do_syscall_64+0x60/0x90
[   90.618469]  ? exc_page_fault+0x7f/0x180
[   90.618472]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[   90.618476] RIP: 0033:0x7efee8b08381
[   90.618480] RSP: 002b:00007ffc7e238648 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[   90.618482] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efee8b08381
[   90.618483] RDX: 0000000000000400 RSI: 000055b8799377e0 RDI: 0000000000000005
[   90.618484] RBP: 00007ffc7e238770 R08: 0000000000000400 R09: 00007efee88442c0
[   90.618486] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000400
[   90.618487] R13: 00007ffc7e238810 R14: 00007ffc7e238690 R15: 00007ffc7e238680
[   90.618490]  </TASK>
Dave

quoted
quoted
What about the underying block dvices used by the md device?  If the
md device is attached to HDD's where you can see the activity light,
can you see (or hear) any disk activity?
It's spinning rust, and I hear them go quiet when the hang happens.

Dave
quoted
This sure seems like either the I/O driver isn't processing requests,
or some kind of hang in the md layer....

				- Ted
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/
-- 
 -----Open up your eyes, open up your mind, open up your code -------   
/ Dr. David Alan Gilbert    |       Running GNU/Linux       | Happy  \ 
\        dave @ treblig.org |                               | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help