Re: [syzbot] Monthly lsm report (Sep 2024)
From: Kent Overstreet <kent.overstreet@linux.dev>
Date: 2024-09-28 01:25:43
Also in:
lkml
On Fri, Sep 27, 2024 at 08:08:48PM GMT, Kent Overstreet wrote:
On Fri, Sep 27, 2024 at 02:18:10PM GMT, Roberto Sassu wrote:quoted
On Tue, 2024-09-24 at 13:53 +0200, Roberto Sassu wrote:quoted
On Mon, 2024-09-23 at 08:06 -0400, Paul Moore wrote:quoted
On Mon, Sep 23, 2024 at 5:02 AM syzbot [off-list ref] wrote:quoted
Hello lsm maintainers/developers, This is a 31-day syzbot report for the lsm subsystem. All related reports/information can be found at: https://syzkaller.appspot.com/upstream/s/lsm During the period, 0 new issues were detected and 0 were fixed. In total, 4 issues are still open and 27 have been fixed so far. Some of the still happening issues: Ref Crashes Repro Title <1> 306 No INFO: task hung in process_measurement (2) https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330Mimi, Roberto, Any chance this is this related in any way to this report: https://lore.kernel.org/linux-security-module/CALAgD-4hkHVcCq2ycdwnA2hYDBMqijLUOfZgvf1WfFpU-8+42w@mail.gmail.com/ (local)I reproduced the last, but I got a different result (the kernel crashed in a different place). It seems a corruption case, while the former looks more a lock inversion issue. Will check more.+ Kent Overstreet https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330 It happens few times per day, since commit 4a39ac5b7d62 (which is followed by a lot of merges). The bug has been likely introduced there. In all recent reports, I noticed that there is always the following lock sequence: [ 291.584319][ T30] 5 locks held by syz.0.75/5970: [ 291.594487][ T30] #0: ffff888064066420 (sb_writers#25){.+.+}-{0:0}, at: mnt_want_write+0x3f/0x90 [ 291.603984][ T30] #1: ffff88805d8b0148 (&sb->s_type->i_mutex_key#30){++++}-{3:3}, at: do_truncate+0x20c/0x310 [ 291.614497][ T30] #2: ffff888054700a38 (&c->snapshot_create_lock){.+.+}-{3:3}, at: bch2_truncate+0x16d/0x2c0 [ 291.624871][ T30] #3: ffff888054704398 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x7de/0xd20 [ 291.635446][ T30] #4: ffff8880547266d0 (&c->gc_lock){.+.+}-{3:3}, at: bch2_btree_update_start+0x682/0x14e0 IMA is stuck too, since it is waiting for the inode lock to be released: [ 291.645689][ T30] 1 lock held by syz.0.75/6010: [ 291.650622][ T30] #0: ffff88805d8b0148 (&sb->s_type->i_mutex_key#30){++++}-{3:3}, at: process_measurement+0x439/0x1fb0 It seems that the super block is locked by someone else, which is not able to unlock. Maybe, it is related to bch2_journal_reclaim_thread(), but I don't know for sure. Kent, do you have time to look at this report?If you scroll back in the console log, you see this: [ 230.372988][ T6772] Allocator stuck? Waited for 30 seconds [ 230.373025][ T6772] Allocator debug: [ 230.373039][ T6772] capacity1536 [ 230.373051][ T6772] reserved 31232 [ 230.373064][ T6772] hidden 0 [ 230.373077][ T6772] btree 0 [ 230.373090][ T6772] data 0 [ 230.373102][ T6772] cached 0 [ 230.373114][ T6772] reserved 0 [ 230.373127][ T6772] online_reserved 768 [ 230.373140][ T6772] nr_inodes 0 [ 230.373152][ T6772] [ 230.373164][ T6772] freelist_wait waiting [ 230.373176][ T6772] open buckets allocated1 [ 230.373189][ T6772] open buckets total 1024 [ 230.373202][ T6772] open_buckets_wait empty [ 230.373214][ T6772] open_buckets_btree 0 [ 230.373227][ T6772] open_buckets_user 0 [ 230.373239][ T6772] btree reserve cache 0 [ 230.373251][ T6772] [ 230.373262][ T6772] Dev 0: [ 230.373274][ T6772] buckets sectors fragmented [ 230.373288][ T6772] free 0 0 0 [ 230.373303][ T6772] sb 0 0 0 [ 230.373318][ T6772] journal 0 0 0 [ 230.373332][ T6772] btree 0 0 0 [ 230.373347][ T6772] user 0 0 0 [ 230.373361][ T6772] cached 0 0 0 [ 230.373375][ T6772] parity 0 0 0 [ 230.373390][ T6772] stripe 0 0 0 [ 230.373404][ T6772] need_gc_gens 0 0 0 [ 230.373418][ T6772] need_discard 0 0 0 [ 230.373432][ T6772] unstriped 0 0 0 [ 230.373446][ T6772] capacity 128 [ 230.373459][ T6772] [ 230.373470][ T6772] reserves: [ 230.373481][ T6772] stripe 60 [ 230.373494][ T6772] normal 58 [ 230.373506][ T6772] copygc 56 [ 230.373519][ T6772] btree 28 [ 230.373531][ T6772] btree_copygc 0 [ 230.373543][ T6772] reclaim 0 [ 230.373556][ T6772] interior_updates 0 [ 230.373569][ T6772] [ 230.373580][ T6772] open buckets 0 [ 230.373592][ T6772] buckets to invalidate 0 [ 230.373605][ T6772] [ 230.373616][ T6772] Copygc debug: [ 230.373627][ T6772] running: 1 [ 230.373656][ T6772] copygc_wait:0 [ 230.373675][ T6772] copygc_wait_at:0 [ 230.373694][ T6772] Currently waiting for:0 B [ 230.373708][ T6772] Currently waiting since:644 KiB [ 230.373722][ T6772] Currently calculated wait:0 B [ 230.373735][ T6772] this looks like a bug in bcachefs where alloc counters didn't get initialized correctly, which makes sense given that 6.11 landed the disk accounting rewrite. Will dig more as I have time.
And looking further, I don't see anyhting in the console log from when bcachefs actually mounted (???), which means I don't think I have enough to go on. It's clearly an upgrade path issue - we didn't run check_allocations as is required when upgrading to 1.11 - but it's not reproducing for me when I run tests with old tools. Can we get some more information about the syzbot reproducer? Exact tools version, format command and mount command.