Re: [syzbot] Monthly lsm report (Sep 2024)

From: Kent Overstreet <kent.overstreet@linux.dev>
Date: 2024-09-28 01:25:43
Also in: lkml

On Fri, Sep 27, 2024 at 08:08:48PM GMT, Kent Overstreet wrote:

On Fri, Sep 27, 2024 at 02:18:10PM GMT, Roberto Sassu wrote:

quoted

On Tue, 2024-09-24 at 13:53 +0200, Roberto Sassu wrote:

quoted

On Mon, 2024-09-23 at 08:06 -0400, Paul Moore wrote:

quoted

On Mon, Sep 23, 2024 at 5:02 AM syzbot
[off-list ref] wrote:

quoted

Hello lsm maintainers/developers,

This is a 31-day syzbot report for the lsm subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/lsm

During the period, 0 new issues were detected and 0 were fixed.
In total, 4 issues are still open and 27 have been fixed so far.

Some of the still happening issues:

Ref Crashes Repro Title
<1> 306     No    INFO: task hung in process_measurement (2)
                  https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330

Mimi, Roberto,

Any chance this is this related in any way to this report:

https://lore.kernel.org/linux-security-module/CALAgD-4hkHVcCq2ycdwnA2hYDBMqijLUOfZgvf1WfFpU-8+42w@mail.gmail.com/ (local)

I reproduced the last, but I got a different result (the kernel crashed
in a different place).

It seems a corruption case, while the former looks more a lock
inversion issue. Will check more.

+ Kent Overstreet

https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330

It happens few times per day, since commit 4a39ac5b7d62 (which is
followed by a lot of merges). The bug has been likely introduced there.

In all recent reports, I noticed that there is always the following
lock sequence:

[  291.584319][   T30] 5 locks held by syz.0.75/5970:
[  291.594487][   T30]  #0: ffff888064066420 (sb_writers#25){.+.+}-{0:0}, at: mnt_want_write+0x3f/0x90
[  291.603984][   T30]  #1: ffff88805d8b0148 (&sb->s_type->i_mutex_key#30){++++}-{3:3}, at: do_truncate+0x20c/0x310
[  291.614497][   T30]  #2: ffff888054700a38 (&c->snapshot_create_lock){.+.+}-{3:3}, at: bch2_truncate+0x16d/0x2c0
[  291.624871][   T30]  #3: ffff888054704398 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x7de/0xd20
[  291.635446][   T30]  #4: ffff8880547266d0 (&c->gc_lock){.+.+}-{3:3}, at: bch2_btree_update_start+0x682/0x14e0

IMA is stuck too, since it is waiting for the inode lock to be released:

[  291.645689][   T30] 1 lock held by syz.0.75/6010:
[  291.650622][   T30]  #0: ffff88805d8b0148 (&sb->s_type->i_mutex_key#30){++++}-{3:3}, at: process_measurement+0x439/0x1fb0

It seems that the super block is locked by someone else, which is not
able to unlock. Maybe, it is related to bch2_journal_reclaim_thread(),
but I don't know for sure.

Kent, do you have time to look at this report?

If you scroll back in the console log, you see this:

[  230.372988][ T6772] Allocator stuck? Waited for 30 seconds
[  230.373025][ T6772] Allocator debug:
[  230.373039][ T6772]   capacity1536
[  230.373051][ T6772]   reserved             31232
[  230.373064][ T6772]   hidden               0
[  230.373077][ T6772]   btree                0
[  230.373090][ T6772]   data                 0
[  230.373102][ T6772]   cached               0
[  230.373114][ T6772]   reserved             0
[  230.373127][ T6772]   online_reserved      768
[  230.373140][ T6772]   nr_inodes            0
[  230.373152][ T6772]   
[  230.373164][ T6772]   freelist_wait        waiting
[  230.373176][ T6772]   open buckets allocated1
[  230.373189][ T6772]   open buckets total   1024
[  230.373202][ T6772]   open_buckets_wait    empty
[  230.373214][ T6772]   open_buckets_btree   0
[  230.373227][ T6772]   open_buckets_user    0
[  230.373239][ T6772]   btree reserve cache  0
[  230.373251][ T6772] 
[  230.373262][ T6772] Dev 0:
[  230.373274][ T6772]                      buckets         sectors      fragmented
[  230.373288][ T6772]   free                     0               0               0
[  230.373303][ T6772]   sb                       0               0               0
[  230.373318][ T6772]   journal                  0               0               0
[  230.373332][ T6772]   btree                    0               0               0
[  230.373347][ T6772]   user                     0               0               0
[  230.373361][ T6772]   cached                   0               0               0
[  230.373375][ T6772]   parity                   0               0               0
[  230.373390][ T6772]   stripe                   0               0               0
[  230.373404][ T6772]   need_gc_gens             0               0               0
[  230.373418][ T6772]   need_discard             0               0               0
[  230.373432][ T6772]   unstriped                0               0               0
[  230.373446][ T6772]   capacity               128
[  230.373459][ T6772]   
[  230.373470][ T6772]   reserves:
[  230.373481][ T6772]   stripe                  60
[  230.373494][ T6772]   normal                  58
[  230.373506][ T6772]   copygc                  56
[  230.373519][ T6772]   btree                   28
[  230.373531][ T6772]   btree_copygc             0
[  230.373543][ T6772]   reclaim                  0
[  230.373556][ T6772]   interior_updates         0
[  230.373569][ T6772]   
[  230.373580][ T6772]   open buckets             0
[  230.373592][ T6772]   buckets to invalidate    0
[  230.373605][ T6772] 
[  230.373616][ T6772] Copygc debug:
[  230.373627][ T6772]   running: 1
[  230.373656][ T6772]   copygc_wait:0
[  230.373675][ T6772]   copygc_wait_at:0
[  230.373694][ T6772]   Currently waiting for:0 B
[  230.373708][ T6772]   Currently waiting since:644 KiB
[  230.373722][ T6772]   Currently calculated wait:0 B
[  230.373735][ T6772]

this looks like a bug in bcachefs where alloc counters didn't get
initialized correctly, which makes sense given that 6.11 landed the disk
accounting rewrite. Will dig more as I have time.

And looking further, I don't see anyhting in the console log from when
bcachefs actually mounted (???), which means I don't think I have enough
to go on. It's clearly an upgrade path issue - we didn't run
check_allocations as is required when upgrading to 1.11 - but it's not
reproducing for me when I run tests with old tools.

Can we get some more information about the syzbot reproducer? Exact
tools version, format command and mount command.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help