Thread (12 messages) 12 messages, 6 authors, 2024-09-28

Re: [syzbot] Monthly lsm report (Sep 2024)

From: Kent Overstreet <kent.overstreet@linux.dev>
Date: 2024-09-28 01:25:43
Also in: lkml

On Fri, Sep 27, 2024 at 08:08:48PM GMT, Kent Overstreet wrote:
On Fri, Sep 27, 2024 at 02:18:10PM GMT, Roberto Sassu wrote:
quoted
On Tue, 2024-09-24 at 13:53 +0200, Roberto Sassu wrote:
quoted
On Mon, 2024-09-23 at 08:06 -0400, Paul Moore wrote:
quoted
On Mon, Sep 23, 2024 at 5:02 AM syzbot
[off-list ref] wrote:
quoted
Hello lsm maintainers/developers,

This is a 31-day syzbot report for the lsm subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/lsm

During the period, 0 new issues were detected and 0 were fixed.
In total, 4 issues are still open and 27 have been fixed so far.

Some of the still happening issues:

Ref Crashes Repro Title
<1> 306     No    INFO: task hung in process_measurement (2)
                  https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330
Mimi, Roberto,

Any chance this is this related in any way to this report:

https://lore.kernel.org/linux-security-module/CALAgD-4hkHVcCq2ycdwnA2hYDBMqijLUOfZgvf1WfFpU-8+42w@mail.gmail.com/ (local)
I reproduced the last, but I got a different result (the kernel crashed
in a different place).

It seems a corruption case, while the former looks more a lock
inversion issue. Will check more.
+ Kent Overstreet

https://syzkaller.appspot.com/bug?extid=1de5a37cb85a2d536330

It happens few times per day, since commit 4a39ac5b7d62 (which is
followed by a lot of merges). The bug has been likely introduced there.

In all recent reports, I noticed that there is always the following
lock sequence:

[  291.584319][   T30] 5 locks held by syz.0.75/5970:
[  291.594487][   T30]  #0: ffff888064066420 (sb_writers#25){.+.+}-{0:0}, at: mnt_want_write+0x3f/0x90
[  291.603984][   T30]  #1: ffff88805d8b0148 (&sb->s_type->i_mutex_key#30){++++}-{3:3}, at: do_truncate+0x20c/0x310
[  291.614497][   T30]  #2: ffff888054700a38 (&c->snapshot_create_lock){.+.+}-{3:3}, at: bch2_truncate+0x16d/0x2c0
[  291.624871][   T30]  #3: ffff888054704398 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x7de/0xd20
[  291.635446][   T30]  #4: ffff8880547266d0 (&c->gc_lock){.+.+}-{3:3}, at: bch2_btree_update_start+0x682/0x14e0

IMA is stuck too, since it is waiting for the inode lock to be released:

[  291.645689][   T30] 1 lock held by syz.0.75/6010:
[  291.650622][   T30]  #0: ffff88805d8b0148 (&sb->s_type->i_mutex_key#30){++++}-{3:3}, at: process_measurement+0x439/0x1fb0

It seems that the super block is locked by someone else, which is not
able to unlock. Maybe, it is related to bch2_journal_reclaim_thread(),
but I don't know for sure.

Kent, do you have time to look at this report?
If you scroll back in the console log, you see this:

[  230.372988][ T6772] Allocator stuck? Waited for 30 seconds
[  230.373025][ T6772] Allocator debug:
[  230.373039][ T6772]   capacity1536
[  230.373051][ T6772]   reserved             31232
[  230.373064][ T6772]   hidden               0
[  230.373077][ T6772]   btree                0
[  230.373090][ T6772]   data                 0
[  230.373102][ T6772]   cached               0
[  230.373114][ T6772]   reserved             0
[  230.373127][ T6772]   online_reserved      768
[  230.373140][ T6772]   nr_inodes            0
[  230.373152][ T6772]   
[  230.373164][ T6772]   freelist_wait        waiting
[  230.373176][ T6772]   open buckets allocated1
[  230.373189][ T6772]   open buckets total   1024
[  230.373202][ T6772]   open_buckets_wait    empty
[  230.373214][ T6772]   open_buckets_btree   0
[  230.373227][ T6772]   open_buckets_user    0
[  230.373239][ T6772]   btree reserve cache  0
[  230.373251][ T6772] 
[  230.373262][ T6772] Dev 0:
[  230.373274][ T6772]                      buckets         sectors      fragmented
[  230.373288][ T6772]   free                     0               0               0
[  230.373303][ T6772]   sb                       0               0               0
[  230.373318][ T6772]   journal                  0               0               0
[  230.373332][ T6772]   btree                    0               0               0
[  230.373347][ T6772]   user                     0               0               0
[  230.373361][ T6772]   cached                   0               0               0
[  230.373375][ T6772]   parity                   0               0               0
[  230.373390][ T6772]   stripe                   0               0               0
[  230.373404][ T6772]   need_gc_gens             0               0               0
[  230.373418][ T6772]   need_discard             0               0               0
[  230.373432][ T6772]   unstriped                0               0               0
[  230.373446][ T6772]   capacity               128
[  230.373459][ T6772]   
[  230.373470][ T6772]   reserves:
[  230.373481][ T6772]   stripe                  60
[  230.373494][ T6772]   normal                  58
[  230.373506][ T6772]   copygc                  56
[  230.373519][ T6772]   btree                   28
[  230.373531][ T6772]   btree_copygc             0
[  230.373543][ T6772]   reclaim                  0
[  230.373556][ T6772]   interior_updates         0
[  230.373569][ T6772]   
[  230.373580][ T6772]   open buckets             0
[  230.373592][ T6772]   buckets to invalidate    0
[  230.373605][ T6772] 
[  230.373616][ T6772] Copygc debug:
[  230.373627][ T6772]   running: 1
[  230.373656][ T6772]   copygc_wait:0
[  230.373675][ T6772]   copygc_wait_at:0
[  230.373694][ T6772]   Currently waiting for:0 B
[  230.373708][ T6772]   Currently waiting since:644 KiB
[  230.373722][ T6772]   Currently calculated wait:0 B
[  230.373735][ T6772]

this looks like a bug in bcachefs where alloc counters didn't get
initialized correctly, which makes sense given that 6.11 landed the disk
accounting rewrite. Will dig more as I have time.
And looking further, I don't see anyhting in the console log from when
bcachefs actually mounted (???), which means I don't think I have enough
to go on. It's clearly an upgrade path issue - we didn't run
check_allocations as is required when upgrading to 1.11 - but it's not
reproducing for me when I run tests with old tools.

Can we get some more information about the syzbot reproducer? Exact
tools version, format command and mount command.
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help