Re: general protection fault in wb_workfn (2)
From: Dmitry Vyukov <dvyukov@google.com>
Date: 2018-06-08 17:15:00
Also in:
linux-fsdevel, lkml
On Fri, Jun 8, 2018 at 6:53 PM, Dmitry Vyukov [off-list ref] wrote:
On Fri, Jun 8, 2018 at 5:16 PM, Dmitry Vyukov [off-list ref] wrote:quoted
quoted
On Fri, Jun 8, 2018 at 4:31 AM, Tetsuo Handa [off-list ref] wrote:quoted
Dmitry Vyukov wrote:quoted
On Tue, Jun 5, 2018 at 3:45 PM, Tetsuo Handa [off-list ref] wrote:quoted
Dmitry, can you assign VM resources for a git tree for this bug? This bug wants to fight against https://github.com/google/syzkaller/blob/master/docs/syzbot.md#no-custom-patches ...Hi Tetsuo, Most of the reasons for not doing it still stand. A syzkaller instance will produce not just this bug, it will produce hundreds of different bugs. Then the question is: what to do with these bugs? Report all to mailing lists?Is it possible to add linux-next.git tree as a target for fuzzing? If yes, we can try debug patches easily, in addition to find bugs earlier than now.syzbot tested linux-next and mmotm initially, but they were removed at the request of kernel developers. See: https://groups.google.com/d/msg/syzkaller/0H0LHW_ayR8/dsK5qGB_AQAJ and: https://groups.google.com/d/msg/syzkaller-bugs/FeAgni6Atlk/U0JGoR0AAwAJ Indeed, linux-next produces around 50 assorted one-off unexplainable bug reports.quoted
quoted
I think the solution here is just to run syzkaller instance locally. It's just a program anybody can run it on any kernel with any custom patches. Moreover for local instance it's also possible to limit set of tested syscalls to increase probability of hitting this bug and at the same time filter out most of other bugs.If this bug is reproducible with VM resources individual developer can afford... Since my Linux development environment is VMware guests on a Windows PC, I can't run VM instance which needs KVM acceleration. Also, due to security policy, I can't utilize external VM resources available on the Internet, as well as I can't use ssh and git protocols. Speak of this bug, even with a lot of VM instances, syzbot can reproduce this bug only once or twice per a day. Thus, the question for me boils down to, whether I can reproduce this bug using one VMware guest instance with 4GB of memory. Effectively, I don't have access to environments for running syzkaller instance...Well, I don't know what to say, it does require some resources.quoted
quoted
Do we have any idea about the guilty subsystem? You mentioned bdi_unregister, why? What would be the set of syscalls to concentrate on? I will do a custom run when I get around to it, if nobody else beats me to it.Because bdi_unregister() does "bdi->dev = NULL;" which wb_workfn() is hitting NULL pointer dereference.Right, wb_workfn is not a generic function, it's fs-specific function. Trying to reproduce this locally now.No luck so far. Trying to look from a different angle: is it possible that bdi->dev is not set yet, rather then already reset?I was able to reproduce this once locally running syz-crush utility replaying one of the crash logs. Now running with Tetsuo's patch. I can say we hunting a very subtle race condition with short inconsistency window, perhaps few instructions.
Here we go: [ 2853.033175] WARNING: wb_workfn: device is NULL [ 2853.034709] wb->state=2 [ 2853.035486] (wb == &wb->bdi->wb)=0 [ 2853.036489] list_empty(&wb->work_list)=1 [ 2853.037603] list_empty(&wb->bdi->bdi_list)=0 [ 2853.038843] wb->bdi->wb.state=0 [ 2853.039819] (wb->congested->__bdi == wb->bdi)=1 [ 2853.041062] list_empty(&wb->congested->__bdi->bdi_list)=0 [ 2853.042609] wb->congested->__bdi->wb.state=0 [ 2853.043793] kasan: CONFIG_KASAN_INLINE enabled [ 2853.045315] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 2853.047376] general protection fault: 0000 [#1] SMP KASAN [ 2853.048980] CPU: 1 PID: 13971 Comm: kworker/u12:8 Not tainted 4.17.0+ #21 [ 2853.050762] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014 [ 2853.053034] Workqueue: writeback wb_workfn [ 2853.054193] RIP: 0010:wb_workfn+0x187/0xab0 [ 2853.055360] Code: 85 70 fd ff ff 48 83 e8 10 48 89 85 60 fd ff ff e8 5e 38 ab ff 48 8d 7b 50 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 05 08 00 00 4c 8b 63 50 4d 85 e4 0f 84 b5 05 00 [ 2853.060692] RSP: 0018:ffff8800600ef480 EFLAGS: 00010206 [ 2853.062210] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 2853.064215] RDX: 000000000000000a RSI: ffffffff81cf0312 RDI: 0000000000000050 [ 2853.066198] RBP: ffff8800600ef750 R08: ffff880061e30400 R09: ffffed000d8ccfc0 [ 2853.068037] R10: ffffed000d8ccfc0 R11: ffff88006c667e07 R12: 1ffff1000c01dead [ 2853.069970] R13: ffff8800600ef728 R14: ffff8800676bd180 R15: dffffc0000000000 [ 2853.071932] FS: 0000000000000000(0000) GS:ffff88006c640000(0000) knlGS:0000000000000000 [ 2853.074080] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2853.075699] CR2: 00007ffc8478c000 CR3: 0000000008c6a006 CR4: 00000000001626e0 [ 2853.077633] Call Trace: [ 2853.078341] ? inode_wait_for_writeback+0x40/0x40 [ 2853.079642] ? graph_lock+0x170/0x170 [ 2853.080710] ? lock_downgrade+0x8e0/0x8e0 [ 2853.081889] ? find_held_lock+0x36/0x1c0 [ 2853.083047] ? graph_lock+0x170/0x170 [ 2853.084105] ? lock_acquire+0x1dc/0x520 [ 2853.085216] ? process_one_work+0xb8c/0x1b70 [ 2853.086425] ? kasan_check_read+0x11/0x20 [ 2853.087608] ? __lock_is_held+0xb5/0x140 [ 2853.088720] process_one_work+0xc64/0x1b70 [ 2853.089875] ? finish_task_switch+0x182/0x840 [ 2853.091085] ? pwq_dec_nr_in_flight+0x490/0x490 [ 2853.092358] ? __schedule+0x809/0x1e30 [ 2853.093475] ? retint_kernel+0x10/0x10 [ 2853.094561] ? retint_kernel+0x10/0x10 [ 2853.095610] ? graph_lock+0x170/0x170 [ 2853.096623] ? trace_hardirqs_on_caller+0x421/0x5c0 [ 2853.097981] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 2853.099298] ? find_held_lock+0x36/0x1c0 [ 2853.100394] ? lock_acquire+0x1dc/0x520 [ 2853.101472] ? worker_thread+0x3d4/0x13a0 [ 2853.102558] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20 [ 2853.104060] ? move_linked_works+0x2f6/0x470 [ 2853.105204] ? trace_event_raw_event_workqueue_execute_start+0x290/0x290 [ 2853.107013] ? do_raw_spin_trylock+0x1b0/0x1b0 [ 2853.108243] worker_thread+0x9e5/0x13a0 [ 2853.109304] ? process_one_work+0x1b70/0x1b70 [ 2853.110488] ? graph_lock+0x170/0x170 [ 2853.111514] ? find_held_lock+0x36/0x1c0 [ 2853.112610] ? find_held_lock+0x36/0x1c0 [ 2853.113674] ? __schedule+0x1e30/0x1e30 [ 2853.114714] ? do_raw_spin_unlock+0x1f9/0x2e0 [ 2853.115885] ? do_raw_spin_trylock+0x1b0/0x1b0 [ 2853.117060] ? __sanitizer_cov_trace_const_cmp8+0x18/0x20 [ 2853.118492] ? __kthread_parkme+0x111/0x1d0 [ 2853.119616] ? parse_args.cold.15+0x1b3/0x1b3 [ 2853.120804] ? trace_hardirqs_on_caller+0x421/0x5c0 [ 2853.122071] ? trace_hardirqs_on+0xd/0x10 [ 2853.123117] kthread+0x345/0x410 [ 2853.123999] ? process_one_work+0x1b70/0x1b70 [ 2853.125174] ? kthread_bind+0x40/0x40 [ 2853.126201] ret_from_fork+0x3a/0x50 [ 2853.127199] Modules linked in: [ 2853.128022] Dumping ftrace buffer: [ 2853.128901] (ftrace buffer empty) [ 2853.129986] ---[ end trace 3ba28e076cb32fda ]--- [ 2853.131269] RIP: 0010:wb_workfn+0x187/0xab0 [ 2853.132441] Code: 85 70 fd ff ff 48 83 e8 10 48 89 85 60 fd ff ff e8 5e 38 ab ff 48 8d 7b 50 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 05 08 00 00 4c 8b 63 50 4d 85 e4 0f 84 b5 05 00 [ 2853.137449] RSP: 0018:ffff8800600ef480 EFLAGS: 00010206 [ 2853.138618] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 2853.140176] RDX: 000000000000000a RSI: ffffffff81cf0312 RDI: 0000000000000050 [ 2853.141722] RBP: ffff8800600ef750 R08: ffff880061e30400 R09: ffffed000d8ccfc0 [ 2853.143300] R10: ffffed000d8ccfc0 R11: ffff88006c667e07 R12: 1ffff1000c01dead [ 2853.144841] R13: ffff8800600ef728 R14: ffff8800676bd180 R15: dffffc0000000000 [ 2853.146391] FS: 0000000000000000(0000) GS:ffff88006c640000(0000) knlGS:0000000000000000 [ 2853.148141] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 2853.149406] CR2: 00007ffc8478c000 CR3: 0000000008c6a006 CR4: 00000000001626e0 [ 2853.150968] Kernel panic - not syncing: Fatal exception [ 2853.152419] Dumping ftrace buffer: [ 2853.153121] (ftrace buffer empty) [ 2853.153786] Kernel Offset: disabled [ 2853.154442] Rebooting in 86400 seconds..