Re: [PATCH 0/13 v2] block: Fix block device shutdown related races

From: Omar Sandoval <osandov@osandov.com>
Date: 2017-03-06 20:38:18

On Wed, Mar 01, 2017 at 04:11:09PM +0100, Jan Kara wrote:

On Tue 28-02-17 23:26:53, Omar Sandoval wrote:

quoted

On Tue, Feb 28, 2017 at 11:25:28PM -0800, Omar Sandoval wrote:

quoted

On Wed, Feb 22, 2017 at 11:24:25AM +0100, Jan Kara wrote:

quoted

On Tue 21-02-17 10:19:28, Jens Axboe wrote:

quoted

On 02/21/2017 10:09 AM, Jan Kara wrote:

quoted

Hello,

this is a second revision of the patch set to fix several different races and
issues I've found when testing device shutdown and reuse. The first three
patches are fixes to problems in my previous series fixing BDI lifetime issues.
Patch 4 fixes issues with reuse of BDI name with scsi devices. With it I cannot
reproduce the BDI name reuse issues using Omar's stress test using scsi_debug
so it can be used as a replacement of Dan's patches. Patches 5-11 fix oops that
is triggered by __blkdev_put() calling inode_detach_wb() too early (the problem
reported by Thiago). Patches 12 and 13 fix oops due to a bug in gendisk code
where get_gendisk() can return already freed gendisk structure (again triggered
by Omar's stress test).

People, please have a look at patches. They are mostly simple however the
interactions are rather complex so I may have missed something. Also I'm
happy for any additional testing these patches can get - I've stressed them
with Omar's script, tested memcg writeback, tested static (not udev managed)
device inodes.

Jens, I think at least patches 1-3 should go in together with my fixes you
already have in your tree (or shortly after them). It is up to you whether
you decide to delay my first fixes or pick these up quickly. Patch 4 is
(IMHO a cleaner) replacement of Dan's patches so consider whether you want
to use it instead of those patches.

I have applied 1-3 to my for-linus branch, which will go in after
the initial pull request has been pulled by Linus. Consider fixing up
#4 so it applies, I like it.

OK, attached is patch 4 rebased on top of Linus' tree from today which
already has linux-block changes pulled in. I've left put_disk_devt() in
blk_cleanup_queue() to maintain the logic in the original patch (now commit
0dba1314d4f8) that request_queue and gendisk each hold one devt reference.
The bdi_unregister() call that is moved to del_gendisk() by this patch is
now protected by the gendisk reference instead of the request_queue one
so it still maintains the property that devt reference protects bdi
registration-unregistration lifetime (as much as that is not needed anymore
after this patch).

I have also updated the comment in the code and the changelog - they were
somewhat stale after changes to the whole series Tejun suggested.

								Honza

Hey, Jan, I just tested this out when I was seeing similar crashes with
sr instead of sd, and this fixed it.

Tested-by: Omar Sandoval <redacted>

Just realized it wasn't clear, I'm talking about patch 4 specifically.

Thanks for confirmation!

								Honza

Unfortunately, this patch actually seems to have introduced a
regression. Reverting the patch fixes it.

I'm running a QEMU kvm vm with a virtio-scsi device and I get oopses
like this:

[    6.741773] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[    6.744401] IP: virtscsi_queuecommand_single+0x21/0x40 [virtio_scsi]
[    6.744985] PGD 0
[    6.744985]
[    6.744985] Oops: 0002 [#1] PREEMPT SMP
[    6.744985] Modules linked in: virtio_scsi scsi_mod virtio_net
[    6.744985] CPU: 3 PID: 5 Comm: kworker/u8:0 Not tainted 4.11.0-rc1 #36
[    6.744985] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
[    6.744985] Workqueue: events_unbound async_run_entry_fn
[    6.744985] task: ffff88007c8672c0 task.stack: ffffc9000033c000
[    6.750038] RIP: 0010:virtscsi_queuecommand_single+0x21/0x40 [virtio_scsi]
[    6.750038] RSP: 0018:ffffc9000033f8b0 EFLAGS: 00010246
[    6.750038] RAX: ffff88007c98d000 RBX: ffff880078c814c8 RCX: 0000000000000000
[    6.750038] RDX: ffff880078c814c8 RSI: ffff880078c814c8 RDI: ffff88007c98d000
[    6.750038] RBP: ffffc9000033f8b0 R08: 0000000000000000 R09: 0000000000000400
[    6.750038] R10: ffff88007b1fe748 R11: 0000000000000024 R12: ffff880078c814c8
[    6.750038] R13: ffff88007c98d000 R14: ffff880078c81380 R15: ffff88007b532000
[    6.750038] FS:  0000000000000000(0000) GS:ffff88007fd80000(0000) knlGS:0000000000000000
[    6.750038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.750038] CR2: 0000000000000004 CR3: 0000000001c09000 CR4: 00000000000006e0
[    6.750038] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[    6.750038] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[    6.750038] Call Trace:
[    6.750038]  scsi_dispatch_cmd+0xa3/0x1d0 [scsi_mod]
[    6.750038]  scsi_queue_rq+0x586/0x740 [scsi_mod]
[    6.750038]  blk_mq_dispatch_rq_list+0x22a/0x420
[    6.750038]  blk_mq_sched_dispatch_requests+0x142/0x1b0
[    6.750038]  __blk_mq_run_hw_queue+0x94/0xb0
[    6.750038]  blk_mq_run_hw_queue+0x82/0xa0
[    6.750038]  blk_mq_sched_insert_request+0x89/0x1a0
[    6.750038]  ? blk_execute_rq+0xc0/0xc0
[    6.750038]  blk_execute_rq_nowait+0x9a/0x140
[    6.750038]  ? bio_phys_segments+0x19/0x20
[    6.750038]  blk_execute_rq+0x56/0xc0
[    6.750038]  scsi_execute+0x128/0x270 [scsi_mod]
[    6.750038]  scsi_probe_and_add_lun+0x247/0xc60 [scsi_mod]
[    6.750038]  ? __pm_runtime_resume+0x4c/0x60
[    6.750038]  __scsi_scan_target+0x102/0x520 [scsi_mod]
[    6.750038]  ? account_entity_dequeue+0xab/0xe0
[    6.750038]  scsi_scan_channel+0x81/0xa0 [scsi_mod]
[    6.750038]  scsi_scan_host_selected+0xcc/0x100 [scsi_mod]
[    6.750038]  do_scsi_scan_host+0x8d/0x90 [scsi_mod]
[    6.750038]  do_scan_async+0x1c/0x1a0 [scsi_mod]
[    6.750038]  async_run_entry_fn+0x4a/0x170
[    6.750038]  process_one_work+0x165/0x430
[    6.750038]  worker_thread+0x4e/0x490
[    6.750038]  kthread+0x101/0x140
[    6.750038]  ? process_one_work+0x430/0x430
[    6.750038]  ? kthread_create_on_node+0x60/0x60
[    6.750038]  ret_from_fork+0x2c/0x40
[    6.750038] Code: ff ff 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 4e 30 48 89 f8 48 89 f2 48 89 e5 48 8b 89 88 01 00 00 48 8b 89 08 03 00 00 <f0> ff 41 04 48 8d bf c0 07 00 00 48 8d b0 c8 09 00 00 e8 08 fd
[    6.750038] RIP: virtscsi_queuecommand_single+0x21/0x40 [virtio_scsi] RSP: ffffc9000033f8b0
[    6.750038] CR2: 0000000000000004
[    6.750038] ---[ end trace cee5df55872abfa0 ]---
[    6.750038] note: kworker/u8:0[5] exited with preempt_count 1

Arch Linux 4.11.0-rc1 (ttyS0)

silver login: [   27.370267] scsi 0:0:53:0: tag#766 abort
[   27.374501] scsi 0:0:53:0: device reset
[   27.378539] scsi 0:0:53:0: Device offlined - not ready after error recovery

That crash is because tgt is NULL here:

----
static int virtscsi_queuecommand_single(struct Scsi_Host *sh,
					struct scsi_cmnd *sc)
{
	struct virtio_scsi *vscsi = shost_priv(sh);
	struct virtio_scsi_target_state *tgt =
				scsi_target(sc->device)->hostdata;

======> atomic_inc(&tgt->reqs);
	return virtscsi_queuecommand(vscsi, &vscsi->req_vqs[0], sc);
}
----

Applying the rest of this patch set didn't fix it. Additionally,
enabling KASAN gives this use-after-free spew:
----
[    7.789280] SCSI subsystem initialized
[    7.809991] virtio_net virtio0 ens2: renamed from eth0
[    7.828301] scsi host0: Virtio SCSI HBA
[    7.916214] scsi 0:0:0:0: Direct-Access     QEMU     QEMU HARDDISK    2.5+ PQ: 0 ANSI: 5
[    7.946713] ==================================================================
[    7.947609] BUG: KASAN: use-after-free in rb_erase+0x13a2/0x1a20 at addr ffff88006a915e30
[    7.948583] Write of size 8 by task dhcpcd-run-hook/146
[    7.949202] CPU: 0 PID: 146 Comm: dhcpcd-run-hook Not tainted 4.11.0-rc1-00011-gc35ccd2d43e9 #42
[    7.950021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
[    7.950021] Call Trace:
[    7.950021]  <IRQ>
[    7.950021]  dump_stack+0x63/0x86
[    7.950021]  kasan_object_err+0x21/0x70
[    7.950021]  kasan_report.part.1+0x23a/0x520
[    7.950021]  ? rb_erase+0x13a2/0x1a20
[    7.950021]  ? swake_up_locked+0x94/0x1c0
[    7.950021]  __asan_report_store8_noabort+0x31/0x40
[    7.950021]  rb_erase+0x13a2/0x1a20
[    7.950021]  wb_congested_put+0x6a/0xd0
[    7.950021]  __blkg_release_rcu+0x16b/0x230
[    7.950021]  rcu_process_callbacks+0x602/0xfc0
[    7.950021]  __do_softirq+0x14e/0x5b3
[    7.950021]  irq_exit+0x14e/0x180
[    7.950021]  smp_apic_timer_interrupt+0x7b/0xa0
[    7.950021]  apic_timer_interrupt+0x89/0x90
[    7.950021] RIP: 0010:copy_strings.isra.7+0x312/0x6b0
[    7.950021] RSP: 0018:ffff880063bffcd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
[    7.950021] RAX: 0000000000002000 RBX: 0000000000000fe3 RCX: 0000000000000002
[    7.950021] RDX: ffff88006a734880 RSI: 0000000000000002 RDI: ffff880069750808
[    7.950021] RBP: ffff880063bffdd0 R08: 8000000063ec3067 R09: 0000000000000040
[    7.950021] R10: ffffed000c7d8600 R11: ffffffff82afa6a0 R12: 00007fffffffefe3
[    7.950021] R13: 0000000000000015 R14: ffff880069750780 R15: dffffc0000000000
[    7.950021]  </IRQ>
[    7.950021]  ? set_binfmt+0x120/0x120
[    7.950021]  ? insert_vm_struct+0x148/0x2e0
[    7.950021]  ? kasan_slab_alloc+0x12/0x20
[    7.950021]  ? count.isra.6.constprop.16+0x52/0x100
[    7.950021]  do_execveat_common.isra.14+0xef1/0x1b80
[    7.950021]  ? prepare_bprm_creds+0x100/0x100
[    7.950021]  ? getname_flags+0x90/0x3f0
[    7.950021]  ? __do_page_fault+0x5cc/0xbc0
[    7.950021]  SyS_execve+0x3a/0x50
[    7.950021]  ? ptregs_sys_vfork+0x10/0x10
[    7.950021]  do_syscall_64+0x180/0x380
[    7.950021]  entry_SYSCALL64_slow_path+0x25/0x25
[    7.950021] RIP: 0033:0x7f85de145477
[    7.950021] RSP: 002b:00007ffdf2da3568 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[    7.950021] RAX: ffffffffffffffda RBX: 0000000000d771c0 RCX: 00007f85de145477
[    7.950021] RDX: 0000000000d35fb0 RSI: 0000000000d697e0 RDI: 0000000000d771c0
[    7.950021] RBP: 0000000000000000 R08: 00007ffdf2da3560 R09: 0000000000000030
[    7.950021] R10: 000000000000059a R11: 0000000000000202 R12: 0000000000d771c0
[    7.950021] R13: 0000000000d697e0 R14: 0000000000d35fb0 R15: 0000000000000000
[    7.950021] Object at ffff88006a915b00, in cache kmalloc-1024 size: 1024
[    7.950021] Allocated:
[    7.950021] PID = 72
[    7.950021]  save_stack_trace+0x1b/0x20
[    7.950021]  kasan_kmalloc+0xee/0x190
[    7.950021]  kmem_cache_alloc_node_trace+0x138/0x200
[    7.950021]  bdi_alloc_node+0x4c/0xa0
[    7.950021]  blk_alloc_queue_node+0xdd/0x870
[    7.950021]  blk_mq_init_queue+0x41/0x90
[    7.950021]  scsi_mq_alloc_queue+0x41/0x130 [scsi_mod]
[    7.950021]  scsi_alloc_sdev+0x90e/0xd00 [scsi_mod]
[    7.950021]  scsi_probe_and_add_lun+0x14b3/0x2250 [scsi_mod]
[    7.950021]  __scsi_scan_target+0x23d/0xb60 [scsi_mod]
[    7.950021]  scsi_scan_channel+0x107/0x160 [scsi_mod]
[    7.950021]  scsi_scan_host_selected+0x1bf/0x250 [scsi_mod]
[    7.950021]  do_scsi_scan_host+0x1bc/0x250 [scsi_mod]
[    7.950021]  do_scan_async+0x41/0x4b0 [scsi_mod]
[    7.950021]  async_run_entry_fn+0xc3/0x630
[    7.950021]  process_one_work+0x531/0xf40
[    7.950021]  worker_thread+0xe4/0x10d0
[    7.950021]  kthread+0x298/0x390
[    7.950021]  ret_from_fork+0x2c/0x40
[    7.950021] Freed:
[    7.950021] PID = 72
[    7.950021]  save_stack_trace+0x1b/0x20
[    7.950021]  kasan_slab_free+0xb0/0x180
[    7.950021]  kfree+0x9f/0x1d0
[    7.950021]  bdi_put+0x2a/0x30
[    7.950021]  blk_release_queue+0x51/0x320
[    7.950021]  kobject_put+0x154/0x430
[    7.950021]  blk_put_queue+0x15/0x20
[    7.950021]  scsi_device_dev_release_usercontext+0x59b/0x880 [scsi_mod]
[    7.950021]  execute_in_process_context+0xd9/0x130
[    7.950021]  scsi_device_dev_release+0x1c/0x20 [scsi_mod]
[    7.950021]  device_release+0x76/0x1e0
[    7.950021]  kobject_put+0x154/0x430
[    7.950021]  put_device+0x17/0x20
[    7.950021]  __scsi_remove_device+0x18d/0x250 [scsi_mod]
[    7.950021]  scsi_probe_and_add_lun+0x14d6/0x2250 [scsi_mod]
[    7.950021]  __scsi_scan_target+0x23d/0xb60 [scsi_mod]
[    7.950021]  scsi_scan_channel+0x107/0x160 [scsi_mod]
[    7.950021]  scsi_scan_host_selected+0x1bf/0x250 [scsi_mod]
[    7.950021]  do_scsi_scan_host+0x1bc/0x250 [scsi_mod]
[    7.950021]  do_scan_async+0x41/0x4b0 [scsi_mod]
[    7.950021]  async_run_entry_fn+0xc3/0x630
[    7.950021]  process_one_work+0x531/0xf40
[    7.950021]  worker_thread+0xe4/0x10d0
[    7.950021]  kthread+0x298/0x390
[    7.950021]  ret_from_fork+0x2c/0x40
[    7.950021] Memory state around the buggy address:
[    7.950021]  ffff88006a915d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    7.950021]  ffff88006a915d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    7.950021] >ffff88006a915e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    7.950021]                                      ^
[    7.950021]  ffff88006a915e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[    7.950021]  ffff88006a915f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
----

Any ideas Jan?

Thanks.

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help