Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk (also 4.12 stable)
From: Christian Borntraeger <hidden>
Date: 2017-11-21 10:14:50
Also in:
lkml
On 11/21/2017 10:50 AM, Christian Borntraeger wrote:
On 11/21/2017 09:35 AM, Christian Borntraeger wrote:quoted
On 11/20/2017 09:52 PM, Jens Axboe wrote:quoted
On 11/20/2017 01:49 PM, Christian Borntraeger wrote:quoted
On 11/20/2017 08:42 PM, Jens Axboe wrote:quoted
On 11/20/2017 12:29 PM, Christian Borntraeger wrote:quoted
On 11/20/2017 08:20 PM, Bart Van Assche wrote:quoted
On Fri, 2017-11-17 at 15:42 +0100, Christian Borntraeger wrote:quoted
This is b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1141) * are mapped to it. b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1142) */ 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1143) WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) && 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1144) cpu_online(hctx->next_cpu)); 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1145) b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1146) /*Did you really try to figure out when the code that reported the warning was introduced? I think that warning was introduced through the following commit:This was more a cut'n'paste to show which warning triggered since line numbers are somewhat volatile.quoted
commit fd1270d5df6a005e1248e87042159a799cc4b2c9 Date: Wed Apr 16 09:23:48 2014 -0600 blk-mq: don't use preempt_count() to check for right CPU UP or CONFIG_PREEMPT_NONE will return 0, and what we really want to check is whether or not we are on the right CPU. So don't make PREEMPT part of this, just test the CPU in the mask directly. Anyway, I think that warning is appropriate and useful. So the next step is to figure out what work item was involved and why that work item got executed on the wrong CPU.It seems to be related to virtio-blk (is triggered by fio on such disks). Your comment basically says: "no this is not a known issue" then :-) I will try to take a dump to find out the work itemblk-mq does not attempt to freeze/sync existing work if a CPU goes away, and we reconfigure the mappings. So I don't think the above is unexpected, if you are doing CPU hot unplug while running a fio job.I did a cpu hot plug (adding a CPU) and I started fio AFTER that.OK, that's different, we should not be triggering a warning for that. What does your machine/virtblk topology look like in terms of CPUS, nr of queues for virtblk, etc?FWIW, 4.11 does work, 4.12 and later is broken.In fact: 4.12 is fine, 4.12.14 is broken.
Bisect points to
1b5a7455d345b223d3a4658a9e5fce985b7998c1 is the first bad commit
commit 1b5a7455d345b223d3a4658a9e5fce985b7998c1
Author: Christoph Hellwig [off-list ref]
Date: Mon Jun 26 12:20:57 2017 +0200
blk-mq: Create hctx for each present CPU
commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream.
Currently we only create hctx for online CPUs, which can lead to a lot
of churn due to frequent soft offline / online operations. Instead
allocate one for each present CPU to avoid this and dramatically simplify
the code.
Signed-off-by: Christoph Hellwig [off-list ref]
Reviewed-by: Jens Axboe [off-list ref]
Cc: Keith Busch [off-list ref]
Cc: linux-block@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
Signed-off-by: Thomas Gleixner [off-list ref]
Cc: Oleksandr Natalenko [off-list ref]
Cc: Mike Galbraith [off-list ref]
Signed-off-by: Greg Kroah-Hartman [off-list ref]
:040000 040000 a61cb023014a7b7a6b9f24ea04fe8ab22299e706 059ba6dc3290c74e0468937348e580cd53f963e7 M block
:040000 040000 432e719d7e738ffcddfb8fc964544d3b3e0a68f7 f4572aa21b249a851a1b604c148eea109e93b30d M include
adding Christoph FWIW, your patch triggers the following on 4.14 when doing a cpu hotplug (adding a
CPU) and then accessing a virtio-blk device.
747.652408] ------------[ cut here ]------------
[ 747.652410] WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 __blk_mq_run_hw_queue+0xd4/0x100
[ 747.652410] Modules linked in: dm_multipath
[ 747.652412] CPU: 4 PID: 2895 Comm: kworker/4:1H Tainted: G W 4.14.0+ #191
[ 747.652412] Hardware name: IBM 2964 NC9 704 (KVM/Linux)
[ 747.652414] Workqueue: kblockd blk_mq_run_work_fn
[ 747.652414] task: 0000000060680000 task.stack: 000000005ea30000
[ 747.652415] Krnl PSW : 0704f00180000000 0000000000505864 (__blk_mq_run_hw_queue+0xd4/0x100)
[ 747.652417] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:3 PM:0 RI:0 EA:3
[ 747.652417] Krnl GPRS: 0000000000000010 00000000000000ff 000000005cbec400 0000000000000000
[ 747.652418] 0000000063709120 0000000000000000 0000000063709500 0000000059fa44b0
[ 747.652418] 0000000059fa4480 0000000000000000 000000006370f700 0000000063709100
[ 747.652419] 000000005cbec500 0000000000970948 000000005ea33d80 000000005ea33d48
[ 747.652423] Krnl Code: 0000000000505854: ebaff0a00004 lmg %r10,%r15,160(%r15)
000000000050585a: c0f4ffe690d3 brcl 15,1d7a00
#0000000000505860: a7f40001 brc 15,505862
>0000000000505864: 581003b0 l %r1,944
0000000000505868: c01b001fff00 nilf %r1,2096896
000000000050586e: a784ffdb brc 8,505824
0000000000505872: a7f40001 brc 15,505874
0000000000505876: 9120218f tm 399(%r2),32
[ 747.652435] Call Trace:
[ 747.652435] ([<0000000063709600>] 0x63709600)
[ 747.652436] [<0000000000187bcc>] process_one_work+0x264/0x4b8
[ 747.652438] [<0000000000187e78>] worker_thread+0x58/0x4f8
[ 747.652439] [<000000000018ee94>] kthread+0x144/0x168
[ 747.652439] [<00000000008f8a62>] kernel_thread_starter+0x6/0xc
[ 747.652440] [<00000000008f8a5c>] kernel_thread_starter+0x0/0xc
[ 747.652440] Last Breaking-Event-Address:
[ 747.652441] [<0000000000505860>] __blk_mq_run_hw_queue+0xd0/0x100
[ 747.652442] ---[ end trace 4a001a80379b18ba ]---
[ 747.652450] ------------[ cut here ]------------