Re: 4.14: WARNING: CPU: 4 PID: 2895 at block/blk-mq.c:1144 with virtio-blk
From: Christian Borntraeger <hidden>
Date: 2017-11-20 20:49:15
Also in: lkml
On 11/20/2017 08:42 PM, Jens Axboe wrote:
> On 11/20/2017 12:29 PM, Christian Borntraeger wrote:
>> On 11/20/2017 08:20 PM, Bart Van Assche wrote:
>>> On Fri, 2017-11-17 at 15:42 +0100, Christian Borntraeger wrote:
>>>> This is
>>>> b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1141) * are mapped to it.
>>>> b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1142) */
>>>> 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1143) WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
>>>> 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1144) cpu_online(hctx->next_cpu));
>>>> 6a83e74d2 (Bart Van Assche 2016-11-02 10:09:51 -0600 1145)
>>>> b7a71e66d (Jens Axboe 2017-08-01 09:28:24 -0600 1146) /*
>>>
>>> Did you really try to figure out when the code that reported the warning
>>> was introduced? I think that warning was introduced through the following
>>> commit:
>>
>> This was more a cut'n'paste to show which warning triggered since line
>> numbers are somewhat volatile.
>>
>>> commit fd1270d5df6a005e1248e87042159a799cc4b2c9
>>> Date:   Wed Apr 16 09:23:48 2014 -0600
>>>
>>>     blk-mq: don't use preempt_count() to check for right CPU
>>>
>>>     UP or CONFIG_PREEMPT_NONE will return 0, and what we really
>>>     want to check is whether or not we are on the right CPU.
>>>     So don't make PREEMPT part of this, just test the CPU in
>>>     the mask directly.
>>>
>>> Anyway, I think that warning is appropriate and useful. So the next step
>>> is to figure out what work item was involved and why that work item got
>>> executed on the wrong CPU.
>>
>> It seems to be related to virtio-blk (is triggered by fio on such disks).
>> Your comment basically says: "no this is not a known issue" then :-)
>> I will try to take a dump to find out the work item
>
> blk-mq does not attempt to freeze/sync existing work if a CPU goes away,
> and we reconfigure the mappings. So I don't think the above is unexpected,
> if you are doing CPU hot unplug while running a fio job.

I did a cpu hot plug (adding a CPU) and I started fio AFTER that.

> While it's a bit annoying that we trigger the WARN_ON() for a condition
> that can happen, we're basically interested in it if it triggers for
> normal operations.

I think we should never trigger a WARN_ON() for conditions that can
legitimately happen. I know of people who enable panic_on_warn to detect
or avoid data integrity issues, and on such systems every WARN_ON() turns
into a panic. FWIW, this also seems to happen with 4.13 and 4.12.