Thread (13 messages) 13 messages, 5 authors, 2021-11-02

Re: [bug report] block/005 hangs with NVMe device and linux-block/for-next

From: Shinichiro Kawasaki <hidden>
Date: 2021-11-02 08:28:52

On Nov 02, 2021 / 11:44, Ming Lei wrote:
On Tue, Nov 02, 2021 at 02:22:15AM +0000, Shinichiro Kawasaki wrote:
quoted
On Nov 01, 2021 / 17:01, Jens Axboe wrote:
quoted
On 11/1/21 6:41 AM, Jens Axboe wrote:
quoted
On 11/1/21 2:34 AM, Shinichiro Kawasaki wrote:
quoted
I tried the latest linux-block/for-next branch tip (git hash b43fadb6631f and
observed a process hang during blktests block/005 run on a NVMe device.
Kernel message reported "INFO: task check:1224 blocked for more than 122
seconds." with call trace [1]. So far, the hang is 100% reproducible with my
system. This hang is not observed with HDDs or null_blk devices.

I bisected and found the commit 4f5022453acd ("nvme: wire up completion batching
for the IRQ path") triggers the hang. When I revert this commit from the
for-next branch tip, the hang disappears. The block/005 test case does IO
scheduler switch during IO, and the completion path change by the commit looks
affecting the scheduler switch. Comments for solution will be appreciated.
I'll take a look at this.
I've tried running various things most of the day, and I cannot
reproduce this issue nor do I see what it could be. Even if requests are
split between batched completion and one-by-one completion, it works
just fine for me. No special care needs to be taken for put_many() on
the queue reference, as the wake_up() happens for the ref going to zero.

Tell me more about your setup. What does the runtimes of the test look
like? Do you have all schedulers enabled? What kind of NVMe device is
this?
Thank you for spending your precious time. With the kernel without the hang,
the test case completes around 20 seconds. When the hang happens, the check
script process stops at blk_mq_freeze_queue_wait() at scheduler change, and fio
workload processes stop at __blkdev_direct_IO_simple(). The test case does not
end, so I need to reboot the system for the next trial. While waiting the test
case completion, the kernel repeats the same INFO message every 2 minutes.

Regarding the scheduler, I compiled the kernel with mq-deadline and kyber.

The NVMe device I use is a U.2 NVMe ZNS SSD. It has a zoned name space and
a regular name space, and the hang is observed with both name spaces. I have
not yet tried other NVME devices, so I will try them.
quoted
FWIW, this is upstream now, so testing with Linus -git would be
preferable.
I see. I have switched from linux-block for-next branch to the upstream branch
of Linus. At git hash 879dbe9ffebc, and still the hang is observed.
Can you post the blk-mq debugfs log after the hang is triggered?

(cd /sys/kernel/debug/block/nvme0n1 && find . -type f -exec grep -aH . {} \;)
Thanks Ming. When I ran the command above, the grep command stopped when it
opened tag related files in the debugfs tree. That grep command looked hanking
also. So I used the find command below instead to exclude the tag related files.

# find . -type f -not -name *tag* -exec grep -aH . {} \;

Here I share the captured log.

./sched/queued:0 8 0
./sched/owned_by_driver:0 8 0
./sched/async_depth:192
./sched/starved:0
./sched/batching:1
./rqos/wbt/wb_background:4
./rqos/wbt/wb_normal:8
./rqos/wbt/unknown_cnt:0
./rqos/wbt/min_lat_nsec:2000000
./rqos/wbt/inflight:0: inflight 0
./rqos/wbt/inflight:1: inflight 0
./rqos/wbt/inflight:2: inflight 0
./rqos/wbt/id:0
./rqos/wbt/enabled:1
./rqos/wbt/curr_win_nsec:0
./hctx7/type:default
./hctx7/dispatch_busy:7
./hctx7/active:1024
./hctx7/run:5
./hctx7/ctx_map:00000000: 00
./hctx7/dispatch:000000003dfed3fd {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
./hctx7/dispatch:0000000077876d9e {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=193}
./hctx7/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx7/state:TAG_ACTIVE|SCHED_RESTART
./hctx6/type:default
./hctx6/dispatch_busy:5
./hctx6/active:1025
./hctx6/run:4
./hctx6/ctx_map:00000000: 00
./hctx6/dispatch:00000000c0b8e1c9 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
./hctx6/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx6/state:TAG_ACTIVE|SCHED_RESTART
./hctx5/type:default
./hctx5/dispatch_busy:5
./hctx5/active:1024
./hctx5/run:4
./hctx5/ctx_map:00000000: 00
./hctx5/dispatch:00000000aaf1e364 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
./hctx5/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx5/state:TAG_ACTIVE|SCHED_RESTART
./hctx4/type:default
./hctx4/dispatch_busy:0
./hctx4/active:1023
./hctx4/run:1
./hctx4/ctx_map:00000000: 00
./hctx4/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx3/type:default
./hctx3/dispatch_busy:5
./hctx3/active:1024
./hctx3/run:4
./hctx3/ctx_map:00000000: 00
./hctx3/dispatch:000000008b07d5e1 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
./hctx3/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx3/state:TAG_ACTIVE|SCHED_RESTART
./hctx2/type:default
./hctx2/dispatch_busy:5
./hctx2/active:1024
./hctx2/run:4
./hctx2/ctx_map:00000000: 00
./hctx2/dispatch:00000000c4887013 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=128}
./hctx2/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx2/state:TAG_ACTIVE|SCHED_RESTART
./hctx1/type:default
./hctx1/dispatch_busy:6
./hctx1/active:1024
./hctx1/run:5
./hctx1/ctx_map:00000000: 00
./hctx1/dispatch:00000000efe38e4e {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=0}
./hctx1/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx1/state:TAG_ACTIVE|SCHED_RESTART
./hctx0/type:default
./hctx0/dispatch_busy:5
./hctx0/active:1024
./hctx0/run:4
./hctx0/ctx_map:00000000: 00
./hctx0/dispatch:0000000015147095 {.op=READ, .cmd_flags=, .rq_flags=STARTED|ELVPRIV|IO_STAT|22, .state=idle, .tag=-1, .internal_tag=192}
./hctx0/flags:alloc_policy=FIFO SHOULD_MERGE|TAG_QUEUE_SHARED
./hctx0/state:TAG_ACTIVE|SCHED_RESTART
./write_hints:hint0: 0
./write_hints:hint1: 0
./write_hints:hint2: 0
./write_hints:hint3: 0
./write_hints:hint4: 0
./state:SAME_COMP|NONROT|IO_STAT|DISCARD|INIT_DONE|STATS|REGISTERED|PCI_P2PDMA|NOWAIT
./pm_only:0
./poll_stat:read  (512 Bytes): samples=0
./poll_stat:write (512 Bytes): samples=0
./poll_stat:read  (1024 Bytes): samples=0
./poll_stat:write (1024 Bytes): samples=0
./poll_stat:read  (2048 Bytes): samples=0
./poll_stat:write (2048 Bytes): samples=0
./poll_stat:read  (4096 Bytes): samples=0
./poll_stat:write (4096 Bytes): samples=0
./poll_stat:read  (8192 Bytes): samples=0
./poll_stat:write (8192 Bytes): samples=0
./poll_stat:read  (16384 Bytes): samples=0
./poll_stat:write (16384 Bytes): samples=0
./poll_stat:read  (32768 Bytes): samples=0
./poll_stat:write (32768 Bytes): samples=0
./poll_stat:read  (65536 Bytes): samples=0
./poll_stat:write (65536 Bytes): samples=0


-- 
Best Regards,
Shin'ichiro Kawasaki
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help