Thread: 58 messages, 10 authors, 2022-09-06

Re: stalling IO regression since linux 5.12, through 5.18

From: Nikolay Borisov <hidden>
Date: 2022-08-16 15:25:43
Also in: linux-block, linux-btrfs, lkml


On 16.08.22 at 17:22, Chris Murphy wrote:

On Sun, Aug 14, 2022, at 4:28 PM, Chris Murphy wrote:
On Fri, Aug 12, 2022, at 2:02 PM, Jens Axboe wrote:
Might be worth trying to revert those from 5.12 to see if they are
causing the issue? Jan, Paolo - does this ring any bells?

git log --oneline --no-merges v5.11..c03c21ba6f4e > bisect.txt

I tried checking out a33df75c6328, which is right before the first bfq
commit, but that kernel won't boot the hardware.

Next I checked out v5.12, then reverted these commits in the order they
appear in the bisect.txt file:

7684fbde4516 bfq: Use only idle IO periods for think time calculations
28c6def00919 bfq: Use 'ttime' local variable
41e76c85660c bfq: Avoid false bfq queue merging
>>> a5bf0a92e1b8 bfq: bfq_check_waker() should be static
71217df39dc6 block, bfq: make waker-queue detection more robust
5a5436b98d5c block, bfq: save also injection state on queue merging
e673914d52f9 block, bfq: save also weight-raised service on queue merging
d1f600fa4732 block, bfq: fix switch back from soft-rt weitgh-raising
7f1995c27b19 block, bfq: re-evaluate convenience of I/O plugging on rq arrivals
eb2fd80f9d2c block, bfq: replace mechanism for evaluating I/O intensity
>>> 1a23e06cdab2 bfq: don't duplicate code for different paths
2391d13ed484 block, bfq: do not expire a queue when it is the only busy
one
3c337690d2eb block, bfq: avoid spurious switches to soft_rt of
interactive queues
91b896f65d32 block, bfq: do not raise non-default weights
ab1fb47e33dc block, bfq: increase time window for waker detection
d4fc3640ff36 block, bfq: set next_rq to waker_bfqq->next_rq in waker
injection
b5f74ecacc31 block, bfq: use half slice_idle as a threshold to check
short ttime

The two commits prefixed by >>> above were not previously mentioned by
Jens, but I reverted them anyway because they showed up in the git log
command.
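
For anyone who wants to repeat the revert step, here is a minimal sketch of how it could be scripted. It assumes a clean tree checked out at v5.12 and a bisect.txt produced by the git log command above. The function names bfq_hashes and revert_bfq are made up for illustration and are not what was actually run in this thread.

```shell
#!/bin/sh
# Print the hashes of bfq-related commits from a `git log --oneline` listing,
# in the order they appear in the file (newest first).
bfq_hashes() {
    # $1: path to the listing, e.g. bisect.txt
    # Keep lines whose subject starts with "bfq:" or "block, bfq:".
    awk '$2 == "bfq:" || ($2 == "block," && $3 == "bfq:") { print $1 }' "$1"
}

# Revert each listed commit on top of the current checkout.
# Assumption: clean tree at v5.12. Stops at the first failed revert.
revert_bfq() {
    for h in $(bfq_hashes "$1"); do
        git revert --no-edit "$h" || return 1
    done
}
```

Running `revert_bfq bisect.txt` would then walk the list from newest to oldest, which is the order least likely to produce conflicts. A failed revert leaves the tree in the conflicted state for manual inspection.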

OK, so within 10 minutes the problem still happens. This is the
block/bfq-iosched.c resulting from the above reverts, in case anyone
wants to double-check what I did:
https://drive.google.com/file/d/1ykU7MpmylJuXVobODWiiaLJk-XOiAjSt/view?usp=sharing

Any suggestions for further testing? I could try going further down the
bisect.txt list. The problem is that if the hardware falls over on an
unbootable kernel, I have to bug someone with LOM access. That's a
limited resource.

How about changing the scheduler to either mq-deadline or noop, just to see
if this is also reproducible with a different scheduler? I guess noop
would imply the blk cgroup controller is going to be disabled.
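
A rough sketch of the scheduler switch suggested above, using the per-device sysfs file /sys/block/<dev>/queue/scheduler. The device name is a placeholder and the helper names active_sched and set_sched are invented here. Note that on blk-mq-only kernels (5.0 and later) the no-op choice is listed as "none" rather than "noop".

```shell
#!/bin/sh
# Extract the active scheduler (the bracketed entry) from the contents of
# the sysfs scheduler file, e.g. "mq-deadline kyber [bfq] none".
active_sched() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Switch a device to another scheduler and print what is active afterwards.
# Needs root. $1: device name such as sda (placeholder), $2: scheduler name.
set_sched() {
    f="/sys/block/$1/queue/scheduler"
    echo "$2" > "$f" || return 1
    active_sched "$(cat "$f")"
}
```

For example, `set_sched sda mq-deadline` should print `mq-deadline` if the switch took effect. The change is not persistent across reboots, so it is cheap to try on a machine that is hard to recover.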