Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

From: van der Linden, Frank <hidden>
Date: 2018-08-20 20:19:38
Also in: lkml

On 8/20/18 12:29 PM, Jens Axboe wrote:=0A=

On 8/20/18 1:08 PM, Jens Axboe wrote:=0A=

quoted

On 8/20/18 11:34 AM, van der Linden, Frank wrote:=0A=

quoted

On 8/20/18 9:37 AM, Jens Axboe wrote:=0A=

quoted

On 8/7/18 3:19 PM, Jens Axboe wrote:=0A=

quoted

On 8/7/18 3:12 PM, Anchal Agarwal wrote:=0A=

quoted

On Tue, Aug 07, 2018 at 02:39:48PM -0600, Jens Axboe wrote:=0A=

quoted

On 8/7/18 2:12 PM, Anchal Agarwal wrote:=0A=

quoted

On Tue, Aug 07, 2018 at 08:29:44AM -0600, Jens Axboe wrote:=0A=

quoted

On 8/1/18 4:09 PM, Jens Axboe wrote:=0A=

quoted

On 8/1/18 11:06 AM, Anchal Agarwal wrote:=0A=

quoted

On Wed, Aug 01, 2018 at 09:14:50AM -0600, Jens Axboe wrote:=0A=

quoted

On 7/31/18 3:34 PM, Anchal Agarwal wrote:=0A=

quoted

Hi folks,=0A=
=0A=
This patch modifies commit e34cbd307477a=0A=
(blk-wbt: add general throttling mechanism)=0A=
=0A=
I am currently running a large bare metal instance (i3.metal)=

=0A=

quoted

on EC2 with 72 cores, 512GB of RAM and NVME drives, with a=0A=
4.18 kernel. I have a workload that simulates a database=0A=
workload and I am running into lockup issues when writeback=

=0A=

quoted

throttling is enabled,with the hung task detector also=0A=
kicking in.=0A=
=0A=
Crash dumps show that most CPUs (up to 50 of them) are=0A=
all trying to get the wbt wait queue lock while trying to add=

=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
themselves to it in __wbt_wait (see stack traces below).=0A=
=0A=
[    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.=

14.51-62.38.amzn1.x86_64 #1=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specifi=

ed, BIOS 1.0 10/16/2017=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c6=

9c000=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf=

8/0x1a0=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046=0A=
[    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RC=

X: ffff883f7fce2a00=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RD=

I: ffff887f7709ca68=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R0=

9: 0000000000000000=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R1=

2: 0000000000000002=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948131] R13: 0000000000000003 R14: 0000000000000000 R1=

5: 0000000000000000=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc000=

0(0000) knlGS:0000000000000000=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050=

033=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR=

4: 00000000003606e0=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR=

2: 0000000000000000=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR=

7: 0000000000000400=0A=

quoted

[    0.948138] Call Trace:=0A=
[    0.948139]  <IRQ>=0A=
[    0.948142]  do_raw_spin_lock+0xad/0xc0=0A=
[    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b=0A=
[    0.948149]  ? __wake_up_common_lock+0x53/0x90=0A=
[    0.948150]  __wake_up_common_lock+0x53/0x90=0A=
[    0.948155]  wbt_done+0x7b/0xa0=0A=
[    0.948158]  blk_mq_free_request+0xb7/0x110=0A=
[    0.948161]  __blk_mq_complete_request+0xcb/0x140=0A=
[    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]=0A=
[    0.948169]  nvme_irq+0x23/0x50 [nvme]=0A=
[    0.948173]  __handle_irq_event_percpu+0x46/0x300=0A=
[    0.948176]  handle_irq_event_percpu+0x20/0x50=0A=
[    0.948179]  handle_irq_event+0x34/0x60=0A=
[    0.948181]  handle_edge_irq+0x77/0x190=0A=
[    0.948185]  handle_irq+0xaf/0x120=0A=
[    0.948188]  do_IRQ+0x53/0x110=0A=
[    0.948191]  common_interrupt+0x87/0x87=0A=
[    0.948192]  </IRQ>=0A=
....=0A=
[    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tai=

nted 4.14.51-62.38.amzn1.x86_64 #1=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specifi=

ed, BIOS 1.0 10/16/2017=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1=

ec000=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf=

5/0x1a0=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046=0A=
[    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RC=

X: ffff883f7f722a00=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RD=

I: ffff887f7709ca68=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R0=

9: 0000000000000000=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R1=

2: ffff887f7709ca68=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R1=

5: ffff887f7709ca00=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f70000=

0(0000) knlGS:0000000000000000=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050=

033=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR=

4: 00000000003606e0=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR=

2: 0000000000000000=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
[    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR=

7: 0000000000000400=0A=

quoted

[    0.311154] Call Trace:=0A=
[    0.311157]  do_raw_spin_lock+0xad/0xc0=0A=
[    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b=0A=
[    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0=0A=
[    0.311164]  prepare_to_wait_exclusive+0x28/0xb0=0A=
[    0.311167]  wbt_wait+0x127/0x330=0A=
[    0.311169]  ? finish_wait+0x80/0x80=0A=
[    0.311172]  ? generic_make_request+0xda/0x3b0=0A=
[    0.311174]  blk_mq_make_request+0xd6/0x7b0=0A=
[    0.311176]  ? blk_queue_enter+0x24/0x260=0A=
[    0.311178]  ? generic_make_request+0xda/0x3b0=0A=
[    0.311181]  generic_make_request+0x10c/0x3b0=0A=
[    0.311183]  ? submit_bio+0x5c/0x110=0A=
[    0.311185]  submit_bio+0x5c/0x110=0A=
[    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]=0A=
[    0.311210]  ext4_io_submit+0x48/0x60 [ext4]=0A=
[    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]=0A=
[    0.311229]  ? do_writepages+0x3c/0xd0=0A=
[    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]=0A=
[    0.311240]  do_writepages+0x3c/0xd0=0A=
[    0.311243]  ? _raw_spin_unlock+0x24/0x30=0A=
[    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280=0A=
[    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0=0A=
[    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0=0A=
[    0.311253]  file_write_and_wait_range+0x34/0x90=0A=
[    0.311264]  ext4_sync_file+0x151/0x500 [ext4]=0A=
[    0.311267]  do_fsync+0x38/0x60=0A=
[    0.311270]  SyS_fsync+0xc/0x10=0A=
[    0.311272]  do_syscall_64+0x6f/0x170=0A=
[    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7=0A=
=0A=
In the original patch, wbt_done is waking up all the exclusiv=

e=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
processes in the wait queue, which can cause a thundering her=

d=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
if there is a large number of writer threads in the queue. Th=

e=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
original intention of the code seems to be to wake up one thr=

ead=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
only however, it uses wake_up_all() in __wbt_done(), and then=

=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
uses the following check in __wbt_wait to have only one threa=

d=0A=

quoted

actually get out of the wait loop:=0A=
=0A=
if (waitqueue_active(&rqw->wait) &&=0A=
            rqw->wait.head.next !=3D &wait->entry)=0A=
                return false;=0A=
=0A=
The problem with this is that the wait entry in wbt_wait is=

=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
define with DEFINE_WAIT, which uses the autoremove wakeup fun=

ction.=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
That means that the above check is invalid - the wait entry w=

ill=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
have been removed from the queue already by the time we hit t=

he=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
check in the loop.=0A=
=0A=
Secondly, auto-removing the wait entries also means that the =

wait=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
queue essentially gets reordered "randomly" (e.g. threads re-=

add=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
themselves in the order they got to run after being woken up)=

.=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
Additionally, new requests entering wbt_wait might overtake r=

equests=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
that were queued earlier, because the wait queue will be=0A=
(temporarily) empty after the wake_up_all, so the waitqueue_a=

ctive=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
check will not stop them. This can cause certain threads to s=

tarve=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
under high load.=0A=
=0A=
The fix is to leave the woken up requests in the queue and re=

move=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
them in finish_wait() once the current thread breaks out of t=

he=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
wait loop in __wbt_wait. This will ensure new requests always=

=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
end up at the back of the queue, and they won't overtake requ=

ests=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
that are already in the wait queue. With that change, the loo=

p=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
in wbt_wait is also in line with many other wait loops in the=

 kernel.=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
Waking up just one thread drastically reduces lock contention=

, as=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
does moving the wait queue add/remove out of the loop.=0A=
=0A=
A significant drop in lockdep's lock contention numbers is se=

en when=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
running the test application on the patched kernel.=0A=
I like the patch, and a few weeks ago we independently discove=

red that=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
the waitqueue list checking was bogus as well. My only worry i=

s that=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
changes like this can be delicate, meaning that it's easy to i=

ntroduce=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
stall conditions. What kind of testing did you push this throu=

gh?=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
=0A=
-- =0A=
Jens Axboe=0A=
=0A=
I ran the following tests on both real HW with NVME devices att=

ached=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
and emulated NVME too:=0A=
=0A=
1. The test case I used to reproduce the issue, spawns a bunch =

of threads =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
   to concurrently read and write files with random size and co=

ntent. =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
   Files are randomly fsync'd. The implementation is a FIFO que=

ue of files. =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
   When the queue fills the test starts to verify and remove th=

e files. This =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
   test will fail if there's a read, write, or hash check failu=

re. It tests=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
   for file corruption when lots of small files are being read =

and written =0A=

quoted

   with high concurrency.=0A=
=0A=
2. Fio for random writes with a root NVME device of 200GB=0A=
  =0A=
  fio --name=3Drandwrite --ioengine=3Dlibaio --iodepth=3D1 --rw=

=3Drandwrite --bs=3D4k =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
  --direct=3D0 --size=3D10G --numjobs=3D2 --runtime=3D60 --grou=

p_reporting=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
  =0A=
  fio --name=3Drandwrite --ioengine=3Dlibaio --iodepth=3D1 --rw=

=3Drandwrite --bs=3D4k=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
  --direct=3D0 --size=3D5G --numjobs=3D2 --runtime=3D30 --fsync=

=3D64 --group_reporting=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
  =0A=
  I did see an improvement in the bandwidth numbers reported on=

 the patched=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
  kernel. =0A=
=0A=
Do you have any test case/suite in mind that you would suggest =

me to =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
run to be sure that patch does not introduce any stall conditio=

ns?=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
One thing that is always useful is to run xfstest, do a full run=

 on=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
the device. If that works, then do another full run, this time l=

imiting=0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
the queue depth of the SCSI device to 1. If both of those pass, =

then=0A=

quoted

I'd feel pretty good getting this applied for 4.19.=0A=

Did you get a chance to run this full test?=0A=
=0A=
-- =0A=
Jens Axboe=0A=
=0A=
=0A=

Hi Jens,=0A=
Yes I did run the tests and was in the process of compiling concre=

te results=0A=

quoted

I tested following environments against xfs/auto group=0A=
1. Vanilla 4.18.rc kernel=0A=
2. 4.18 kernel with the blk-wbt patch=0A=
3. 4.18 kernel with the blk-wbt patch + io_queue_depth=3D2. I =0A=
understand you asked for queue depth for SCSI device=3D1 however, =

I have NVME =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
devices in my environment and 2 is the minimum value for io_queue_=

depth allowed =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
according to the NVME driver code. The results pretty much look sa=

me with no =0A=

quoted

stalls or exceptional failures. =0A=
xfs/auto ran 296 odd tests with 3 failures and 130 something "no r=

uns". =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
Remaining tests passed. "Skipped tests"  were mostly due to missin=

g features=0A=

quoted

(eg: reflink support on scratch filesystem)=0A=
The failures were consistent across runs on 3 different environmen=

ts. =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
I am also running full test suite but it is taking long time as I =

am =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
hitting kernel BUG in xfs code in some generic tests. This BUG is =

not =0A=

quoted
quoted
quoted
quoted
quoted
quoted
quoted
related to the patch and  I see them in vanilla kernel too. I am i=

n =0A=

quoted

the process of excluding these kind of tests as they come and =0A=
re-run the suite however, this proces is time taking. =0A=
Do you have any specific tests in mind that you would like me =0A=
to run apart from what I have already tested above?=0A=

Thanks, I think that looks good. I'll get your patch applied for=0A=
4.19.=0A=
=0A=
-- =0A=
Jens Axboe=0A=
=0A=
=0A=

Hi Jens,=0A=
Thanks for accepting this. There is one small issue, I don't find an=

y emails=0A=

quoted
quoted
quoted
quoted
quoted
send by me on the lkml mailing list. I am not sure why it didn't lan=

d there,=0A=

quoted
quoted
quoted
quoted
quoted
all I can see is your responses. Do you want one of us to resend the=

 patch=0A=

quoted

or will you be able to do it?=0A=

That's odd, are you getting rejections on your emails? For reference,=

 the=0A=

quoted

patch is here:=0A=
=0A=
http://git.kernel.dk/cgit/linux-block/commit/?h=3Dfor-4.19/block&id=

=3D2887e41b910bb14fd847cf01ab7a5993db989d88=0A=

quoted

One issue with this, as far as I can tell. Right now we've switched to=

=0A=

quoted

waking one task at the time, which is obviously more efficient. But if=

=0A=

quoted

we do that with exclusive waits, then we have to ensure that this task=

=0A=

quoted

makes progress. If we wake up a task, and then fail to get a queueing=

=0A=

quoted

token, then we'll go back to sleep. We need to ensure that someone mak=

es=0A=

quoted

forward progress at this point. There are two ways I can see that=0A=
happening:=0A=
=0A=
1) The task woken _always_ gets to queue an IO=0A=
2) If the task woken is NOT allowed to queue an IO, then it must selec=

t=0A=

quoted

   a new task to wake up. That new task is then subjected to rule 1 or=

 2=0A=

quoted

   as well.=0A=
=0A=
For #1, it could be as simple as:=0A=
=0A=
if (slept || !rwb_enabled(rwb)) {=0A=
	atomic_inc(&rqw->inflight);=0A=
	break;=0A=
}=0A=
=0A=
but this obviously won't always be fair. Might be good enough however,=

=0A=

quoted

instead of having to eg replace the generic wait queues with a priorit=

y=0A=

quoted

list/queue.=0A=
=0A=
Note that this isn't an entirely new issue, it's just so much easier t=

o=0A=

quoted

hit with the single wakeups.=0A=
=0A=

Hi Jens,=0A=
=0A=
What is the scenario that you see under which the woken up task does no=

t=0A=

quoted

get to run?=0A=

That scenario is pretty easy to hit - let's say the next in line task=0A=
has a queue limit of 1, and we currently have 4 pending. Task gets=0A=
woken, goes back to sleep. Which should be totally fine. At some point=

=0A=

quoted

we'll get below the limit, and allow the task to proceed. This will=0A=
ensure forward progress.=0A=
=0A=

quoted

The theory behind leaving the task on the wait queue is that the=0A=
waitqueue_active check in wbt_wait prevents new tasks from taking up a=

=0A=

quoted

slot in the queue (e.g. incrementing inflight). So, there should not be=

=0A=

quoted

a way for inflight to be incremented between the time the wake_up is=0A=
done and the task at the head of the wait queue runs. That's the idea=

=0A=

quoted

anyway :-) If we missed something, let us know.=0A=

And that's a fine theory, I think it's a good improvement (and how it=0A=
should have worked). I'm struggling to see where the issue is. Perhaps=

=0A=

quoted

it's related to the wq active check. With fewer wakeups, we're more=0A=
likely to hit a race there.=0A=
=0A=
I'll poke at it...=0A=

Trying something like this:=0A=
=0A=
http://git.kernel.dk/cgit/linux-block/log/?h=3Dfor-4.19/wbt=0A=
=0A=

Ah, now I see what you mean.=0A=
=0A=
This is the case where a task goes to sleep, not because the inflight=0A=
limit has been reached, but simply because it needs to go to the back of=0A=
the wait queue.=0A=
=0A=
In that case, it should, for its first time inside the loop, not try to=0A=
decrement inflight - since that means it could still race to overtake a=0A=
task that got there earlier and is in the wait queue.=0A=
=0A=
So what you are doing is keeping track of whether it got in to the loop=0A=
only because of queueing, and then you don't try to decrement inflight=0A=
the first time around the loop.=0A=
=0A=
I think that should work to fix that corner case.=0A=
=0A=
Frank=0A=
=0A=

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help