Thread (35 messages) 35 messages, 6 authors, 2018-08-22

Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

From: van der Linden, Frank <hidden>
Date: 2018-08-22 16:45:00
Also in: lkml

On 8/22/18 7:27 AM, Jens Axboe wrote:=0A=
On 8/22/18 6:54 AM, Holger Hoffst=E4tte wrote:=0A=
quoted
On 08/22/18 06:10, Jens Axboe wrote:=0A=
quoted
[...]=0A=
If you have time, please look at the 3 patches I posted earlier today.=
=0A=
quoted
quoted
Those are for mainline, so should be OK :-)=0A=
I'm just playing along at home but with those 3 I get repeatable=0A=
hangs & writeback not starting at all, but curiously *only* on my btrfs=
=0A=
quoted
device; for inexplicable reasons some other devices with ext4/xfs flush=
=0A=
quoted
properly. Yes, that surprised me too, but it's repeatable.=0A=
Now this may or may not have something to do with some of my in-testing=
=0A=
quoted
patches for btrfs itself, but if I remove those 3 wbt fixes, everything=
=0A=
quoted
is golden again. Not eager to repeat since it hangs sync & requires a=0A=
hard reboot.. :(=0A=
Just thought you'd like to know.=0A=
Thanks, that's very useful info! I'll see if I can reproduce that.=0A=
=0A=
I think it might be useful to kind of give a dump of what we discussed=0A=
before this patch was sent, there was a little more than was in the=0A=
description.=0A=
=0A=
We saw hangs and heavy lock contention in the wbt code under a specific=0A=
workload, on XFS. Crash dump analysis showed the following issues:=0A=
=0A=
1) wbt_done uses wake_up_all, which causes a thundering herd=0A=
2) __wbt_wait sets up a wait queue with the auto remove wake function=0A=
(via DEFINE_WAIT), which caused two problems:=0A=
   * combined with the use of wake_up_all, the wait queue would=0A=
essentially be randomly reordered for tasks that did not get to run=0A=
   * the waitqueue_active check in may_queue was not valid with the auto=0A=
remove function, which could lead incoming tasks with requests to=0A=
overtake existing requests=0A=
=0A=
1) was fixed by using a plain wake_up=0A=
2) was fixed by keeping tasks on the queue until they could break out of=0A=
the wait loop in __wbt_wait=0A=
=0A=
=0A=
The random reordering, causing task starvation in __wbt_wait, was the=0A=
main problem. Simply not using the auto remove wait function, e.g.=0A=
*only* changing DEFINE_WAIT(wait) to DEFINE_WAIT_FUNC(wait,=0A=
default_wake_function), fixed the hang / task starvation issue in our=0A=
tests. But there was still more lock contention than there should be, so=0A=
we also changed wake_up_all to wake_up.=0A=
=0A=
It might be useful to run your tests with only the DEFINE_WAIT change I=0A=
describe above added to the original code to see if that still has any=0A=
problems. That would give a good datapoint whether any remaining issues=0A=
are due to missed wakeups or not.=0A=
=0A=
There is the issue of making forward progress, or at least making it=0A=
fast enough. With the changes as they stand now, you could come up with=0A=
a scenario where the throttling limit is hit, but then is raised. Since=0A=
there might still be a wait queue, you could end up putting each=0A=
incoming task to sleep, even though it's not needed.=0A=
=0A=
One way to guarantee that the wait queue clears up as fast as possible,=0A=
without resorting to wakeup_all, is to use wakeup_nr, where the number=0A=
of tasks to wake up is (limit - inflight).=0A=
=0A=
Also, to avoid having tasks going back to sleep in the loop, you could=0A=
do what you already proposed - always just sleep at most once, and=0A=
unconditionally proceed after waking up. To avoid incoming tasks=0A=
overtaking the ones that are being woken up, you could have wbt_done=0A=
increment inflight, effectively reserving a spot for the tasks that are=0A=
about to be woken up.=0A=
=0A=
Another thing I thought about was recording the number of waiters in the=0A=
wait queue, and modify the check from (inflight < limit) to (inflight <=0A=
(limit - nwaiters)), and no longer use any waitqueue_active checks.=0A=
=0A=
The condition checks are of course complicated by the fact that=0A=
condition manipulation is not always done under the same lock (e.g.=0A=
wbt_wait can be called with a NULL lock).=0A=
=0A=
=0A=
So, these are just some of the things to consider here - maybe there's=0A=
nothing in there that you hadn't already considered, but I thought it'd=0A=
be useful to summarize them.=0A=
=0A=
Thanks for looking in to this!=0A=
=0A=
Frank=0A=
=0A=
=0A=
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help