Thread: 60 messages, 6 authors, 2018-01-29

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

From: Ming Lei <hidden>
Date: 2018-01-19 18:25:00
Also in: dm-devel, lkml

On Fri, Jan 19, 2018 at 10:38:41AM -0700, Jens Axboe wrote:
On 1/19/18 9:37 AM, Ming Lei wrote:
On Fri, Jan 19, 2018 at 09:27:46AM -0700, Jens Axboe wrote:
On 1/19/18 9:26 AM, Ming Lei wrote:
On Fri, Jan 19, 2018 at 09:19:24AM -0700, Jens Axboe wrote:
On 1/19/18 9:05 AM, Ming Lei wrote:
On Fri, Jan 19, 2018 at 08:48:55AM -0700, Jens Axboe wrote:
On 1/19/18 8:40 AM, Ming Lei wrote:
Where does the dm STS_RESOURCE error usually come from - what exact
resource are we running out of?
It is from blk_get_request(underlying queue), see
multipath_clone_and_map().
That's what I thought. So for a low queue depth underlying queue, it's
quite possible that this situation can happen. Two potential solutions
I see:

1) As described earlier in this thread, having a mechanism for being
   notified when the scarce resource becomes available. It would not
   be hard to tap into the existing sbitmap wait queue for that.

2) Have dm set BLK_MQ_F_BLOCKING and just sleep on the resource
   allocation. I haven't read the dm code to know if this is a
   possibility or not.

I'd probably prefer #1. It's a classic case of trying to get the
request, and if it fails, add ourselves to the sbitmap tag wait
queue head, retry, and bail if that also fails. Connecting the
scarce resource and the consumer is the only way to really fix
this, without bogus arbitrary delays.
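The "try, add ourselves to the wait queue, retry, bail" sequence described above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct pool`, `struct waiter`, `pool_try_get()`, `pool_put()`, `get_or_wait()` and `rerun_queue()` are hypothetical names invented for this sketch, not the kernel's sbitmap or blk-mq API, and locking is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical consumer waiting for a scarce resource (e.g. a request tag). */
struct waiter {
    void (*wake)(struct waiter *w); /* called when a resource is freed */
    bool queued;                    /* currently on the pool's wait list */
    bool woken;                     /* set by the example wake callback */
    struct waiter *next;
};

/* Hypothetical resource pool with a wait list; stands in for a tag map. */
struct pool {
    int free_tags;
    struct waiter *waiters;
};

static bool pool_try_get(struct pool *p)
{
    if (p->free_tags > 0) {
        p->free_tags--;
        return true;
    }
    return false;
}

static void pool_add_waiter(struct pool *p, struct waiter *w)
{
    if (w->queued)
        return;
    w->queued = true;
    w->next = p->waiters;
    p->waiters = w;
}

static void pool_del_waiter(struct pool *p, struct waiter *w)
{
    for (struct waiter **pp = &p->waiters; *pp; pp = &(*pp)->next) {
        if (*pp == w) {
            *pp = w->next;
            w->next = NULL;
            w->queued = false;
            return;
        }
    }
}

/* Free one resource and notify one waiter, if any is registered. */
static void pool_put(struct pool *p)
{
    struct waiter *w = p->waiters;

    p->free_tags++;
    if (w) {
        p->waiters = w->next;
        w->next = NULL;
        w->queued = false;
        w->wake(w);
    }
}

/* Example wake callback; a real consumer would rerun its queue here. */
static void rerun_queue(struct waiter *w)
{
    w->woken = true;
}

/*
 * Try, register, retry, bail. Registering before the retry closes the
 * window in which the resource is freed between the first failed attempt
 * and the waiter being added, so no wakeup is lost.
 */
static bool get_or_wait(struct pool *p, struct waiter *w)
{
    if (pool_try_get(p))
        return true;
    pool_add_waiter(p, w);
    if (pool_try_get(p)) {
        pool_del_waiter(p, w);
        return true;
    }
    return false; /* caller stops dispatching until w->wake() fires */
}
```

With a pool of depth one, a second `get_or_wait()` call fails and leaves the waiter registered; the next `pool_put()` invokes its callback directly, so the consumer is restarted by the resource becoming available rather than by a timer.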
Right, as I replied to Bart, using mod_delayed_work_on() and
returning BLK_STS_NO_DEV_RESOURCE (or some such name) for the scarce
resource should fix this issue.
It'll fix the forever stall, but it won't really fix it, as we'll slow
down the dm device by some random amount.

A simple test case would be to have a null_blk device with a queue depth
of one, and dm on top of that. Start a fio job that runs two jobs: one
that does IO to the underlying device, and one that does IO to the dm
device. If the job on the dm device runs substantially slower than the
one to the underlying device, then the problem isn't really fixed.
I remember trying this test on scsi-debug and on dm-mpath over scsi-debug,
and I did not observe this issue. Could you explain a bit why IO over
dm-mpath may be slower? Both IO contexts call the same get_request(), and
in theory dm-mpath should be a bit quicker since it uses direct issue for
the underlying queue, without an io scheduler involved.
Because if you lose the race for getting the request, you'll have some
arbitrary delay before trying again, potentially. Compared to the direct
But the restart still works: once one request completes, the queue
is run again immediately because we use mod_delayed_work_on(0), so
there appears to be no such issue.
There are no pending requests for this case, nothing to restart the
queue. When you fail that blk_get_request(), you are idle, nothing
is pending.
I think we needn't worry about that: once a device is attached to
dm-rq, it can't be mounted any more, and users usually don't use the
device directly and through dm-mpath at the same time.
Here's an example of that, using my current block tree (merged into
master).  The setup is dm-mpath on top of null_blk, the latter having
just a single request. Both are mq devices.

Fio direct 4k random reads on dm_mq: ~250K iops

Start dd on underlying device (or partition on same device), just doing
sequential reads.

Fio direct 4k random reads on dm_mq with dd running: 9 iops

No schedulers involved.

https://i.imgur.com/WTDnnwE.gif
This DM-specific issue might be addressed by applying a notifier_chain
(or a similar mechanism) between the two queues; I will think about the
details tomorrow.


-- 
Ming