Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart
From: Ming Lei <hidden>
Date: 2017-10-24 10:04:34
Also in:
lkml
On Mon, Oct 23, 2017 at 06:12:29PM +0200, Roman Penyaev wrote:
Hi Ming, On Fri, Oct 20, 2017 at 3:39 PM, Ming Lei [off-list ref] wrote:quoted
On Wed, Oct 18, 2017 at 12:22:06PM +0200, Roman Pen wrote:quoted
Hi all, the patch below fixes queue stalling when shared hctx marked for restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero. The root cause is that hctxs are shared between queues, but 'shared_hctx_restart' belongs to the particular queue, which in fact may not need to be restarted, thus we return from blk_mq_sched_restart() and leave shared hctx of another queue never restarted. The fix is to make shared_hctx_restart counter belong not to the queue, but to tags, thereby counter will reflect real number of shared hctx needed to be restarted. During tests 1 hctx (set->nr_hw_queues) was used and all stalled requests were noticed in dd->fifo_list of mq-deadline scheduler. Seeming possible sequence of events: 1. Request A of queue A is inserted into dd->fifo_list of the scheduler. 2. Request B of queue A bypasses scheduler and goes directly to hctx->dispatch. 3. Request C of queue B is inserted. 4. blk_mq_sched_dispatch_requests() is invoked, since hctx->dispatch is not empty (request B is in the list) hctx is only marked for for next restart and request A is left in a list (see comment "So it's best to leave them there for as long as we can. Mark the hw queue as needing a restart in that case." in blk-mq-sched.c) 5. Eventually request B is completed/freed and blk_mq_sched_restart() is called, but by chance hctx from queue B is chosen for restart and request C gets a chance to be dispatched. 6. Eventually request C is completed/freed and blk_mq_sched_restart() is called, but shared_hctx_restart for queue B is zero and we return without attempt to restart hctx from queue A, thus request A is stuck forever. But stalling queue is not the only one problem with blk_mq_sched_restart(). My tests show that those loops thru all queues and hctxs can be very costly, even with shared_hctx_restart counter, which aims to fix performance issue. For my tests I create 128 devices with 64 hctx each, which share same tags set.Hi Roman, I also find the performance issue with RESTART for TAG_SHARED. But from my analysis, RESTART isn't needed for TAG_SHARED because SCSI-MQ has handled the RESTART by itself already, so could you test the patch in following link posted days ago to see if it fixes your issue?I can say without any testing: it fixes all the issues :) You've reverted 8e8320c9315c ("blk-mq: fix performance regression with shared tags") 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared") with one major difference: you do not handle shared tags in a special way and restart only requested hctx, instead of iterating over all hctxs in a queue.
Hi Roman, Yeah, that is unnecessary as I explained in detail in the commit log and introduces lots of overhead, and I can see IOPS is improved by 20%~30% in my SCSI_DEBUG test(only 8 luns) by removing the blk-mq restart for TAG_SHARED. Also it isn't needed for BLK_STS_RESOURCE returned from .get_budget(). I will post out V3 soon.
Firstly I have to say that queue stalling issue(#1) and performance issue(#2) were observed on our in-house rdma driver IBNBD: https://lwn.net/Articles/718181/ and I've never tried to reproduce them on SCSI-MQ.
OK, got it, then you can handle it in SCSI-MQ's way since this way is used by non-MQ for long time. Or you need to do nothing if your driver is SCSI based.
Then your patch brakes RR restarts, which were introduced by this commit
6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared").
Seems basic idea of that patch is nice, but because of possible big
amount of queues and hctxs patch requires reimplementation. Eventually
you should get fast hctx restart but in RR fashion. According to my
understanding that does not contradict with your patch.Firstly this way has been run/verified for long time in non-mq, and no one complained it before. Secondly please see scsi_target_queue_ready() and scsi_host_queue_ready(), the sdev is added into the 'starved_list' in FIFO style, which is still fair. So I don't think it is an issue to remove the RR restarts. Thanks, Ming