Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart

From: Ming Lei <hidden>
Date: 2017-10-24 10:04:34
Also in: lkml

On Mon, Oct 23, 2017 at 06:12:29PM +0200, Roman Penyaev wrote:

Hi Ming,

On Fri, Oct 20, 2017 at 3:39 PM, Ming Lei [off-list ref] wrote:

quoted

On Wed, Oct 18, 2017 at 12:22:06PM +0200, Roman Pen wrote:

quoted

Hi all,

the patch below fixes queue stalling when shared hctx marked for restart
(BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.  The
root cause is that hctxs are shared between queues, but 'shared_hctx_restart'
belongs to the particular queue, which in fact may not need to be restarted,
thus we return from blk_mq_sched_restart() and leave shared hctx of another
queue never restarted.

The fix is to make shared_hctx_restart counter belong not to the queue, but
to tags, thereby counter will reflect real number of shared hctx needed to
be restarted.

During tests 1 hctx (set->nr_hw_queues) was used and all stalled requests
were noticed in dd->fifo_list of mq-deadline scheduler.

Seeming possible sequence of events:

1. Request A of queue A is inserted into dd->fifo_list of the scheduler.

2. Request B of queue A bypasses scheduler and goes directly to
   hctx->dispatch.

3. Request C of queue B is inserted.

4. blk_mq_sched_dispatch_requests() is invoked, since hctx->dispatch is not
   empty (request B is in the list) hctx is only marked for for next restart
   and request A is left in a list (see comment "So it's best to leave them
   there for as long as we can. Mark the hw queue as needing a restart in
   that case." in blk-mq-sched.c)

5. Eventually request B is completed/freed and blk_mq_sched_restart() is
   called, but by chance hctx from queue B is chosen for restart and request C
   gets a chance to be dispatched.

6. Eventually request C is completed/freed and blk_mq_sched_restart() is
   called, but shared_hctx_restart for queue B is zero and we return without
   attempt to restart hctx from queue A, thus request A is stuck forever.

But stalling queue is not the only one problem with blk_mq_sched_restart().
My tests show that those loops thru all queues and hctxs can be very costly,
even with shared_hctx_restart counter, which aims to fix performance issue.
For my tests I create 128 devices with 64 hctx each, which share same tags
set.

Hi Roman,

I also find the performance issue with RESTART for TAG_SHARED.

But from my analysis, RESTART isn't needed for TAG_SHARED
because SCSI-MQ has handled the RESTART by itself already, so
could you test the patch in following link posted days ago to
see if it fixes your issue?

I can say without any testing: it fixes all the issues :) You've
reverted

   8e8320c9315c ("blk-mq: fix performance regression with shared tags")
   6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")

with one major difference: you do not handle shared tags in a special
way and restart only requested hctx, instead of iterating over all hctxs
in a queue.

Hi Roman,

Yeah, that is unnecessary as I explained in detail in the commit log and
introduces lots of overhead, and I can see IOPS is improved by 20%~30% in
my SCSI_DEBUG test(only 8 luns) by removing the blk-mq restart for TAG_SHARED.

Also it isn't needed for BLK_STS_RESOURCE returned from .get_budget().

I will post out V3 soon.

Firstly I have to say that queue stalling issue(#1) and performance
issue(#2) were observed on our in-house rdma driver IBNBD:
https://lwn.net/Articles/718181/ and I've never tried to reproduce
them on SCSI-MQ.

OK, got it, then you can handle it in SCSI-MQ's way since this way
is used by non-MQ for long time. Or you need to do nothing if your
driver is SCSI based.

Then your patch brakes RR restarts, which were introduced by this commit
6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared").
Seems basic idea of that patch is nice, but because of possible big
amount of queues and hctxs patch requires reimplementation.  Eventually
you should get fast hctx restart but in RR fashion. According to my
understanding that does not contradict with your patch.

Firstly this way has been run/verified for long time in non-mq, and no one
complained it before.

Secondly please see scsi_target_queue_ready() and scsi_host_queue_ready(),
the sdev is added into the 'starved_list' in FIFO style, which is still fair.

So I don't think it is an issue to remove the RR restarts.

Thanks,
Ming

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help