Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart

From: Roman Penyaev <hidden>
Date: 2017-10-20 09:39:48
Also in: lkml

Hi Bart,

On Thu, Oct 19, 2017 at 7:47 PM, Bart Van Assche [off-list ref] wrote:

On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:

the patch below fixes queue stalling when a shared hctx is marked for restart
(BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero.  The
root cause is that hctxs are shared between queues, but 'shared_hctx_restart'
belongs to a particular queue, which in fact may not need to be restarted,
thus we return from blk_mq_sched_restart() and leave the shared hctx of another
queue never restarted.

The fix is to make the shared_hctx_restart counter belong not to the queue, but
to the tags, so that the counter reflects the real number of shared hctxs that
need to be restarted.
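
For illustration only, the stall described above can be modelled in a few
lines of userspace C.  This is a toy sketch, not the kernel code and not the
patch itself: every toy_* name is hypothetical, each queue is assumed to have
a single hctx, and a plain int stands in for the atomic counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code: all toy_* names are hypothetical. */
struct toy_tag_set {
	int restart_count;	/* proposed placement: one counter per tag set */
};

struct toy_queue {
	struct toy_tag_set *set;
	int restart_count;	/* current placement: one counter per queue */
	bool hctx_marked;	/* stands in for BLK_MQ_S_SCHED_RESTART */
	bool was_run;		/* set once the hctx has been restarted */
};

/* Mark a queue's single hctx as needing a restart; bump both counters. */
static void toy_mark_restart(struct toy_queue *q)
{
	if (!q->hctx_marked) {
		q->hctx_marked = true;
		q->restart_count++;
		q->set->restart_count++;
	}
}

/* Restart the first marked hctx among the queues sharing the tag set. */
static bool toy_restart_one(struct toy_queue *qs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (qs[i].hctx_marked) {
			qs[i].hctx_marked = false;
			qs[i].restart_count--;
			qs[i].set->restart_count--;
			qs[i].was_run = true;
			return true;
		}
	}
	return false;
}

/* Early exit keyed on the completing queue's own counter. */
static bool toy_sched_restart_per_queue(struct toy_queue *done,
					struct toy_queue *qs, size_t n)
{
	if (done->restart_count == 0)
		return false;	/* a marked hctx of another queue is skipped */
	return toy_restart_one(qs, n);
}

/* Early exit keyed on the counter kept in the shared tag set. */
static bool toy_sched_restart_per_tags(struct toy_queue *done,
				       struct toy_queue *qs, size_t n)
{
	if (done->set->restart_count == 0)
		return false;
	return toy_restart_one(qs, n);
}

/* Queue B gets marked, then a request completes on queue A. */
static bool toy_scenario(bool per_tags)
{
	struct toy_tag_set set = { 0 };
	struct toy_queue qs[2] = { { .set = &set }, { .set = &set } };

	toy_mark_restart(&qs[1]);
	if (per_tags)
		toy_sched_restart_per_tags(&qs[0], qs, 2);
	else
		toy_sched_restart_per_queue(&qs[0], qs, 2);
	assert(set.restart_count >= 0);
	return qs[1].was_run;	/* was queue B's hctx restarted? */
}
```

In this model toy_scenario(false) leaves queue B's hctx marked, because the
check on queue A's own counter returns early, while toy_scenario(true)
restarts it.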

Hello Roman,

The patch you posted looks fine to me but seeing this patch and the patch
description makes me wonder why this had not been noticed before.

This is a good question, which I cannot answer.  I tried to simulate the
same behaviour (completion timings, completion pinning, number of submission
queues, shared tags, etc.) on null block, but what I see is that
*_sched_restart() never observes a non-zero 'shared_hctx_restart', literally
never (I added a counter on the path where we start looking for an hctx to
restart, and the counter stays 0).

That made me nervous, and then I gave up.  At some point I want to return to
this and try to reproduce the problem on something else, say nvme.

Are you perhaps using a block driver that returns BLK_STS_RESOURCE more
often than other block drivers? Did you perhaps run into this with the
Infiniband network block device (IBNBD) driver?

Yep, this is IBNBD, but in these tests I used an mq scheduler, shared tags
and 1 hctx for each queue (blk device), thus I never ran out of internal tags
and never returned BLK_STS_RESOURCE.

Indeed, unmodified IBNBD does its own internal tag management.  This was
needed because each queue (block device) was created with the number of hctxs
(nr_hw_queues) equal to the number of CPUs on the system, but a blk-mq tag set
is shared only between hctxs, not globally, which led to the need to return
BLK_STS_RESOURCE and to restart queues.

But with an mq scheduler the situation changed: 1 hctx with shared tags can be
specified for all the hundreds of devices without any performance impact.

Testing this configuration (1 hctx, shared tags, mq-deadline) immediately
shows these two problems: request stalling and slow loops inside
blk_mq_sched_restart().

No matter what driver triggered this, I think this bug should be fixed.

Yes, the queue stalling can be easily fixed.  I can resend the current patch
with a shorter description which targets only this particular bug, if no one
else has objections, comments, etc.

But what bothers me are these looong loops inside blk_mq_sched_restart(),
and since you are the author of the original 6d8c6c0f97ad ("blk-mq: Restart
a single queue if tag sets are shared"), I want to ask: what was the original
problem you were trying to fix?  Likely I am missing some test scenario
which would be great to know about.

--
Roman
