Re: [PATCH 1/1] [RFC] blk-mq: fix queue stalling on shared hctx restart
From: Roman Penyaev <hidden>
Date: 2017-10-20 09:39:48
Also in:
lkml
Hi Bart, On Thu, Oct 19, 2017 at 7:47 PM, Bart Van Assche [off-list ref] wrote:
On Wed, 2017-10-18 at 12:22 +0200, Roman Pen wrote:quoted
the patch below fixes queue stalling when shared hctx marked for restart (BLK_MQ_S_SCHED_RESTART bit) but q->shared_hctx_restart stays zero. The root cause is that hctxs are shared between queues, but 'shared_hctx_restart' belongs to the particular queue, which in fact may not need to be restarted, thus we return from blk_mq_sched_restart() and leave shared hctx of another queue never restarted. The fix is to make shared_hctx_restart counter belong not to the queue, but to tags, thereby counter will reflect real number of shared hctx needed to be restarted.Hello Roman, The patch you posted looks fine to me but seeing this patch and the patch description makes me wonder why this had not been noticed before.
This is a good question, which I could not answer. I tried to simulate the same behaviour (completion timings, completion pinning, number of submission queues, shared tags, etc) on null block. but what I see is that *_sched_restart() never observes 'shared_hctx_restart', literally never (I made a counter when we take a path and start looking for a hctx to restart, and a counter stays 0). That makes me nervous and then I gave up. After some time I want return to that and try to reproduce the problem on something else, say nvme.
Are you perhaps using a block driver that returns BLK_STS_RESOURCE more often than other block drivers? Did you perhaps run into this with the Infiniband network block device (IBNBD) driver?
Yep, this is IBNBD, but in these tests I tested with mq scheduler, shared tags and 1 hctx for each queue (blk device), thus I never run out of internal tags and never return BLK_STS_RESOURCE. Indeed, not modified IBNBD does internal tags management. This was needed because each queue (block device) was created with hctx number (nr_hw_queues) equal to number of cpus on the system, but blk-mq tags set is shared only between hctx, not globally, which led to need to return BLK_STS_RESOURCE and queues restarts. But, with mq scheduler situation changed: 1 hctx with shared tags can be specified for all hundreds of devices without any performance impact. Testing this configuration (1 hctx, shared tags, mq-deadline) immediately shows these two problems: request stalling and slow loops inside blk_mq_sched_restart().
No matter what driver triggered this, I think this bug should be fixed.
Yes, queue stalling can be easily fixed. I can resend current patch with
shorter description which targets only this particular bug, if no one else
has objections/comments etc.
But what bothers me is these looong loops inside blk_mq_sched_restart(),
and since you are the author of the original 6d8c6c0f97ad ("blk-mq: Restart
a single queue if tag sets are shared") I want to ask what was the original
problem which you attempted to fix? Likely I am missing some test scenario
which would be great to know about.
--
Roman