Re: [PATCH 04/14] blk-mq-sched: improve dispatching from sw queue
From: Ming Lei <hidden>
Date: 2017-08-01 10:17:29
Also in:
linux-scsi
On Mon, Jul 31, 2017 at 11:34:35PM +0000, Bart Van Assche wrote:
> On Tue, 2017-08-01 at 00:51 +0800, Ming Lei wrote:
> > SCSI devices use host-wide tagset, and the shared driver tag space is
> > often quite big. Meantime there is also queue depth for each
> > lun(.cmd_per_lun), which is often small.
> >
> > So lots of requests may stay in sw queue, and we always flush all
> > belonging to same hw queue and dispatch them all to driver,
> > unfortunately it is easy to cause queue busy because of the small
> > per-lun queue depth. Once these requests are flushed out, they have to
> > stay in hctx->dispatch, and no bio merge can participate into these
> > requests, and sequential IO performance is hurt.
> >
> > This patch improves dispatching from sw queue when there is
> > per-request-queue queue depth by taking request one by one from sw
> > queue, just like the way of IO scheduler.
> >
> > Signed-off-by: Ming Lei <redacted>
> > ---
> >  block/blk-mq-sched.c | 25 +++++++++++++++----------
> >  1 file changed, 15 insertions(+), 10 deletions(-)
> >
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index 47a25333a136..3510c01cb17b 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -96,6 +96,9 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  	const bool has_sched_dispatch = e && e->type->ops.mq.dispatch_request;
> >  	bool can_go = true;
> >  	LIST_HEAD(rq_list);
> > +	struct request *(*dispatch_fn)(struct blk_mq_hw_ctx *) =
> > +		has_sched_dispatch ? e->type->ops.mq.dispatch_request :
> > +			blk_mq_dispatch_rq_from_ctxs;
> >
> >  	/* RCU or SRCU read lock is needed before checking quiesced flag */
> >  	if (unlikely(blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)))
> > @@ -126,26 +129,28 @@ void blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
> >  	if (!list_empty(&rq_list)) {
> >  		blk_mq_sched_mark_restart_hctx(hctx);
> >  		can_go = blk_mq_dispatch_rq_list(q, &rq_list);
> > -	} else if (!has_sched_dispatch) {
> > +	} else if (!has_sched_dispatch && !q->queue_depth) {
> >  		blk_mq_flush_busy_ctxs(hctx, &rq_list);
> >  		blk_mq_dispatch_rq_list(q, &rq_list);
> > +		can_go = false;
> >  	}
> >
> > +	if (!can_go)
> > +		return;
> > +
> >  	/*
> >  	 * We want to dispatch from the scheduler if we had no work left
> >  	 * on the dispatch list, OR if we did have work but weren't able
> >  	 * to make progress.
> >  	 */
> > -	if (can_go && has_sched_dispatch) {
> > -		do {
> > -			struct request *rq;
> > +	do {
> > +		struct request *rq;
> >
> > -			rq = e->type->ops.mq.dispatch_request(hctx);
> > -			if (!rq)
> > -				break;
> > -			list_add(&rq->queuelist, &rq_list);
> > -		} while (blk_mq_dispatch_rq_list(q, &rq_list));
> > -	}
> > +		rq = dispatch_fn(hctx);
> > +		if (!rq)
> > +			break;
> > +		list_add(&rq->queuelist, &rq_list);
> > +	} while (blk_mq_dispatch_rq_list(q, &rq_list));
> >  }
>
> Hello Ming,
>
> Although I like the idea behind this patch, I'm afraid that this patch
> will cause a performance regression for high-performance SCSI LLD
> drivers, e.g. ib_srp. Have you considered reworking this patch as
> follows:
> * Remove the code under "else if (!has_sched_dispatch && !q->queue_depth) {".

This would affect devices such as NVMe, on which queue busy is basically
never triggered, so it is better not to do this.

> * Modify all blk_mq_dispatch_rq_list() functions such that these dispatch
>   up to cmd_per_lun - (number of requests in progress) at once.

How can we get an accurate 'number of requests in progress' efficiently?

Also, we already dispatch this way (one request at a time) for the blk-mq
scheduler, so it shouldn't be a problem.

In my test data for mq-deadline on lpfc, the performance is good; please
see the cover letter.

Thanks,
Ming
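
The trade-off discussed above can be illustrated with a small userspace
model. This is only a sketch, not kernel code: struct toy_dev,
toy_queue_rq(), dispatch_flush_all() and dispatch_one_by_one() are
hypothetical names invented for the example, requests are reduced to
plain counters, and the device is assumed to accept requests until a
fixed number of free slots (standing in for the per-LUN queue depth) is
used up.

```c
/* Toy device: accepts requests until its free slots are exhausted. */
struct toy_dev {
	int free_slots;		/* stands in for the remaining per-LUN depth */
};

/* Returns 1 if the device accepted one request, 0 if it is busy. */
static int toy_queue_rq(struct toy_dev *dev)
{
	if (dev->free_slots == 0)
		return 0;
	dev->free_slots--;
	return 1;
}

/*
 * Old behaviour (modelled): pull every request out of the software
 * queue first, then try to issue them. Whatever the device rejects is
 * stranded on the dispatch list, where it can no longer be merged.
 * Returns the number of stranded requests.
 */
int dispatch_flush_all(struct toy_dev *dev, int *sw_queue)
{
	int pulled = *sw_queue;

	*sw_queue = 0;
	while (pulled > 0 && toy_queue_rq(dev))
		pulled--;
	return pulled;
}

/*
 * Patched behaviour (modelled): pull one request at a time and stop at
 * the first busy response. At most one request is stranded; the rest
 * stay in the software queue. Returns the number of stranded requests.
 */
int dispatch_one_by_one(struct toy_dev *dev, int *sw_queue)
{
	while (*sw_queue > 0) {
		(*sw_queue)--;
		if (!toy_queue_rq(dev))
			return 1;
	}
	return 0;
}
```

With 8 queued requests and 3 free slots, the flush-all model strands 5
requests and leaves the software queue empty, while the one-by-one model
strands 1 request and leaves 4 in the software queue, where they are
still candidates for bio merging.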