Re: [PATCH V7 4/6] blk-mq: introduce .get_budget and .put_budget in blk_mq_ops

From: Jens Axboe <axboe@kernel.dk>
Date: 2017-10-13 16:28:06
Also in: linux-scsi, lkml
Subsystem: block layer, the rest · Maintainers: Jens Axboe, Linus Torvalds

On 10/13/2017 10:21 AM, Ming Lei wrote:

On Fri, Oct 13, 2017 at 10:19:04AM -0600, Jens Axboe wrote:

quoted

On 10/13/2017 10:07 AM, Ming Lei wrote:

quoted

On Fri, Oct 13, 2017 at 08:44:23AM -0600, Jens Axboe wrote:

quoted

On 10/12/2017 06:19 PM, Ming Lei wrote:

quoted

On Thu, Oct 12, 2017 at 12:46:24PM -0600, Jens Axboe wrote:

quoted

On 10/12/2017 12:37 PM, Ming Lei wrote:

quoted

For SCSI devices, there is often per-request-queue depth, which need
to be respected before queuing one request.

The current blk-mq always dequeues one request first, then calls .queue_rq()
to dispatch the request to lld. One obvious issue of this way is that I/O
merge may not be good, because when the per-request-queue depth can't be
respected,  .queue_rq() has to return BLK_STS_RESOURCE, then this request
has to staty in hctx->dispatch list, and never got chance to participate
into I/O merge.

This patch introduces .get_budget and .put_budget callback in blk_mq_ops,
then we can try to get reserved budget first before dequeuing request.
Once we can't get budget for queueing I/O, we don't need to dequeue request
at all, then I/O merge can get improved a lot.

I can't help but think that it would be cleaner to just be able to
reinsert the request into the scheduler properly, if we fail to
dispatch it. Bart hinted at that earlier as well.

Actually when I start to investigate the issue, the 1st thing I tried
is to reinsert, but that way is even worse on qla2xxx.

Once request is dequeued, the IO merge chance is decreased a lot.
With none scheduler, it becomes not possible to merge because
we only try to merge over the last 8 requests. With mq-deadline,
when one request is reinserted, another request may be dequeued
at the same time.

I don't care too much about 'none'. If perfect merging is crucial for
getting to the performance level you want on the hardware you are using,
you should not be using 'none'. 'none' will work perfectly fine for NVMe
etc style devices, where we are not dependent on merging to the same
extent that we are on other devices.

mq-deadline reinsertion will be expensive, that's in the nature of that
beast. It's basically the same as a normal request inserition.  So for
that, we'd have to be a bit careful not to run into this too much. Even
with a dumb approach, it should only happen 1 out of N times, where N is
the typical point at which the device will return STS_RESOURCE. The
reinsertion vs dequeue should be serialized with your patch to do that,
at least for the single queue mq-deadline setup. In fact, I think your
approach suffers from that same basic race, in that the budget isn't a
hard allocation, it's just a hint. It can change from the time you check
it, and when you go and dispatch the IO, if you don't serialize that
part. So really should be no different in that regard.

In case of SCSI, the .get_buget is done as atomic counting,
and it is completely effective to avoid unnecessary dequeue, please take
a look at patch 6.

Looks like you are right, I had initially misread that as just checking
the busy count. But you are actually getting the count at that point,
so it should be solid.

quoted

Not mention the cost of acquiring/releasing lock, that work
is just doing useless work and wasting CPU.

Sure, my point is that if it doesn't happen too often, it doesn't really
matter. It's not THAT expensive.

Actually it is in hot path, for example, lpfc and qla2xx's queue depth is 3,
it is quite easy to trigger STS_RESOURCE.

Ugh, that is low.

OK, I think we should just roll with this and see how far we can go. I'll
apply it for 4.15.

OK, I have some update, will post a new version soon.

Fold something like this into it then:

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index ccbbc7e108ea..b7bf84b5ddf2 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c

@@ -102,13 +102,12 @@ static bool blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
 				!e->type->ops.mq.has_work(hctx))
 			break;
 
-		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
+		if (!blk_mq_get_dispatch_budget(hctx))
 			return true;
 
 		rq = e->type->ops.mq.dispatch_request(hctx);
 		if (!rq) {
-			if (q->mq_ops->put_budget)
-				q->mq_ops->put_budget(hctx);
+			blk_mq_put_dispatch_budget(hctx, true);
 			break;
 		}
 		list_add(&rq->queuelist, &rq_list);

@@ -140,13 +139,12 @@ static bool blk_mq_do_dispatch_ctx(struct blk_mq_hw_ctx *hctx)
 		if (!sbitmap_any_bit_set(&hctx->ctx_map))
 			break;
 
-		if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx))
+		if (!blk_mq_get_dispatch_budget(hctx))
 			return true;
 
 		rq = blk_mq_dequeue_from_ctx(hctx, ctx);
 		if (!rq) {
-			if (q->mq_ops->put_budget)
-				q->mq_ops->put_budget(hctx);
+			blk_mq_put_dispatch_budget(hctx, true);
 			break;
 		}
 		list_add(&rq->queuelist, &rq_list);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 255a705f8672..008c975b6f4b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c

@@ -1087,14 +1087,6 @@ static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 	return true;
 }
 
-static void blk_mq_put_budget(struct blk_mq_hw_ctx *hctx, bool got_budget)
-{
-	struct request_queue *q = hctx->queue;
-
-	if (q->mq_ops->put_budget && got_budget)
-		q->mq_ops->put_budget(hctx);
-}
-
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 		bool got_budget)
 {

@@ -1125,7 +1117,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			 * rerun the hardware queue when a tag is freed.
 			 */
 			if (!blk_mq_dispatch_wait_add(hctx)) {
-				blk_mq_put_budget(hctx, got_budget);
+				blk_mq_put_dispatch_budget(hctx, got_budget);
 				break;
 			}

@@ -1135,16 +1127,13 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
 			 * hardware queue to the wait queue.
 			 */
 			if (!blk_mq_get_driver_tag(rq, &hctx, false)) {
-				blk_mq_put_budget(hctx, got_budget);
+				blk_mq_put_dispatch_budget(hctx, got_budget);
 				break;
 			}
 		}
 
-		if (!got_budget) {
-			if (q->mq_ops->get_budget &&
-					!q->mq_ops->get_budget(hctx))
-				break;
-		}
+		if (!got_budget && !blk_mq_get_dispatch_budget(hctx))
+			break;
 
 		list_del_init(&rq->queuelist);

@@ -1642,7 +1631,7 @@ static void __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
 	if (!blk_mq_get_driver_tag(rq, NULL, false))
 		goto insert;
 
-	if (q->mq_ops->get_budget && !q->mq_ops->get_budget(hctx)) {
+	if (!blk_mq_get_dispatch_budget(hctx)) {
 		blk_mq_put_driver_tag(rq);
 		goto insert;
 	}

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 4d12ef08b0a9..9a1426e8b6e5 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h

@@ -139,4 +139,23 @@ static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
 			unsigned int inflight[2]);
 
+static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
+{
+	struct request_queue *q = hctx->queue;
+
+	if (!q->mq_ops->get_budget)
+		return true;
+
+	return q->mq_ops->get_budget(hctx);
+}
+
+static inline void blk_mq_put_dispatch_budget(struct blk_mq_hw_ctx *hctx,
+					      bool got_budget)
+{
+	struct request_queue *q = hctx->queue;
+
+	if (got_budget && q->mq_ops->put_budget)
+		q->mq_ops->put_budget(hctx);
+}
+
 #endif

-- 
Jens Axboe

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help