Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an... | linux-block

Re: [PATCH RFC 01/14] block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler

From: Bart Van Assche <hidden>
Date: 2017-03-06 19:44:53
Also in: lkml

On 03/04/2017 08:01 AM, Paolo Valente wrote:=0A=

BFQ is a proportional-share I/O scheduler, whose general structure,=0A=
plus a lot of code, are borrowed from CFQ.=0A=
[ ... ]=0A=

=0A=
This description is very useful. However, since it is identical to the=0A=
description this patch adds to Documentation/block/bfq-iosched.txt I=0A=
propose to leave it out from the patch description.=0A=
=0A=
What seems missing to me is an overview of the limitations of BFQ. Does=0A=
BFQ e.g. support multiple hardware queues?=0A=
=0A=

+3. What are BFQ's tunable?=0A=
+=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=

=3D=3D=0A=

+[ ... ]=0A=

=0A=
A thorough knowledge of BFQ is required to tune it properly. Users don't=0A=
want to tune I/O schedulers. Has it been considered to invent algorithms=0A=
to tune these parameters automatically?=0A=
=0A=

+ * Licensed under GPL-2.=0A=

=0A=
The COPYING file at the top of the tree mentions that GPL-v2 licensing=0A=
should be specified as follows close to the start of each source file:=0A=
=0A=
    This program is free software; you can redistribute it and/or modify=0A=
    it under the terms of the GNU General Public License as published by=0A=
    the Free Software Foundation; either version 2 of the License, or=0A=
    (at your option) any later version.=0A=
=0A=
    This program is distributed in the hope that it will be useful,=0A=
    but WITHOUT ANY WARRANTY; without even the implied warranty of=0A=
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the=0A=
    GNU General Public License for more details.=0A=
=0A=
    You should have received a copy of the GNU General Public License=0A=
    along with this program; if not, write to the Free Software=0A=
    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA=0A=
    02110-1301 USA=0A=
=0A=

+ * BFQ is a proportional-share I/O scheduler, with some extra=0A=
+ * low-latency capabilities. BFQ also supports full hierarchical=0A=
+ * scheduling through cgroups. Next paragraphs provide an introduction=

=0A=

+ * on BFQ inner workings. Details on BFQ benefits and usage can be=0A=
+ * found in Documentation/block/bfq-iosched.txt.=0A=

=0A=
That reference should be sufficient - please do not duplicate=0A=
Documentation/block/bfq-iosched.txt in block/bfq-iosched.c.=0A=
=0A=

+/**=0A=
+ * struct bfq_service_tree - per ioprio_class service tree.=0A=
+ *=0A=
+ * Each service tree represents a B-WF2Q+ scheduler on its own.  Each=0A=
+ * ioprio_class has its own independent scheduler, and so its own=0A=
+ * bfq_service_tree.  All the fields are protected by the queue lock=0A=
+ * of the containing bfqd.=0A=
+ */=0A=
+struct bfq_service_tree {=0A=
+	/* tree for active entities (i.e., those backlogged) */=0A=
+	struct rb_root active;=0A=
+	/* tree for idle entities (i.e., not backlogged, with V <=3D F_i)*/=0A=
+	struct rb_root idle;=0A=
+=0A=
+	struct bfq_entity *first_idle;	/* idle entity with minimum F_i */=0A=
+	struct bfq_entity *last_idle;	/* idle entity with maximum F_i */=0A=
+=0A=
+	u64 vtime; /* scheduler virtual time */=0A=
+	/* scheduler weight sum; active and idle entities contribute to it */=

=0A=

+	unsigned long wsum;=0A=
+};=0A=

=0A=
Inline comments next to structure members are ugly and make the=0A=
structure definition hard to read. Please follow the instructions in=0A=
Documentation/kernel-doc-nano-HOWTO.txt for documenting structure members.=
=0A=
=0A=

+	u64 finish; /* B-WF2Q+ finish timestamp (aka F_i) */=0A=
+	u64 start;  /* B-WF2Q+ start timestamp (aka S_i) */=0A=

=0A=
For all times and timestamps, please document the time unit (e.g. s, ms,=0A=
us, ns, jiffies, ...).=0A=
=0A=

+enum bfq_device_speed {=0A=
+	BFQ_BFQD_FAST,=0A=
+	BFQ_BFQD_SLOW,=0A=
+};=0A=

=0A=
What is the meaning of "fast" and "slow" devices in this context?=0A=
Anyway, since the first patch that uses this enum is patch 6, please=0A=
defer introduction of this enum until patch 6.=0A=
=0A=

+=0A=
+/**=0A=
+ * struct bfq_data - per-device data structure.=0A=
+ *=0A=
+ * All the fields are protected by @lock.=0A=
+ */=0A=
+struct bfq_data {=0A=
+	/* device request queue */=0A=
+	struct request_queue *queue;=0A=
+	[ ... ]=0A=
+=0A=
+	/* on-disk position of the last served request */=0A=
+	sector_t last_position;=0A=

=0A=
What is the relevance of last_position if there are multiple hardware=0A=
queues? Will the BFQ algorithm fail to realize its guarantees in that case?=
=0A=
=0A=
What is the relevance of this structure member for block devices that=0A=
have multiple spindles, e.g. arrays of hard disks?=0A=
=0A=

+enum bfqq_state_flags {=0A=
+	BFQ_BFQQ_FLAG_busy =3D 0,		/* has requests or is in service */=0A=
+	BFQ_BFQQ_FLAG_wait_request,	/* waiting for a request */=0A=
+	BFQ_BFQQ_FLAG_non_blocking_wait_rq, /*=0A=
+					     * waiting for a request=0A=
+					     * without idling the device=0A=
+					     */=0A=
+	BFQ_BFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */=0A=
+	BFQ_BFQQ_FLAG_idle_window,	/* slice idling enabled */=0A=
+	BFQ_BFQQ_FLAG_sync,		/* synchronous queue */=0A=
+	BFQ_BFQQ_FLAG_budget_new,	/* no completion with this budget */=0A=
+	BFQ_BFQQ_FLAG_IO_bound,		/*=0A=
+					 * bfqq has timed-out at least once=0A=
+					 * having consumed at most 2/10 of=0A=
+					 * its budget=0A=
+					 */=0A=
+};=0A=

=0A=
The "BFQ_BFQQ_FLAG_" prefix looks silly and too long to me. How about=0A=
e.g. using the prefix "BFQQF_" instead?=0A=
=0A=

+#define BFQ_BFQQ_FNS(name)						\=0A=
+static void bfq_mark_bfqq_##name(struct bfq_queue *bfqq)		\=0A=
+{									\=0A=
+	(bfqq)->flags |=3D (1 << BFQ_BFQQ_FLAG_##name);			\=0A=
+}									\=0A=
+static void bfq_clear_bfqq_##name(struct bfq_queue *bfqq)		\=0A=
+{									\=0A=
+	(bfqq)->flags &=3D ~(1 << BFQ_BFQQ_FLAG_##name);			\=0A=
+}									\=0A=
+static int bfq_bfqq_##name(const struct bfq_queue *bfqq)		\=0A=
+{									\=0A=
+	return ((bfqq)->flags & (1 << BFQ_BFQQ_FLAG_##name)) !=3D 0;	\=0A=
+}=0A=

=0A=
Are the bodies of the above functions duplicates of __set_bit(),=0A=
__clear_bit() and test_bit()?=0A=
=0A=

+/* Expiration reasons. */=0A=
+enum bfqq_expiration {=0A=
+	BFQ_BFQQ_TOO_IDLE =3D 0,		/*=0A=
+					 * queue has been idling for=0A=
+					 * too long=0A=
+					 */=0A=
+	BFQ_BFQQ_BUDGET_TIMEOUT,	/* budget took too long to be used */=0A=
+	BFQ_BFQQ_BUDGET_EXHAUSTED,	/* budget consumed */=0A=
+	BFQ_BFQQ_NO_MORE_REQUESTS,	/* the queue has no more requests */=0A=
+	BFQ_BFQQ_PREEMPTED		/* preemption in progress */=0A=
+};=0A=

=0A=
The prefix of these constants refers twice to "BFQ" and does not make it=0A=
clear that these constants are about expiration. How about using the=0A=
"BFQQE_" prefix instead?=0A=
=0A=

+/* Maximum backwards seek, in KiB. */=0A=
+static const int bfq_back_max =3D 16 * 1024;=0A=

=0A=
Where does this constant come from? Should it depend on geometry data=0A=
like e.g. the number of sectors in a cylinder?=0A=
=0A=

+#define for_each_entity(entity)	\=0A=
+	for (; entity ; entity =3D NULL)=0A=

=0A=
Why has this confusing #define been introduced? Shouldn't all=0A=
occurrences of this macro be changed into the equivalent "if (entity)"?=0A=
We don't want silly macros like this in the Linux kernel.=0A=
=0A=

+#define for_each_entity_safe(entity, parent) \=0A=
+	for (parent =3D NULL; entity ; entity =3D parent)=0A=

=0A=
Same question here - why has this macro been introduced and how has its=0A=
name been chosen? Since this macro is used only once and since no value=0A=
is assigned to 'parent' in the code controlled by this construct, please=0A=
remove this macro and use something that is less confusing than a "for"=0A=
loop for something that is not a loop.=0A=
=0A=

+/**=0A=
+ * bfq_weight_to_ioprio - calc an ioprio from a weight.=0A=
+ * @weight: the weight value to convert.=0A=
+ *=0A=
+ * To preserve as much as possible the old only-ioprio user interface,=

=0A=

+ * 0 is used as an escape ioprio value for weights (numerically) equal o=

r=0A=

+ * larger than IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF.=0A=
+ */=0A=
+static unsigned short bfq_weight_to_ioprio(int weight)=0A=
+{=0A=
+	return IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight < 0 ?=0A=
+		0 : IOPRIO_BE_NR * BFQ_WEIGHT_CONVERSION_COEFF - weight;=0A=
+}=0A=

=0A=
Please consider using max() or max_t() to make this function less verbose.=
=0A=
=0A=

+=0A=
+/**=0A=
+ * bfq_active_extract - remove an entity from the active tree.=0A=
+ * @st: the service_tree containing the tree.=0A=
+ * @entity: the entity being removed.=0A=
+ */=0A=
+static void bfq_active_extract(struct bfq_service_tree *st,=0A=
+			       struct bfq_entity *entity)=0A=
+{=0A=
+	struct bfq_queue *bfqq =3D bfq_entity_to_bfqq(entity);=0A=
+	struct rb_node *node;=0A=
+=0A=
+	node =3D bfq_find_deepest(&entity->rb_node);=0A=
+	bfq_extract(&st->active, entity);=0A=
+=0A=
+	if (node)=0A=
+		bfq_update_active_tree(node);=0A=
+=0A=
+	if (bfqq)=0A=
+		list_del(&bfqq->bfqq_list);=0A=
+}=0A=

=0A=
Which locks protect the data structures manipulated by this and other=0A=
functions? Have you considered to use lockdep_assert_held() to document=0A=
these assumptions?=0A=
=0A=

+	case (BFQ_RQ1_WRAP|BFQ_RQ2_WRAP): /* both rqs wrapped */=0A=

=0A=
Please don't use parentheses if no confusion is possible. Additionally,=0A=
checkpatch should have requested you to insert a space before and after=0A=
the logical or operator.=0A=
=0A=

+static void __bfq_set_in_service_queue(struct bfq_data *bfqd,=0A=
+				       struct bfq_queue *bfqq)=0A=
+{=0A=
+	if (bfqq) {=0A=
+		bfq_mark_bfqq_budget_new(bfqq);=0A=
+		bfq_clear_bfqq_fifo_expire(bfqq);=0A=
+=0A=
+		bfqd->budgets_assigned =3D (bfqd->budgets_assigned*7 + 256) / 8;=0A=

=0A=
Checkpatch should have asked you to insert spaces around the=0A=
multiplication operator.=0A=
=0A=

+/*=0A=
+ * bfq_default_budget - return the default budget for @bfqq on @bfqd.=0A=
+ * @bfqd: the device descriptor.=0A=
+ * @bfqq: the queue to consider.=0A=
+ *=0A=
+ * We use 3/4 of the @bfqd maximum budget as the default value=0A=
+ * for the max_budget field of the queues.  This lets the feedback=0A=
+ * mechanism to start from some middle ground, then the behavior=0A=
+ * of the process will drive the heuristics towards high values, if=0A=
+ * it behaves as a greedy sequential reader, or towards small values=0A=
+ * if it shows a more intermittent behavior.=0A=
+ */=0A=
+static unsigned long bfq_default_budget(struct bfq_data *bfqd,=0A=
+					struct bfq_queue *bfqq)=0A=
+{=0A=
+	unsigned long budget;=0A=
+=0A=
+	/*=0A=
+	 * When we need an estimate of the peak rate we need to avoid=0A=
+	 * to give budgets that are too short due to previous measurements.=0A=
+	 * So, in the first 10 assignments use a ``safe'' budget value.=0A=
+	 */=0A=
+	if (bfqd->budgets_assigned < 194 && bfqd->bfq_user_max_budget =3D=3D 0)=

=0A=

+		budget =3D bfq_default_max_budget;=0A=
+	else=0A=
+		budget =3D bfqd->bfq_max_budget;=0A=
+=0A=
+	return budget - budget / 4;=0A=
+}=0A=

=0A=
Where does the magic constant "194" come from?=0A=
=0A=
=0A=

+	} else=0A=
+		/*=0A=
+		 * Async queues get always the maximum possible=0A=
+		 * budget, as for them we do not care about latency=0A=
+		 * (in addition, their ability to dispatch is limited=0A=
+		 * by the charging factor).=0A=
+		 */=0A=
+		budget =3D bfqd->bfq_max_budget;=0A=
+=0A=

=0A=
Please balance braces. Checkpatch should have warned about the use of "}=0A=
else" instead of "} else {".=0A=
=0A=

+static unsigned long bfq_calc_max_budget(u64 peak_rate, u64 timeout)=0A=
+{=0A=
+	unsigned long max_budget;=0A=
+=0A=
+	/*=0A=
+	 * The max_budget calculated when autotuning is equal to the=0A=
+	 * amount of sectors transferred in timeout at the=0A=
+	 * estimated peak rate.=0A=
+	 */=0A=
+	max_budget =3D (unsigned long)(peak_rate * 1000 *=0A=
+				     timeout >> BFQ_RATE_SHIFT);=0A=
+=0A=
+	return max_budget;=0A=
+}=0A=

=0A=
Where does the constant 1000 come from? What are the units of peak_rate=0A=
and timeout? What is the maximum value of peak_rate? Can the=0A=
multiplication overflow?=0A=
=0A=

+/*=0A=
+ * In addition to updating the peak rate, checks whether the process=0A=
+ * is "slow", and returns 1 if so. This slow flag is used, in addition=

=0A=

+ * to the budget timeout, to reduce the amount of service provided to=0A=
+ * seeky processes, and hence reduce their chances to lower the=0A=
+ * throughput. See the code for more details.=0A=
+ */=0A=
+static bool bfq_update_peak_rate(struct bfq_data *bfqd, struct bfq_queue=

 *bfqq,=0A=

+				 bool compensate)=0A=
+{=0A=
+	u64 bw, usecs, expected, timeout;=0A=
+	ktime_t delta;=0A=
+	int update =3D 0;=0A=
+=0A=
+	if (!bfq_bfqq_sync(bfqq) || bfq_bfqq_budget_new(bfqq))=0A=
+		return false;=0A=
+=0A=
+	if (compensate)=0A=
+		delta =3D bfqd->last_idling_start;=0A=
+	else=0A=
+		delta =3D ktime_get();=0A=
+	delta =3D ktime_sub(delta, bfqd->last_budget_start);=0A=
+	usecs =3D ktime_to_us(delta);=0A=
+=0A=
+	/* Don't trust short/unrealistic values. */=0A=
+	if (usecs < 100 || usecs >=3D LONG_MAX)=0A=
+		return false;=0A=

=0A=
If usecs >=3D LONG_MAX that indicates a kernel bug. Please consider=0A=
triggering a kernel warning in that case.=0A=
=0A=

+/*=0A=
+ * Budget timeout is not implemented through a dedicated timer, but=0A=
+ * just checked on request arrivals and completions, as well as on=0A=
+ * idle timer expirations.=0A=
+ */=0A=
+static bool bfq_bfqq_budget_timeout(struct bfq_queue *bfqq)=0A=
+{=0A=
+	if (bfq_bfqq_budget_new(bfqq) ||=0A=
+	    time_before(jiffies, bfqq->budget_timeout))=0A=
+		return false;=0A=
+	return true;=0A=
+}=0A=

=0A=
Have you considered to use time_is_after_jiffies() instead of=0A=
time_before(jiffies, ...)?=0A=
=0A=

+static void bfq_init_bfqq(struct bfq_data *bfqd, struct bfq_queue *bfqq,=

=0A=

+			  struct bfq_io_cq *bic, pid_t pid, int is_sync)=0A=
+{=0A=
+	RB_CLEAR_NODE(&bfqq->entity.rb_node);=0A=
+	INIT_LIST_HEAD(&bfqq->fifo);=0A=
+=0A=
+	bfqq->ref =3D 0;=0A=
+	bfqq->bfqd =3D bfqd;=0A=
+=0A=
+	if (bic)=0A=
+		bfq_set_next_ioprio_data(bfqq, bic);=0A=
+=0A=
+	if (is_sync) {=0A=
+		if (!bfq_class_idle(bfqq))=0A=
+			bfq_mark_bfqq_idle_window(bfqq);=0A=
+		bfq_mark_bfqq_sync(bfqq);=0A=
+	} else=0A=
+		bfq_clear_bfqq_sync(bfqq);=0A=
+=0A=
+	bfqq->ttime.last_end_request =3D ktime_get_ns() - (1ULL<<32);=0A=
+=0A=
+	bfq_mark_bfqq_IO_bound(bfqq);=0A=
+=0A=
+	bfqq->pid =3D pid;=0A=
+=0A=
+	/* Tentative initial value to trade off between thr and lat */=0A=
+	bfqq->max_budget =3D bfq_default_budget(bfqd, bfqq);=0A=
+	bfqq->budget_timeout =3D bfq_smallest_from_now();=0A=
+	bfqq->pid =3D pid;=0A=
+=0A=
+	/* first request is almost certainly seeky */=0A=
+	bfqq->seek_history =3D 1;=0A=
+}=0A=

=0A=
What is the meaning of the 1ULL << 32 constant?=0A=
=0A=

+static int __init bfq_init(void)=0A=
+{=0A=
+	int ret;=0A=
+	char msg[50] =3D "BFQ I/O-scheduler: v0";=0A=

=0A=
Please leave out "[50]" and use "static const char" instead of "char".=0A=
=0A=

quoted hunk ↗ jump to hunk

diff --git a/block/elevator.c b/block/elevator.c=0A=
index 01139f5..786fdcd 100644=0A=
--- a/block/elevator.c=0A=
+++ b/block/elevator.c=0A=

@@ -221,14 +221,20 @@ int elevator_init(struct request_queue *q, char *na=

me)=0A=

 =0A=
 	if (!e) {=0A=
 		/*=0A=
-		 * For blk-mq devices, we default to using mq-deadline,=0A=
-		 * if available, for single queue devices. If deadline=0A=
-		 * isn't available OR we have multiple queues, default=0A=
-		 * to "none".=0A=
+		 * For blk-mq devices, we default to using bfq, if=0A=
+		 * available, for single queue devices. If bfq isn't=0A=
+		 * available, we try mq-deadline. If neither is=0A=
+		 * available, OR we have multiple queues, default to=0A=
+		 * "none".=0A=
 		 */=0A=
 		if (q->mq_ops) {=0A=
+			if (q->nr_hw_queues =3D=3D 1) {=0A=
+				e =3D elevator_get("bfq", false);=0A=
+				if (!e)=0A=
+					e =3D elevator_get("mq-deadline", false);=0A=
+			}=0A=
 			if (q->nr_hw_queues =3D=3D 1)=0A=
-				e =3D elevator_get("mq-deadline", false);=0A=
+				e =3D elevator_get("bfq", false);=0A=
 			if (!e)=0A=
 				return 0;=0A=
 		} else=0A=
=0A=

=0A=
As Jens wrote, it's way too early to make BFQ the default scheduler.=0A=
=0A=
Bart.=0A=

`h`	back out one level
`j`	next message in thread
`k`	previous message in thread
`l`	drill in
`Esc`	close help / fold thread tree
`?`	toggle this help