Thread (30 messages) 30 messages, 4 authors, 2017-07-04

Re: [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu

From: Jens Axboe <axboe@kernel.dk>
Date: 2017-06-29 18:42:01
Also in: dm-devel
Subsystem: block layer, device-mapper (lvm), the rest · Maintainers: Jens Axboe, Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Benjamin Marzinski, Linus Torvalds

On 06/29/2017 10:00 AM, Jens Axboe wrote:
On 06/29/2017 09:58 AM, Jens Axboe wrote:
quoted
On 06/29/2017 02:40 AM, Ming Lei wrote:
quoted
On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe [off-list ref] wrote:
quoted
On 06/28/2017 03:12 PM, Brian King wrote:
quoted
This patch converts the in_flight counter in struct hd_struct from a
pair of atomics to a pair of percpu counters. This eliminates a couple
of atomics from the hot path. When running this on a Power system, to
a single null_blk device with 80 submission queues, irq mode 0, with
80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s.
This has been done before, but I've never really liked it. The reason is
that it means that reading the part stat inflight count now has to
iterate over every possible CPU. Did you use partitions in your testing?
How many CPUs were configured? When I last tested this a few years ago
on even a quad core nehalem (which is notoriously shitty for cross-node
latencies), it was a net loss.
One year ago, I saw null_blk's IOPS can be decreased to 10%
of non-RQF_IO_STAT on a dual socket ARM64(each CPU has
96 cores, and dual numa nodes) too, the performance can be
recovered basically if per numa-node counter is introduced and
used in this case, but the patch was never posted out.
If anyone is interested in that, I can rebase the patch on current
block tree and post out. I guess the performance issue might be
related with system cache coherency implementation more or less.
This issue on ARM64 can be observed with the following userspace
atomic counting test too:

       http://kernel.ubuntu.com/~ming/test/cache/
How well did the per-node thing work? Doesn't seem to me like it would
go far enough. And per CPU is too much. One potential improvement would
be to change the part_stat_read() to just loop online CPUs, instead of
all possible CPUs. When CPUs go on/offline, use that as the slow path to
ensure the stats are sane. Often there's a huge difference between
NR_CPUS configured and what the system has. As Brian states, RH ships
with 2048, while I doubt a lot of customers actually run that...

Outside of coming up with a more clever data structure that is fully
CPU topology aware, one thing that could work is just having X cache
line separated read/write inflight counters per node, where X is some
suitable value (like 4). That prevents us from having cross node
traffic, and it also keeps the cross cpu traffic fairly low. That should
provide a nice balance between cost of incrementing the inflight
counting, and the cost of looping for reading it.

And that brings me to the next part...
quoted
quoted
I do agree that we should do something about it, and it's one of those
items I've highlighted in talks about blk-mq on pending issues to fix
up. It's just not great as it currently stands, but I don't think per
CPU counters is the right way to fix it, at least not for the inflight
counter.
Yeah, it won't be a issue for non-mq path, and for blk-mq path, maybe
we can use some blk-mq knowledge(tagset?) to figure out the
'in_flight' counter. I thought about it before, but never got a
perfect solution, and looks it is a bit hard, :-)
The tags are already a bit spread out, so it's worth a shot. That would
remove the need to do anything in the inc/dec path, as the tags already
do that. The inlight count could be easily retrieved with
sbitmap_weight(). The only issue here is that we need separate read and
write counters, and the weight would obviously only get us the total
count. But we can have a slower path for that, just iterate the tags and
count them. The fast path only cares about total count.

Let me try that out real quick.
Well, that only works for whole disk stats, of course... There's no way
around iterating the tags and checking for this to truly work.
Totally untested proof of concept for using the tags for this. I based
this on top of Brian's patch, so it includes his patch plus the
_double() stuff I did which is no longer really needed.

diff --git a/block/bio.c b/block/bio.c
index 9cf98b29588a..ec99d9ba0f33 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1737,7 +1737,7 @@ void generic_start_io_acct(int rw, unsigned long sectors,
 	part_round_stats(cpu, part);
 	part_stat_inc(cpu, part, ios[rw]);
 	part_stat_add(cpu, part, sectors[rw], sectors);
-	part_inc_in_flight(part, rw);
+	part_inc_in_flight(cpu, part, rw);
 
 	part_stat_unlock();
 }
@@ -1751,7 +1751,7 @@ void generic_end_io_acct(int rw, struct hd_struct *part,
 
 	part_stat_add(cpu, part, ticks[rw], duration);
 	part_round_stats(cpu, part);
-	part_dec_in_flight(part, rw);
+	part_dec_in_flight(cpu, part, rw);
 
 	part_stat_unlock();
 }
diff --git a/block/blk-core.c b/block/blk-core.c
index af393d5a9680..6ab2efbe940b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2434,8 +2434,13 @@ void blk_account_io_done(struct request *req)
 
 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
-		part_round_stats(cpu, part);
-		part_dec_in_flight(part, rw);
+
+		if (req->q->mq_ops)
+			part_round_stats_mq(req->q, cpu, part);
+		else {
+			part_round_stats(cpu, part);
+			part_dec_in_flight(cpu, part, rw);
+		}
 
 		hd_struct_put(part);
 		part_stat_unlock();
@@ -2492,8 +2497,12 @@ void blk_account_io_start(struct request *rq, bool new_io)
 			part = &rq->rq_disk->part0;
 			hd_struct_get(part);
 		}
-		part_round_stats(cpu, part);
-		part_inc_in_flight(part, rw);
+		if (rq->q->mq_ops)
+			part_round_stats_mq(rq->q, cpu, part);
+		else {
+			part_round_stats(cpu, part);
+			part_inc_in_flight(cpu, part, rw);
+		}
 		rq->part = part;
 	}
 
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 99038830fb42..3b5eb2d4b964 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -634,7 +634,7 @@ static void blk_account_io_merge(struct request *req)
 		part = req->part;
 
 		part_round_stats(cpu, part);
-		part_dec_in_flight(part, rq_data_dir(req));
+		part_dec_in_flight(cpu, part, rq_data_dir(req));
 
 		hd_struct_put(part);
 		part_stat_unlock();
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index d0be72ccb091..a7b897740c47 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 		bitnr += tags->nr_reserved_tags;
 	rq = tags->rqs[bitnr];
 
-	if (rq->q == hctx->queue)
+	if (rq && rq->q == hctx->queue)
 		iter_data->fn(hctx, rq, iter_data->data, reserved);
 	return true;
 }
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 05dfa3f270ae..cad4d2c26285 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -43,6 +43,58 @@ static LIST_HEAD(all_q_list);
 static void blk_mq_poll_stats_start(struct request_queue *q);
 static void blk_mq_poll_stats_fn(struct blk_stat_callback *cb);
 
+struct mq_inflight {
+	struct hd_struct *part;
+	unsigned int inflight;
+};
+
+static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
+				  struct request *rq, void *priv,
+				  bool reserved)
+{
+	struct mq_inflight *mi = priv;
+
+	if (rq->part == mi->part &&
+	    test_bit(REQ_ATOM_STARTED, &rq->atomic_flags))
+		mi->inflight++;
+}
+
+unsigned long part_in_flight_mq(struct request_queue *q,
+				struct hd_struct *part)
+{
+	struct mq_inflight mi = { .part = part, .inflight = 0 };
+
+	blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
+	return mi.inflight;
+}
+
+static void __part_round_stats_mq(struct request_queue *q, int cpu,
+				  struct hd_struct *part, unsigned long now)
+{
+	unsigned long inflight;
+
+	if (now == part->stamp)
+		return;
+
+	inflight = part_in_flight_mq(q, part);
+	if (inflight) {
+		__part_stat_add(cpu, part, time_in_queue,
+				inflight * (now - part->stamp));
+		__part_stat_add(cpu, part, io_ticks, (now - part->stamp));
+	}
+	part->stamp = now;
+}
+
+void part_round_stats_mq(struct request_queue *q, int cpu,
+			 struct hd_struct *part)
+{
+	unsigned long now = jiffies;
+
+	if (part->partno)
+		__part_round_stats_mq(q, cpu, &part_to_disk(part)->part0, now);
+	__part_round_stats_mq(q, cpu, part, now);
+}
+
 static int blk_mq_poll_stats_bkt(const struct request *rq)
 {
 	int ddir, bytes, bucket;
diff --git a/block/blk.h b/block/blk.h
index 01ebb8185f6b..4803289467ac 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -264,6 +264,11 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
 		q->last_merge = NULL;
 }
 
+extern void part_round_stats_mq(struct request_queue *q, int cpu,
+					struct hd_struct *part);
+extern unsigned long part_in_flight_mq(struct request_queue *q,
+					struct hd_struct *part);
+
 /*
  * Internal io_context interface
  */
diff --git a/block/genhd.c b/block/genhd.c
index 7f520fa25d16..8ec19773ce68 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -1204,6 +1204,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 	struct disk_part_iter piter;
 	struct hd_struct *hd;
 	char buf[BDEVNAME_SIZE];
+	unsigned long inflight;
 	int cpu;
 
 	/*
@@ -1217,10 +1218,17 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 	disk_part_iter_init(&piter, gp, DISK_PITER_INCL_EMPTY_PART0);
 	while ((hd = disk_part_iter_next(&piter))) {
 		cpu = part_stat_lock();
-		part_round_stats(cpu, hd);
+		if (gp->queue->mq_ops)
+			part_round_stats_mq(gp->queue, cpu, hd);
+		else
+			part_round_stats(cpu, hd);
 		part_stat_unlock();
+		if (gp->queue->mq_ops)
+			inflight = part_in_flight_mq(gp->queue, hd);
+		else
+			inflight = part_in_flight(hd);
 		seq_printf(seqf, "%4d %7d %s %lu %lu %lu "
-			   "%u %lu %lu %lu %u %u %u %u\n",
+			   "%u %lu %lu %lu %u %lu %u %u\n",
 			   MAJOR(part_devt(hd)), MINOR(part_devt(hd)),
 			   disk_name(gp, hd->partno, buf),
 			   part_stat_read(hd, ios[READ]),
@@ -1231,7 +1239,7 @@ static int diskstats_show(struct seq_file *seqf, void *v)
 			   part_stat_read(hd, merges[WRITE]),
 			   part_stat_read(hd, sectors[WRITE]),
 			   jiffies_to_msecs(part_stat_read(hd, ticks[WRITE])),
-			   part_in_flight(hd),
+			   inflight,
 			   jiffies_to_msecs(part_stat_read(hd, io_ticks)),
 			   jiffies_to_msecs(part_stat_read(hd, time_in_queue))
 			);
diff --git a/block/partition-generic.c b/block/partition-generic.c
index c5ec8246e25e..94aa92c3c010 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -19,6 +19,7 @@
 #include <linux/blktrace_api.h>
 
 #include "partitions/check.h"
+#include "blk.h"
 
 #ifdef CONFIG_BLK_DEV_MD
 extern void md_autodetect_dev(dev_t dev);
@@ -111,16 +112,27 @@ static ssize_t part_discard_alignment_show(struct device *dev,
 ssize_t part_stat_show(struct device *dev,
 		       struct device_attribute *attr, char *buf)
 {
+	struct gendisk *disk = dev_to_disk(dev);
 	struct hd_struct *p = dev_to_part(dev);
+	unsigned long inflight;
 	int cpu;
 
 	cpu = part_stat_lock();
-	part_round_stats(cpu, p);
+	if (disk->queue->mq_ops)
+		part_round_stats_mq(disk->queue, cpu, p);
+	else
+		part_round_stats(cpu, p);
 	part_stat_unlock();
+
+	if (disk->queue->mq_ops)
+		inflight = part_in_flight_mq(disk->queue, p);
+	else
+		inflight = part_in_flight(p);
+
 	return sprintf(buf,
 		"%8lu %8lu %8llu %8u "
 		"%8lu %8lu %8llu %8u "
-		"%8u %8u %8u"
+		"%8lu %8u %8u"
 		"\n",
 		part_stat_read(p, ios[READ]),
 		part_stat_read(p, merges[READ]),
@@ -130,7 +142,7 @@ ssize_t part_stat_show(struct device *dev,
 		part_stat_read(p, merges[WRITE]),
 		(unsigned long long)part_stat_read(p, sectors[WRITE]),
 		jiffies_to_msecs(part_stat_read(p, ticks[WRITE])),
-		part_in_flight(p),
+		inflight,
 		jiffies_to_msecs(part_stat_read(p, io_ticks)),
 		jiffies_to_msecs(part_stat_read(p, time_in_queue)));
 }
@@ -140,8 +152,8 @@ ssize_t part_inflight_show(struct device *dev,
 {
 	struct hd_struct *p = dev_to_part(dev);
 
-	return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]),
-		atomic_read(&p->in_flight[1]));
+	return sprintf(buf, "%8lu %8lu\n", part_stat_read(p, in_flight[0]),
+		part_stat_read(p, in_flight[1]));
 }
 
 #ifdef CONFIG_FAIL_MAKE_REQUEST
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 402946035308..1034abffd10d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -517,9 +517,9 @@ static void start_io_acct(struct dm_io *io)
 
 	cpu = part_stat_lock();
 	part_round_stats(cpu, &dm_disk(md)->part0);
+	part_inc_in_flight(cpu, &dm_disk(md)->part0, rw);
+	atomic_inc(&md->pending[rw]);
 	part_stat_unlock();
-	atomic_set(&dm_disk(md)->part0.in_flight[rw],
-		atomic_inc_return(&md->pending[rw]));
 
 	if (unlikely(dm_stats_used(&md->stats)))
 		dm_stats_account_io(&md->stats, bio_data_dir(bio),
@@ -532,7 +532,7 @@ static void end_io_acct(struct dm_io *io)
 	struct mapped_device *md = io->md;
 	struct bio *bio = io->bio;
 	unsigned long duration = jiffies - io->start_time;
-	int pending;
+	int pending, cpu;
 	int rw = bio_data_dir(bio);
 
 	generic_end_io_acct(rw, &dm_disk(md)->part0, io->start_time);
@@ -546,9 +546,11 @@ static void end_io_acct(struct dm_io *io)
 	 * After this is decremented the bio must not be touched if it is
 	 * a flush.
 	 */
+	cpu = part_stat_lock();
 	pending = atomic_dec_return(&md->pending[rw]);
-	atomic_set(&dm_disk(md)->part0.in_flight[rw], pending);
+	part_dec_in_flight(cpu, &dm_disk(md)->part0, rw);
 	pending += atomic_read(&md->pending[rw^0x1]);
+	part_stat_unlock();
 
 	/* nudge anyone waiting on suspend queue */
 	if (!pending)
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index e619fae2f037..d050a509bdd4 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -87,6 +87,7 @@ struct disk_stats {
 	unsigned long ticks[2];
 	unsigned long io_ticks;
 	unsigned long time_in_queue;
+	unsigned long in_flight[2];
 };
 
 #define PARTITION_META_INFO_VOLNAMELTH	64
@@ -120,7 +121,6 @@ struct hd_struct {
 	int make_it_fail;
 #endif
 	unsigned long stamp;
-	atomic_t in_flight[2];
 #ifdef	CONFIG_SMP
 	struct disk_stats __percpu *dkstats;
 #else
@@ -292,6 +292,17 @@ extern struct hd_struct *disk_map_sector_rcu(struct gendisk *disk,
 #define __part_stat_add(cpu, part, field, addnd)			\
 	(per_cpu_ptr((part)->dkstats, (cpu))->field += (addnd))
 
+#define part_stat_read_double(part, field1, field2)					\
+({									\
+	typeof((part)->dkstats->field1) res = 0;				\
+	unsigned int _cpu;						\
+	for_each_possible_cpu(_cpu) {					\
+		res += per_cpu_ptr((part)->dkstats, _cpu)->field1;	\
+		res += per_cpu_ptr((part)->dkstats, _cpu)->field2;	\
+	}								\
+	res;								\
+})
+
 #define part_stat_read(part, field)					\
 ({									\
 	typeof((part)->dkstats->field) res = 0;				\
@@ -362,23 +373,23 @@ static inline void free_part_stats(struct hd_struct *part)
 #define part_stat_sub(cpu, gendiskp, field, subnd)			\
 	part_stat_add(cpu, gendiskp, field, -subnd)
 
-static inline void part_inc_in_flight(struct hd_struct *part, int rw)
+static inline void part_inc_in_flight(int cpu, struct hd_struct *part, int rw)
 {
-	atomic_inc(&part->in_flight[rw]);
+	part_stat_inc(cpu, part, in_flight[rw]);
 	if (part->partno)
-		atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+		part_stat_inc(cpu, &part_to_disk(part)->part0, in_flight[rw]);
 }
 
-static inline void part_dec_in_flight(struct hd_struct *part, int rw)
+static inline void part_dec_in_flight(int cpu, struct hd_struct *part, int rw)
 {
-	atomic_dec(&part->in_flight[rw]);
+	part_stat_dec(cpu, part, in_flight[rw]);
 	if (part->partno)
-		atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+		part_stat_dec(cpu, &part_to_disk(part)->part0, in_flight[rw]);
 }
 
-static inline int part_in_flight(struct hd_struct *part)
+static inline unsigned long part_in_flight(struct hd_struct *part)
 {
-	return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]);
+	return part_stat_read_double(part, in_flight[0], in_flight[1]);
 }
 
 static inline struct partition_meta_info *alloc_part_info(struct gendisk *disk)
-- 
Jens Axboe
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help