Re: [PATCH 1/1] block: Convert hd_struct in_flight from atomic to percpu
From: Ming Lei <hidden>
Date: 2017-06-30 01:08:01
Also in:
dm-devel
Subsystem:
block layer, device-mapper (lvm), the rest · Maintainers:
Jens Axboe, Alasdair Kergon, Mike Snitzer, Mikulas Patocka, Benjamin Marzinski, Linus Torvalds
On Thu, Jun 29, 2017 at 12:31:05PM -0500, Brian King wrote:
On 06/29/2017 11:25 AM, Ming Lei wrote:quoted
On Thu, Jun 29, 2017 at 11:58 PM, Jens Axboe [off-list ref] wrote:quoted
On 06/29/2017 02:40 AM, Ming Lei wrote:quoted
On Thu, Jun 29, 2017 at 5:49 AM, Jens Axboe [off-list ref] wrote:quoted
On 06/28/2017 03:12 PM, Brian King wrote:quoted
This patch converts the in_flight counter in struct hd_struct from a pair of atomics to a pair of percpu counters. This eliminates a cou=
ple
quoted
quoted
quoted
quoted
quoted
of atomics from the hot path. When running this on a Power system, =
to
quoted
quoted
quoted
quoted
quoted
a single null_blk device with 80 submission queues, irq mode 0, with 80 fio jobs, I saw IOPs go from 1.5M IO/s to 11.4 IO/s.This has been done before, but I've never really liked it. The reaso=
n is
quoted
quoted
quoted
quoted
that it means that reading the part stat inflight count now has to iterate over every possible CPU. Did you use partitions in your test=
ing?
quoted
quoted
quoted
quoted
How many CPUs were configured? When I last tested this a few years a=
go
quoted
quoted
quoted
quoted
on even a quad core nehalem (which is notoriously shitty for cross-n=
ode
quoted
quoted
quoted
quoted
latencies), it was a net loss.One year ago, I saw null_blk's IOPS can be decreased to 10% of non-RQF_IO_STAT on a dual socket ARM64(each CPU has 96 cores, and dual numa nodes) too, the performance can be recovered basically if per numa-node counter is introduced and used in this case, but the patch was never posted out. If anyone is interested in that, I can rebase the patch on current block tree and post out. I guess the performance issue might be related with system cache coherency implementation more or less. This issue on ARM64 can be observed with the following userspace atomic counting test too: http://kernel.ubuntu.com/~ming/test/cache/How well did the per-node thing work? Doesn't seem to me like it would=20 Last time, on ARM64, I remembered that the IOPS was basically recovered, but now I don't have a such machine to test. Could Brian test the attac=
hed patch
quoted
to see if it works on big Power machine? =20 And the idea is simple, just make the atomic counter per-node.=20 I tried loading the patch and get an oops right away on boot. Haven't bee=
n able to=20
debug anything yet. This is on top of an older kernel, so not sure if tha=
t is
the issue or not. I can try upstream and see if i have different results.=
=2E.
=20 =1B[-1;-1fUbuntu 16.04=1B[-1;-1f. . . .Unable to handle kernel paging =
request for data at address 0xc00031313a333532
Faulting instruction address: 0xc0000000002552c4 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=3D1024 NUMA=20 PowerNV Modules linked in: dm_round_robin vmx_crypto powernv_rng leds_powernv pow=
ernv_op_panel led_class rng_core dm_multipath autofs4 raid10 raid456 async_= raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath= dm_mirror dm_region_hash dm_log cxlflash bnx2x mdio libcrc32c nvme nvme_co= re lpfc cxl crc_t10dif crct10dif_generic crct10dif_common
CPU: 9 PID: 3485 Comm: multipathd Not tainted 4.9.10-dirty #7 task: c000000fba0d0000 task.stack: c000000fb8a1c000 NIP: c0000000002552c4 LR: c000000000255274 CTR: 0000000000000000 REGS: c000000fb8a1f350 TRAP: 0300 Not tainted (4.9.10-dirty) MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24028848 =
XER: 00000000
CFAR: c000000000008a60 DAR: c00031313a333532 DSISR: 40000000 SOFTE: 1=20 GPR00: c0000000001f8980 c000000fb8a1f5d0 c000000000f24300 c000000fc4007c0=
0=20
GPR04: 00000000024000c0 c0000000002fc0ac 000000000000025d c000000fc5d346e=
0=20
GPR08: 0000000fc50d6000 0000000000000000 c00031313a333532 c000000fbc83624=
0=20
GPR12: 0000000000002200 c00000000ff02400 00003fff79cfebf0 000000000000000=
0=20
GPR16: 00000000c138fd03 000001003a12bf70 00003fff79ae75e0 00003fff79d269e=
8=20
GPR20: 0000000000000001 00003fff79cfebf0 000001003a393aa0 00003fff79d067b=
8=20
GPR24: 0000000000000000 c000000000b04948 000000000000a1ff c0000000002fc0a=
c=20
GPR28: 00000000024000c0 0000000000000007 c0000000002fc0ac c000000fc4007c0=
0=20
NIP [c0000000002552c4] __kmalloc_track_caller+0x94/0x2f0 LR [c000000000255274] __kmalloc_track_caller+0x44/0x2f0 Call Trace: [c000000fb8a1f5d0] [c000000fb8a1f610] 0xc000000fb8a1f610 (unreliable) [c000000fb8a1f620] [c0000000001f8980] kstrdup+0x50/0xb0 [c000000fb8a1f660] [c0000000002fc0ac] __kernfs_new_node+0x4c/0x140 [c000000fb8a1f6b0] [c0000000002fd9f4] kernfs_new_node+0x34/0x80 [c000000fb8a1f6e0] [c000000000300708] kernfs_create_link+0x38/0xf0 [c000000fb8a1f720] [c000000000301cb8] sysfs_do_create_link_sd.isra.0+0xa8=
/0x160
[c000000fb8a1f770] [c00000000054a658] device_add+0x2b8/0x740 [c000000fb8a1f830] [c00000000054ae58] device_create_groups_vargs+0x178/0x=
190
[c000000fb8a1f890] [c0000000001fcd70] bdi_register+0x80/0x1d0 [c000000fb8a1f8c0] [c0000000001fd244] bdi_register_owner+0x44/0x80 [c000000fb8a1f940] [c000000000453bbc] device_add_disk+0x1cc/0x500 [c000000fb8a1f9f0] [c000000000764dec] dm_create+0x33c/0x5f0 [c000000fb8a1fa90] [c00000000076d9ac] dev_create+0x8c/0x3e0 [c000000fb8a1fb40] [c00000000076d1fc] ctl_ioctl+0x38c/0x580 [c000000fb8a1fd20] [c00000000076d410] dm_ctl_ioctl+0x20/0x30 [c000000fb8a1fd40] [c0000000002799ac] do_vfs_ioctl+0xcc/0x8f0 [c000000fb8a1fde0] [c00000000027a230] SyS_ioctl+0x60/0xc0 [c000000fb8a1fe30] [c00000000000bfe0] system_call+0x38/0xfc Instruction dump: 39290008 7cc8482a e94d0030 e9230000 7ce95214 7d49502a e9270010 2faa0000=
=20
419e007c 2fa90000 419e0074 e93f0022 <7f4a482a> 39200000 88ad023a 992d023a=
=20
---[ end trace dcdac2d3f668d033 ]---
Thanks for the test! Looks there is one double free in patch, could you test the new one? ---
diff --git a/block/partition-generic.c b/block/partition-generic.c
index c5ec8246e25e..bd6644bf9643 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c@@ -140,8 +140,9 @@ ssize_t part_inflight_show(struct device *dev, { struct hd_struct *p =3D dev_to_part(dev);
=20 - return sprintf(buf, "%8u %8u\n", atomic_read(&p->in_flight[0]), - atomic_read(&p->in_flight[1])); + return sprintf(buf, "%8u %8u\n", + pnode_counter_read_all(&p->in_flight[0]), + pnode_counter_read_all(&p->in_flight[1])); } =20 #ifdef CONFIG_FAIL_MAKE_REQUEST
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 96bd13e581cd..aac0b2235410 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c@@ -521,7 +521,7 @@ static void start_io_acct(struct dm_io *io) cpu =3D part_stat_lock(); part_round_stats(cpu, &dm_disk(md)->part0); part_stat_unlock(); - atomic_set(&dm_disk(md)->part0.in_flight[rw], + pnode_counter_set(&dm_disk(md)->part0.in_flight[rw], atomic_inc_return(&md->pending[rw]));
=20 if (unlikely(dm_stats_used(&md->stats)))
@@ -550,7 +550,7 @@ static void end_io_acct(struct dm_io *io) * a flush. */ pending =3D atomic_dec_return(&md->pending[rw]); - atomic_set(&dm_disk(md)->part0.in_flight[rw], pending); + pnode_counter_set(&dm_disk(md)->part0.in_flight[rw], pending); pending +=3D atomic_read(&md->pending[rw^0x1]);
=20 /* nudge anyone waiting on suspend queue */
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index e619fae2f037..40c9bc74a120 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h@@ -15,6 +15,7 @@ #include <linux/slab.h> #include <linux/percpu-refcount.h> #include <linux/uuid.h> +#include <linux/pernode_counter.h>
=20 #ifdef CONFIG_BLOCK =20
@@ -120,7 +121,7 @@ struct hd_struct { int make_it_fail; #endif unsigned long stamp; - atomic_t in_flight[2]; + struct pnode_counter in_flight[2]; #ifdef CONFIG_SMP struct disk_stats __percpu *dkstats; #else
@@ -364,21 +365,22 @@ static inline void free_part_stats(struct hd_struct *=part)
=20
static inline void part_inc_in_flight(struct hd_struct *part, int rw)
{
- atomic_inc(&part->in_flight[rw]);
+ pnode_counter_inc(&part->in_flight[rw]);
if (part->partno)
- atomic_inc(&part_to_disk(part)->part0.in_flight[rw]);
+ pnode_counter_inc(&part_to_disk(part)->part0.in_flight[rw]);
}
=20
static inline void part_dec_in_flight(struct hd_struct *part, int rw)
{
- atomic_dec(&part->in_flight[rw]);
+ pnode_counter_dec(&part->in_flight[rw]);
if (part->partno)
- atomic_dec(&part_to_disk(part)->part0.in_flight[rw]);
+ pnode_counter_dec(&part_to_disk(part)->part0.in_flight[rw]);
}
=20
static inline int part_in_flight(struct hd_struct *part)
{
- return atomic_read(&part->in_flight[0]) + atomic_read(&part->in_flight[1]=
);
+ return pnode_counter_read_all(&part->in_flight[0]) + \
+ pnode_counter_read_all(&part->in_flight[1]);
}
=20
static inline struct partition_meta_info *alloc_part_info(struct gendisk *=
disk)@@ -627,11 +629,34 @@ extern ssize_t part_fail_store(struct device *dev, const char *buf, size_t count); #endif /* CONFIG_FAIL_MAKE_REQUEST */
=20
+static inline int hd_counter_init(struct hd_struct *part)
+{
+ if (pnode_counter_init(&part->in_flight[0], GFP_KERNEL))
+ return -ENOMEM;
+ if (pnode_counter_init(&part->in_flight[1], GFP_KERNEL)) {
+ pnode_counter_deinit(&part->in_flight[0]);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
+static inline void hd_counter_deinit(struct hd_struct *part)
+{
+ pnode_counter_deinit(&part->in_flight[0]);
+ pnode_counter_deinit(&part->in_flight[1]);
+}
+
static inline int hd_ref_init(struct hd_struct *part)
{
+ if (hd_counter_init(part))
+ return -ENOMEM;
+
if (percpu_ref_init(&part->ref, __delete_partition, 0,
- GFP_KERNEL))
+ GFP_KERNEL)) {
+ hd_counter_deinit(part);
return -ENOMEM;
+ }
return 0;
}
=20@@ -659,6 +684,7 @@ static inline void hd_free_part(struct hd_struct *part) { free_part_stats(part); free_part_info(part); + hd_counter_deinit(part); percpu_ref_exit(&part->ref); }
=20
diff --git a/include/linux/pernode_counter.h b/include/linux/pernode_counte=r.h new file mode 100644 index 000000000000..e95b5f2d1d9c
--- /dev/null
+++ b/include/linux/pernode_counter.h@@ -0,0 +1,117 @@ +#ifndef _LINUX_PERNODE_COUNTER_H +#define _LINUX_PERNODE_COUNTER_H +/* + * A simple per node atomic counter for use in block io accounting. + */ + +#include <linux/smp.h> +#include <linux/percpu.h> +#include <linux/types.h> +#include <linux/gfp.h> + +struct node_counter { + atomic_t counter_in_node; +}; + +struct pnode_counter { + struct node_counter * __percpu *counter; + struct node_counter **nodes; +}; + +static inline int pnode_counter_init(struct pnode_counter *pnc, gfp_t gfp) +{ + struct node_counter **nodes; + int i, j, cpu; + + nodes =3D kzalloc(nr_node_ids * sizeof(struct node_counter *), gfp); + if (!nodes) + goto err_nodes; + + for_each_node(i) { + nodes[i] =3D kzalloc_node(sizeof(struct node_counter), gfp, i); + if (!nodes[i]) + goto err_node_counter; + } + + pnc->counter =3D alloc_percpu_gfp(struct node_counter *, gfp); + if (!pnc->counter) + goto err_node_counter; + + for_each_possible_cpu(cpu) + *per_cpu_ptr(pnc->counter, cpu) =3D nodes[cpu_to_node(cpu)]; + + pnc->nodes =3D nodes; + + return 0; + + err_node_counter: + for (j =3D 0; j < i; j++) + kfree(nodes[j]); + kfree(nodes); + err_nodes: + return -ENOMEM; +} + +static inline void pnode_counter_deinit(struct pnode_counter *pnc) +{ + int i; + + for_each_node(i) { + struct node_counter *node =3D pnc->nodes[i]; + + kfree(node); + } + free_percpu(pnc->counter); + kfree(pnc->nodes); +} + +static inline void pnode_counter_inc(struct pnode_counter *pnc) +{ + struct node_counter *node =3D this_cpu_read(*pnc->counter); + + atomic_inc(&node->counter_in_node); +} + +static inline void pnode_counter_inc_cpu(struct pnode_counter *pnc, int cp=
u)
+{
+ struct node_counter *node =3D *per_cpu_ptr(pnc->counter, cpu);
+
+ atomic_inc(&node->counter_in_node);
+}
+
+static inline void pnode_counter_dec(struct pnode_counter *pnc)
+{
+ struct node_counter *node =3D this_cpu_read(*pnc->counter);
+
+ atomic_dec(&node->counter_in_node);
+}
+
+static inline void pnode_counter_dec_cpu(struct pnode_counter *pnc, int cp=
u)
+{
+ struct node_counter *node =3D *per_cpu_ptr(pnc->counter, cpu);
+
+ atomic_dec(&node->counter_in_node);
+}
+
+static inline void pnode_counter_set(struct pnode_counter *pnc, int val)
+{
+ int i;
+ struct node_counter *node =3D this_cpu_read(*pnc->counter);
+
+ for_each_node(i)
+ atomic_set(&pnc->nodes[i]->counter_in_node, 0);
+ atomic_set(&node->counter_in_node, val);
+}
+
+static inline long pnode_counter_read_all(struct pnode_counter *pnc)
+{
+ int i;
+ long val =3D 0;
+
+ for_each_node(i)
+ val +=3D atomic_read(&pnc->nodes[i]->counter_in_node);
+
+ return val;
+}
+
+#endif /* _LINUX_PERNODE_COUNTER_H */
Thanks,
Ming