Thread (10 messages) 10 messages, 2 authors, 2018-03-26

Re: [PATCH] block, bfq: keep peak_rate estimation within range 1..2^32-1

From: Paolo Valente <hidden>
Date: 2018-03-26 14:06:19
Also in: lkml

Il giorno 26 mar 2018, alle ore 12:28, Konstantin Khlebnikov =
[off-list ref] ha scritto:
=20
=20
=20
On 26.03.2018 11:01, Paolo Valente wrote:
quoted
quoted
Il giorno 21 mar 2018, alle ore 00:49, Paolo Valente =
[off-list ref] ha scritto:
quoted
quoted
=20
=20
=20
quoted
Il giorno 20 mar 2018, alle ore 15:41, Konstantin Khlebnikov =
[off-list ref] ha scritto:
quoted
quoted
quoted
=20
On 20.03.2018 06:00, Paolo Valente wrote:
quoted
quoted
Il giorno 19 mar 2018, alle ore 14:28, Konstantin Khlebnikov =
[off-list ref] ha scritto:
quoted
quoted
quoted
quoted
quoted
=20
On 19.03.2018 09:03, Paolo Valente wrote:
quoted
quoted
Il giorno 05 mar 2018, alle ore 04:48, Konstantin Khlebnikov =
[off-list ref] ha scritto:
quoted
quoted
quoted
quoted
quoted
quoted
quoted
=20
Rate should never overflow or become zero because it is used as =
divider.
quoted
quoted
quoted
quoted
quoted
quoted
quoted
This patch accumulates it with saturation.
=20
Signed-off-by: Konstantin Khlebnikov =
[off-list ref]
quoted
quoted
quoted
quoted
quoted
quoted
quoted
---
block/bfq-iosched.c |    8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
=20
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index aeca22d91101..a236c8d541b5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2546,7 +2546,8 @@ static void =
bfq_reset_rate_computation(struct bfq_data *bfqd,
quoted
quoted
quoted
quoted
quoted
quoted
quoted
=20
static void bfq_update_rate_reset(struct bfq_data *bfqd, struct =
request *rq)
quoted
quoted
quoted
quoted
quoted
quoted
quoted
{
-	u32 rate, weight, divisor;
+	u32 weight, divisor;
+	u64 rate;
=20
	/*
	 * For the convergence property to hold (see comments on
@@ -2634,9 +2635,10 @@ static void bfq_update_rate_reset(struct =
bfq_data *bfqd, struct request *rq)
quoted
quoted
quoted
quoted
quoted
quoted
quoted
	 */
	bfqd->peak_rate *=3D divisor-1;
	bfqd->peak_rate /=3D divisor;
-	rate /=3D divisor; /* smoothing constant alpha =3D =
1/divisor */
quoted
quoted
quoted
quoted
quoted
quoted
quoted
+	do_div(rate, divisor);	/* smoothing constant alpha =3D =
1/divisor */
quoted
quoted
quoted
quoted
quoted
quoted
quoted
=20
-	bfqd->peak_rate +=3D rate;
+	/* rate should never overlow or become zero */
It is bfqd->peak_rate that is used as a divider, and =
bfqd->peak_rate doesn't risk to be zero even if the variable 'rate' is =
zero here.
quoted
quoted
quoted
quoted
quoted
quoted
So I guess the reason why you consider the possibility that =
bfqd->peak_rate becomes zero is because of an overflow when summing =
'rate'. But, according to my calculations, this should be impossible =
with devices with sensible speeds.
quoted
quoted
quoted
quoted
quoted
quoted
These are the reasons why I decided I could make it with a =
32-bit variable, without any additional clamping. Did I make any mistake =
in my evaluation?
quoted
quoted
quoted
quoted
quoted
=20
According to Murphy's law this is inevitable..
=20
Yep.  Actually Murphy has been even clement this time, by making =
the
quoted
quoted
quoted
quoted
failure occur to a kernel expert :)
quoted
I've seen couple division by zero crashes in bfq_wr_duration.
Unfortunately logs weren't recorded.
=20
quoted
Anyway, even if I made some mistake about the maximum possible =
value of the device rate, and the latter may be too high for =
bfqd->peak_rate to contain it, then I guess the right solution would not =
be to clamp the actual rate to U32_MAX, but to move bfqd->peak_rate to =
64 bits. Or am I missing something else?
quoted
quoted
quoted
quoted
quoted
quoted
quoted
quoted
+	bfqd->peak_rate =3D clamp_t(u64, rate + bfqd->peak_rate, =
1, U32_MAX);
quoted
quoted
quoted
quoted
quoted
=20
32-bit should be enough and better for division.
My patch makes sure it never overflows/underflows.
That's cheaper than full 64-bit/64-bit division.
Anyway 64-bit speed could overflow too. =3D)
=20
I see your point.  Still, if the mistake is not in sizing, then =
you
quoted
quoted
quoted
quoted
bumped into some odd bug.  In this respect, I don't like much the =
idea
quoted
quoted
quoted
quoted
of sweeping the dust under the carpet.  So, let me ask you for a
little bit more help.  With your patch applied, and thus with no =
risk
quoted
quoted
quoted
quoted
of crashes, what about adding, right before your clamp_t, =
something
quoted
quoted
quoted
quoted
like:
if (!bfqd->peak_rate)
	pr_crit(<dump of all the variables involved in updating =
bfqd->peak_rate>);
quoted
quoted
quoted
quoted
Once the failure shows up (Murphy permitting), we might have hints =
to
quoted
quoted
quoted
quoted
the bug causing it.
Apart from that, I have no problem with patches that make bfq more
robust, even in a sort of black-box way.
=20
This rate estimator is already works as a black-box with smoothing =
and
quoted
quoted
quoted
low-pass filter inside.
=20
Actually, it is just what you say: a low-pass filter, which works by
deflating two non-null quantities (the previous and the current rate
rate), by and adding these quantities to each other (to get the new
estimated rate).  In addition to being larger than 0, both =
quantities
quoted
quoted
are much lower than U32_MAX, so this computation should never yield
zero because of an overflow.
=20
Even some occasional zeros read for the sampled rate (because of the
suspicion you express below), cannot lead to a null estimated rate.
In fact, this is a low-pass filter, meant to cut off occasional
spurious values.
=20
I'm insisting on this part, just because, maybe, you say that it =
"acts
quoted
quoted
as a black-box" because there is something more, which I'm =
overlooking
quoted
quoted
here, and maybe the bug is exactly in the part I'm overlooking.  =
But,
quoted
quoted
if this is not the case, I'll just stop now for this point.
=20
quoted
It has limitations thus this is ok to declare
range of expected\sane results.
=20
=20
Correct, but if the odd value you cut off is a random value that
follows from a bug (and I see no other possibility, in view of the
above arguments), then the result may still be completely wrong.  =
For
quoted
quoted
example:
- the device may have a rate of, say, 200 MB/s
- the computation of the new estimated rate yields 0
- you limit this result to ~0 MB/s but >0 MB/s; this avoids division
errors, but still leaves you with a completely wrong rate, and a
consequently inconsistent behavior of the scheduler
=20
quoted
It would be nice to show estimated rate in sysfs to let user
diagnose whenever it is wrong for a long period of time.
=20
Absolutely.  This is actually already shown indirectly, through the
budget_max parameter, which is the number of sectors served at peak
rate during slice_sync.
=20
quoted
Printing message all the time even ratelimited is a wrong idea.
If this should never happens - WARN_ONCE is more than enough.
=20
=20
I'm sorry, I was totally unclear here.  My proposal for you is:
since this failure does occur in your system, please add, =
temporarily, the
quoted
quoted
printing of all involved values *all the times*, so that you can =
track down
quoted
quoted
the bug, and fix the actual problem leading to this division by =
zero.
quoted
quoted
=20
What do you think?
=20
quoted
I suspect crashes might be caused by bumpy timer which was affected =
by smi/nmi from mce.
quoted
quoted
=20
Yep.  The problem is that, for the estimated rate to go to zero, a
null value must be sampled for several consecutive times.
=20
quoted
I'll try to find evidence.
=20
Getting to the bottom of this would be really great.
=20
Looking forward to your findings,
I guess no news yet.  Anyway, your bug report has pushed me to think
over the representation of the rate.  And I may have found a problem,
which might be the cause of the zeros you have seen.
With a little math, it comes out that, for a sampled rate below about
5 KB/s, 0 is stored in the variable 'rate'.  The reason is that such
low values cannot be represented with the maximum precision currently
available in BFQ for rates.  So, if these very low values are sampled
for a few consecutive times, then the estimated peak rate may =
actually
quoted
become zero.
In the system in which you have seen null estimated rates, may the
virtual or physical storage device be occasionally so slow?
=20
I've found couple crashes from qemu-kvm virtual machines.
Their disks are slow for sure. Zero rate for a minute wouldn't be a =
surprise.
=20
Ok, looks like we are converging.  Hoping that it is ok for you, I
took the liberty to prepare a patch that only lower-bounds the
estimated rate to 1, and that, to prevent future problems of this
kind, adds comments to report the range of peak rates representable.
I have added you as signer of the patch, being this new patch just a
subset of yours.

If this patch is the right solution to the failure you detected, then
the latter should just go away.

Thank you,
Paolo
<4>[125253.240885] divide error: 0000 [#1] SMP PTI
<4>[125253.241031] Modules linked in: netconsole configfs tcp_diag =
inet_diag unix_diag nf_log_ipv6 nf_log_common xt_LOG xt_u32 ip6t_REJECT =
nf_reject_ipv6 xt_conntrack xt_multiport ip6table_nat nf_conntrack_ipv6 =
nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables ipt_REJECT =
nf_reject_ipv4 xt_tcpudp xt_mark xt_owner iptable_nat nf_conntrack_ipv4 =
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables =
x_tables nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc 8021q garp mrp =
stp llc drm_kms_helper syscopyarea sysfillrect sysimgblt =
snd_hda_codec_generic fb_sys_fops snd_hda_intel ttm drm snd_hda_codec =
snd_hwdep snd_hda_core snd_pcm snd_timer snd soundcore i2c_piix4 ppdev =
kvm_intel kvm irqbypass input_leds parport_pc fuse lp parport xfs raid10 =
raid456 async_raid6_recov async_memcpy async_pq async_xor
<4>[125253.242257]  async_tx xor raid6_pq libcrc32c raid1 raid0 =
multipath linear ata_generic pata_acpi floppy [last unloaded: =
netconsole]
<4>[125253.242537] CPU: 0 PID: 36766 Comm: push-client Not tainted =
4.14.15-5 #1
<4>[125253.242731] Hardware name: QEMU Standard PC (i440FX + PIIX, =
1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
<4>[125253.242928] task: ffff880210ed3500 task.stack: ffffc90002d78000
<4>[125253.243092] RIP: 0010:bfq_wr_duration.isra.15.part.16+0xc/0x40
<4>[125253.243240] RSP: 0018:ffffc90002d7b9e0 EFLAGS: 00010046
<4>[125253.243357] RAX: 00000000000d59f8 RBX: ffff88023014af00 RCX: =
000000000000001e
<4>[125253.243567] RDX: 0000000000000000 RSI: ffff8802302685a0 RDI: =
0000000000000000
<4>[125253.243785] RBP: ffff880230268400 R08: 0000000000000001 R09: =
0000000000000001
<4>[125253.244004] R10: ffff8802305341b0 R11: ffff880230268400 R12: =
ffff88023066cb60
<4>[125253.244222] R13: 0000000000000000 R14: ffff8802305341b0 R15: =
ffffc90002d7ba88
<4>[125253.244489] FS:  00007f74d59e6700(0000) =
GS:ffff88023fc00000(0000) knlGS:0000000000000000
<4>[125253.244784] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[125253.245073] CR2: 00007f78e7261258 CR3: 000000021275e000 CR4: =
00000000000006f0
<4>[125253.245406] DR0: 0000000000000000 DR1: 0000000000000000 DR2: =
0000000000000000
<4>[125253.245696] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: =
0000000000000400
<4>[125253.245961] Call Trace:
<4>[125253.246097]  bfq_insert_requests+0xda7/0xf60
<4>[125253.246313]  blk_mq_sched_insert_request+0xf9/0x150
<4>[125253.246520]  blk_mq_make_request+0x1da/0x630
<4>[125253.246692]  generic_make_request+0xfb/0x2d0
<4>[125253.246847]  ? submit_bio+0x73/0x150
<4>[125253.246984]  submit_bio+0x73/0x150
<4>[125253.247229]  xfs_submit_ioend+0x80/0x1a0 [xfs]
<4>[125253.247466]  xfs_do_writepage+0x4d7/0x710 [xfs]
<4>[125253.247636]  ? invalid_page_referenced_vma+0x90/0x90
<4>[125253.247906]  write_cache_pages+0x23a/0x4e0
<4>[125253.248166]  ? xfs_add_to_ioend+0x240/0x240 [xfs]
<4>[125253.248345]  ? dequeue_entity+0x2f4/0xa90
<4>[125253.248596]  xfs_vm_writepages+0x94/0xd0 [xfs]
<4>[125253.248793]  do_writepages+0x17/0x70
<4>[125253.249020]  __filemap_fdatawrite_range+0xa7/0xe0
<4>[125253.249269]  file_write_and_wait_range+0x4c/0xa0
<4>[125253.249514]  xfs_file_fsync+0x55/0x220 [xfs]
<4>[125253.249719]  do_fsync+0x38/0x60
<4>[125253.249877]  SyS_fsync+0xc/0x10
<4>[125253.250098]  entry_SYSCALL_64_fastpath+0x20/0x83
<4>[125253.250285] RIP: 0033:0x7f74e92a6a7d
<4>[125253.250484] RSP: 002b:00007f74d59dcc90 EFLAGS: 00000293
<4>[125253.250485] Code: ff ff 8b 07 4d 29 c8 4c 0f af c0 e9 27 ff ff =
ff 48 89 c8 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 06 =
31 d2 89 ff <48> f7 f7 48 89 c1 b8 b2 0c 00 00 48 81 f9 b2 0c 00 00 77 =
0f 48
<1>[125253.251304] RIP: bfq_wr_duration.isra.15.part.16+0xc/0x40 RSP: =
ffffc90002d7b9e0
=20
And this one definitely have timer issues.
=20
<4>[69620.113311] hpet1: lost 1123 rtc interrupts
<4>[69725.670214] hpet1: lost 1257 rtc interrupts
<4>[69730.301092] hpet1: lost 295 rtc interrupts
<4>[69785.817074] hpet1: lost 804 rtc interrupts
<4>[70276.099325] hpet1: lost 1138 rtc interrupts
<4>[70338.962431] hpet1: lost 1274 rtc interrupts
<4>[70687.531960] hpet1: lost 314 rtc interrupts
<4>[70708.789681] hpet1: lost 1359 rtc interrupts
<4>[70892.699792] hpet1: lost 775 rtc interrupts
<4>[70985.631630] hpet1: lost 448 rtc interrupts
<4>[71045.952024] hpet1: lost 1109 rtc interrupts
<4>[71236.781185] hpet1: lost 1216 rtc interrupts
<4>[71253.521384] hpet1: lost 1071 rtc interrupts
<4>[71477.087387] hpet1: lost 563 rtc interrupts
<4>[71568.587212] hpet1: lost 357 rtc interrupts
<4>[71837.330897] hpet1: lost 705 rtc interrupts
<4>[72029.244809] hpet1: lost 1286 rtc interrupts
<4>[72251.173477] hpet1: lost 458 rtc interrupts
<4>[72398.366782] hpet1: lost 1172 rtc interrupts
<4>[73218.363935] hpet1: lost 249 rtc interrupts
<4>[73280.472760] hpet1: lost 1225 rtc interrupts
<4>[73458.506121] hpet1: lost 398 rtc interrupts
<4>[73504.331306] hpet1: lost 183 rtc interrupts
<4>[73596.270009] hpet1: lost 386 rtc interrupts
<4>[73858.989240] hpet1: lost 320 rtc interrupts
<4>[73903.326725] hpet1: lost 88 rtc interrupts
<4>[74170.760022] hpet1: lost 622 rtc interrupts
<4>[74517.835154] hpet1: lost 216 rtc interrupts
<4>[74609.630248] hpet1: lost 375 rtc interrupts
<4>[74884.937850] hpet1: lost 1126 rtc interrupts
<4>[74886.291818] hpet1: lost 86 rtc interrupts
<4>[74902.415909] hpet1: lost 1031 rtc interrupts
<4>[75134.991339] hpet1: lost 1140 rtc interrupts
<4>[75139.948051] hpet1: lost 316 rtc interrupts
<4>[75183.709393] hpet1: lost 50 rtc interrupts
<4>[75486.178692] hpet1: lost 115 rtc interrupts
<4>[75634.770692] hpet1: lost 1261 rtc interrupts
<4>[76322.311822] hpet1: lost 18 rtc interrupts
<4>[76941.322201] hpet1: lost 1128 rtc interrupts
<4>[77341.716930] hpet1: lost 884 rtc interrupts
<4>[77347.532830] hpet1: lost 371 rtc interrupts
<4>[77348.620826] hpet1: lost 68 rtc interrupts
<4>[77362.430004] divide error: 0000 [#1] SMP PTI
<4>[77362.430139] Modules linked in: netconsole configfs tcp_diag =
inet_diag unix_diag nf_log_ipv6 nf_log_common xt_LOG xt_u32 ip6t_REJECT =
nf_reject_ipv6 xt_conntrack xt_multiport ip6table_nat nf_conntrack_ipv6 =
nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables ipt_REJECT =
nf_reject_ipv4 xt_tcpudp xt_mark xt_owner iptable_nat nf_conntrack_ipv4 =
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables =
x_tables nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc =
snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core =
snd_pcm snd_timer drm_kms_helper snd syscopyarea sysfillrect sysimgblt =
fb_sys_fops ttm soundcore drm 8021q garp mrp stp llc ppdev i2c_piix4 =
kvm_intel input_leds kvm irqbypass parport_pc fuse lp parport xfs raid10 =
raid456 async_raid6_recov async_memcpy async_pq async_xor
<4>[77362.431347]  async_tx xor raid6_pq libcrc32c raid1 raid0 floppy =
multipath linear ata_generic pata_acpi [last unloaded: netconsole]
<4>[77362.431545] CPU: 0 PID: 39134 Comm: skynet.heartbea Not tainted =
4.14.15-5 #1
<4>[77362.431720] Hardware name: QEMU Standard PC (i440FX + PIIX, =
1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
<4>[77362.432013] task: ffff8800964adcc0 task.stack: ffffc900030bc000
<4>[77362.432185] RIP: 0010:bfq_wr_duration.isra.15.part.16+0xc/0x40
<4>[77362.432348] RSP: 0018:ffffc900030bfa90 EFLAGS: 00010046
<4>[77362.432552] RAX: 00000000000d59f8 RBX: ffff8801348d6900 RCX: =
000000000000001e
<4>[77362.432725] RDX: 0000000000000000 RSI: ffff8801348c1da0 RDI: =
0000000000000000
<4>[77362.432992] RBP: ffff8801348c1c00 R08: 0000000000000001 R09: =
0000000000000001
<4>[77362.433214] R10: ffff8800b9eacf30 R11: ffff8801348c1c00 R12: =
ffff880134decb60
<4>[77362.433466] R13: 0000000000000000 R14: ffff8800b9eacf30 R15: =
ffffc900030bfb38
<4>[77362.433696] FS:  00007fe28c721700(0000) =
GS:ffff88013fc00000(0000) knlGS:0000000000000000
<4>[77362.433948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[77362.434125] CR2: 0000000001370ec8 CR3: 0000000135b2c000 CR4: =
00000000000006f0
<4>[77362.434328] DR0: 0000000000000000 DR1: 0000000000000000 DR2: =
0000000000000000
<4>[77362.434571] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: =
0000000000000400
<4>[77362.434828] Call Trace:
<4>[77362.434997]  bfq_insert_requests+0xda7/0xf60
<4>[77362.435150]  blk_mq_sched_insert_request+0xf9/0x150
<4>[77362.435366]  blk_mq_make_request+0x1da/0x630
<4>[77362.435516]  generic_make_request+0xfb/0x2d0
<4>[77362.435701]  ? submit_bio+0x73/0x150
<4>[77362.435827]  submit_bio+0x73/0x150
<4>[77362.436100]  xfs_submit_ioend+0x80/0x1a0 [xfs]
<4>[77362.436339]  xfs_vm_writepages+0xa8/0xd0 [xfs]
<4>[77362.436549]  do_writepages+0x17/0x70
<4>[77362.436692]  __filemap_fdatawrite_range+0xa7/0xe0
<4>[77362.436884]  filemap_write_and_wait_range+0x30/0x80
<4>[77362.437106]  xfs_setattr_size+0x121/0x380 [xfs]
<4>[77362.437313]  xfs_vn_setattr+0x6b/0x90 [xfs]
<4>[77362.437472]  notify_change+0x2da/0x400
<4>[77362.437663]  do_truncate+0x5b/0x90
<4>[77362.437842]  do_sys_ftruncate.constprop.18+0x10b/0x170
<4>[77362.438033]  ? SyS_write+0x66/0x90
<4>[77362.438251]  entry_SYSCALL_64_fastpath+0x20/0x83
<4>[77362.438429] RIP: 0033:0x7fe293e8bd87
<4>[77362.438586] RSP: 002b:00007fe28c71fb28 EFLAGS: 00000246
<4>[77362.438587] Code: ff ff 8b 07 4d 29 c8 4c 0f af c0 e9 27 ff ff =
ff 48 89 c8 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 06 =
31 d2 89 ff <48> f7 f7 48 89 c1 b8 b2 0c 00 00 48 81 f9 b2 0c 00 00 77 =
0f 48
=20
=20
quoted
Thanks,
Paolo
quoted
Paolo
=20
quoted
=20
quoted
Thanks a lot,
Paolo
quoted
=20
quoted
quoted
	update_thr_responsiveness_params(bfqd);
=20
reset_computation:
Keyboard shortcuts
hback out one level
jnext message in thread
kprevious message in thread
ldrill in
Escclose help / fold thread tree
?toggle this help