Re: Fwd: [PATCH] bcache: PI controller for writeback rate V2

From: Coly Li <hidden>
Date: 2017-09-08 17:17:38

On 2017/9/9 12:42 AM, Michael Lyle wrote:

[sorry for resend, I am apparently not good at reply-all in gmail :P ]

On Thu, Sep 7, 2017 at 10:52 PM, Coly Li [off-list ref] wrote:
[snip history]

quoted

writeback_rate_minimum & writeback_rate are both readable/writable, and
writeback_rate_minimum should be less than or equal to writeback_rate if I
understand correctly.

No, this is not true.  writeback_rate is writable, but the control
system replaces it at 5 second intervals.  This is the same as current
code.  If you want writeback_rate to do something as a tunable, you
should set writeback_percent to 0, which disables the control system
and lets you set your own value-- otherwise whatever change you make
is replaced in 5 seconds.

writeback_rate_minimum is for use cases when you want to force
writeback_rate to occur faster than the control system would choose on
its own.  That is, imagine you have an intermittent, write-heavy
workload, and when the system is idle you want to clear out the dirty
blocks.  The default rate of 1 sector per second would do this very
slowly-- instead you could pick a value that is a small percentage of
disk bandwidth (preserving latency characteristics) but still fast
enough to leave dirty space available.

quoted

I feel a check should be added here to make sure
writeback_rate_minimum <= writeback_rate when they are set through sysfs.

You usually (not always) will actually want to set
writeback_rate_minimum to faster than writeback_rate, to speed up the
current writeback rate.

This assumption is not always correct. If heavy front-end I/O arrives
every "writeback_rate_update_seconds" seconds, the writeback rate just
rises to a high number, which may hurt the latency of the front-end
I/Os.

It may not be exactly "writeback_rate_update_seconds" seconds; this is
just an example of some kind of "interesting" I/O pattern to show that a
higher writeback_rate_minimum may not always be helpful.

quoted

+     if ((error < 0 && dc->writeback_rate_integral > 0) ||
+         (error > 0 && time_before64(local_clock(),
+                      dc->writeback_rate.next + NSEC_PER_MSEC))) {
+             /* Only decrease the integral term if it's more than
+              * zero.  Only increase the integral term if the device
+              * is keeping up.  (Don't wind up the integral
+              * ineffectively in either case).
+              *
+              * It's necessary to scale this by
+              * writeback_rate_update_seconds to keep the integral
+              * term dimensioned properly.
+              */
+             dc->writeback_rate_integral += error *
+                     dc->writeback_rate_update_seconds;

I am not sure whether it is correct to calculate an integral value here.
error here is not a per-second value, it is already an accumulated result
over the past "writeback_rate_update_seconds" seconds, so what does
"error * dc->writeback_rate_update_seconds" mean?

I know you are calculating an integral value of error here, but until I
understand why you use "error * dc->writeback_rate_update_seconds", I am
not able to say whether it is good or not.

The calculation occurs every writeback_rate_update_seconds.  An
integral is the area under a curve.

If the error is currently 1, and has been 1 for the past 5 seconds,
the integral increases by 1 * 5 seconds.  There are two approaches
used in numerical integration, a "rectangular integration" (which this
is, assuming the value has held for the last 5 seconds), and a
"triangular integration", where the old value and the new value are
averaged and multiplied by the measurement interval.  It
doesn't really make a difference-- the triangular integration tends to
come up with a slightly more accurate value but adds some delay.  (In
this case, the integral has a time constant of thousands of
seconds...)
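
A minimal userspace sketch of the two integration schemes described
above, not code from the patch. The function names are hypothetical. The
averaging scheme, called "triangular" above, is more commonly known as
the trapezoidal rule.

```c
#include <assert.h>
#include <stdint.h>

/* Rectangular rule: treat the newest error as having held for the
 * whole update interval. */
static int64_t integrate_rect(int64_t integral, int64_t error_now,
			      int64_t interval_secs)
{
	assert(interval_secs > 0);
	return integral + error_now * interval_secs;
}

/* Trapezoidal rule: average the previous and newest error, then
 * multiply by the update interval. */
static int64_t integrate_trap(int64_t integral, int64_t error_prev,
			      int64_t error_now, int64_t interval_secs)
{
	assert(interval_secs > 0);
	return integral + (error_prev + error_now) * interval_secs / 2;
}
```

When the error is constant across the interval (1 for 5 seconds), both
rules add the same amount, 5. They differ only when the error changes
between updates.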

Hmm, imagine we have a per-second sampling, and the data is:

   time point       dirty data (MB)
	1		1
	1		1
	1		1
	1		1
	1		10

Then a more accurate integral result should be: 1+1+1+1+10 = 14. But by
your "rectangular integration" the result will be 10*5 = 50.

Correct me if I am wrong, IMHO 14:50 is a big difference.
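
The arithmetic behind the 14 vs. 50 comparison can be sketched as below.
This is only an illustration with hypothetical helper names. It uses the
five dirty-data samples from the table directly, whereas the patch
integrates the error relative to the target rather than raw dirty data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-second integration: each sample is weighted by one second. */
static int64_t sum_per_second(const int64_t *samples, size_t n)
{
	int64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += samples[i];
	return total;
}

/* One rectangular step over the whole window: only the newest sample
 * is seen, and it is assumed to have held for all n seconds. */
static int64_t rect_last_sample(const int64_t *samples, size_t n)
{
	assert(n > 0);
	return samples[n - 1] * (int64_t)n;
}
```

With the samples {1, 1, 1, 1, 10}, the first function gives 14 and the
second gives 50.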

quoted

In my current understanding, the effect of the above calculation is to
make a derivative value writeback_rate_update_seconds times larger.
So it is expected to be faster than the current PD controller.

The purpose of the proportional term is to respond immediately to how
full the buffer is (this isn't a derivative value).

If we consider just the proportional term alone, with its default
value of 40, and the user starts writing 1000 sectors/second...
eventually error will reach 40,000, which will cause us to write 1000
blocks per second and be in equilibrium-- but the amount filled with
dirty data will be off by 40,000 blocks from the user's calibrated
value.  The integral term works to take a long term average of the
error and adjust the write rate, to bring the value back precisely to
its setpoint-- and to allow a good writeback rate to be chosen for
intermittent loads faster than its time constant.
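
The proportional-only equilibrium described above can be checked with a
small simulation. This is a sketch under simplifying assumptions: one
update per second, integer arithmetic, and a hypothetical function name.
It does not model the actual 5 second update interval or any rate
clamping.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simulate a proportional-only controller in one-second steps.
 *   inflow:    units dirtied per second by the user
 *   p_inverse: writeback rate = error / p_inverse
 * Returns the error (dirty data above the setpoint) after `steps` seconds.
 */
static int64_t p_only_error_after(int64_t inflow, int64_t p_inverse,
				  int steps)
{
	int64_t error = 0;

	assert(p_inverse > 0);
	for (int i = 0; i < steps; i++) {
		int64_t rate = error / p_inverse; /* proportional term only */

		error += inflow - rate; /* dirty data grows by the shortfall */
	}
	return error;
}
```

With an inflow of 1000 per second and p_inverse = 40, the error settles
at 40,000, where the writeback rate equals the inflow. The offset from
the setpoint never goes away on its own, which is the gap the integral
term is meant to close.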

quoted

I see that 5 sectors/second is faster than 1 sector/second; is there any
other benefit to changing 1 to 5?

We can set this back to 1 if you want.  It is still almost nothing,
and in practice more will be written in most cases (the scheduling
targeting writing 1/second usually has to write more).

1 is the minimum writeback rate: even when there is heavy front-end I/O,
bcache still tries to write back at 1 sector/second. Let's keep it at 1,
to give the maximum bandwidth to front-end I/Os for better latency and
throughput.

quoted

+     dc->writeback_rate_p_term_inverse = 40;
+     dc->writeback_rate_i_term_inverse = 10000;

How were the above values selected? Could you explain the calculation
behind them?

Sure.  40 is to try and write at a rate to retire the current blocks
at 40 seconds.  It's the "fast" part of the control system, and needs
to not be too fast to not overreact to single writes.  (e.g. if the
system is quiet, and at the setpoint, and the user writes 4000 blocks
once, the P controller will try and write at an initial rate of 100
blocks/second).  The i term is more complicated-- I made it very slow.
It should usually be more than the p term squared * the calculation
interval for stability, but there may be some circumstances when you
want its control to be more effective than this.  The lower the i term
is, the quicker the system will come back to the setpoint, but the
more potential there is for overshoot (moving past the setpoint) and
oscillation.

To take a numerical example with the case above, where the P term
would end up off by 40,000 blocks, each 5 second update the I
controller would be increasing the rate by 20 blocks/second initially
to bring that 40,000 block offset under control.
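
The numbers in this explanation can be reproduced with a simplified
userspace sketch. The struct, field, and function names here are
hypothetical. It assumes the rate is error / p_term_inverse plus
integral / i_term_inverse, and it leaves out the anti-windup conditions
and any minimum-rate clamp from the patch.

```c
#include <assert.h>
#include <stdint.h>

struct pi_state {
	int64_t integral;       /* accumulated error * seconds */
	int64_t p_term_inverse; /* 40 in the patch */
	int64_t i_term_inverse; /* 10000 in the patch */
	int64_t update_seconds; /* 5 in this thread's examples */
};

/* One controller update; returns the requested rate in blocks/second. */
static int64_t pi_update(struct pi_state *s, int64_t error)
{
	assert(s->p_term_inverse > 0 && s->i_term_inverse > 0);
	s->integral += error * s->update_seconds;
	return error / s->p_term_inverse + s->integral / s->i_term_inverse;
}

/* Rule of thumb quoted above: i_term_inverse should usually exceed
 * p_term_inverse squared times the calculation interval. */
static int64_t i_inverse_rule_of_thumb(int64_t p_term_inverse,
				       int64_t update_seconds)
{
	return p_term_inverse * p_term_inverse * update_seconds;
}
```

With an error of 40,000 held across updates, the P part contributes 1000
blocks/second and the I part adds 20 blocks/second per update (1020,
then 1040, and so on). The rule of thumb gives 40 * 40 * 5 = 8000, which
the chosen value of 10000 exceeds.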

Oh, I see.

It seems what we need is just benchmark numbers for the latency
distribution. If no data exists yet, I will collect a data set
myself. I can arrange to start the test by the end of this month; right
now I don't have continuous access to powerful hardware.

Thanks for the above information :-)

-- 
Coly Li
