Re: [PATCH] bcache: Fix writeback_thread never writing back incomplete stripes.
From: Denis Bychkov <hidden>
Date: 2015-09-19 04:47:38
Also in:
lkml
Hi Kent, After running a day with this new version of your patch, I did not notice any problems, so I assume, it works. I'll keep an eye on it and report back if find anything bad. I believe, you finally fixed that long lasting bug with writeback threads spinning CPU. To Vojtech Pavlik: thank you for catching it! For a long time I had to run bcache with the partial_stripes_expensive branch disabled, which seriously affected its performance on my RAID-6. On Thu, Sep 17, 2015 at 2:31 PM, Kent Overstreet [off-list ref] wrote:
quoted hunk ↗ jump to hunk
On Thu, Sep 17, 2015 at 08:40:54AM -0800, Kent Overstreet wrote:quoted
On Thu, Sep 17, 2015 at 11:30:17AM -0400, Denis Bychkov wrote:quoted
Well, it turns out my celebration was a bit premature. PLEASE, DO NOT APPLY THE PATCH POSTED BY KENT (not the one Vojtech posted) ON A PRODUCTION SYSTEM, IT CAUSES DATA CORRUPTION. The interesting thing is that it somehow damaged the partition that was not supposed to receive any writes (the file system was mounted read-only), so my guess is that the patch causes the blocks residing in the write-back cache to flush to the wrong blocks on the backing device. Everything was going great until I rebooted and saw this in the log: [ 19.639082] attempt to access beyond end of device [ 19.643984] md1p2: rw=1, want=75497520, limit=62914560 [ 19.659033] attempt to access beyond end of device [ 19.663929] md1p2: rw=1, want=75497624, limit=62914560 [ 19.669447] attempt to access beyond end of device [ 19.674338] md1p2: rw=1, want=75497752, limit=62914560 [ 19.679195] attempt to access beyond end of device [ 19.679199] md1p2: rw=1, want=75498080, limit=62914560 [ 19.689007] attempt to access beyond end of device [ 19.689011] md1p2: rw=1, want=75563376, limit=62914560 [ 19.699055] attempt to access beyond end of device [ 19.699059] md1p2: rw=1, want=79691816, limit=62914560 [ 19.719246] attempt to access beyond end of device [ 19.724144] md1p2: rw=1, want=79691928, limit=62914560 ...... (it's a small example, the list was much longer) And the next thing I found out the super block on my 10-Tb XFS RAID was gone. :) Oh well, it's a good thing I have backups. I knew what I was doing when trying the untested patches. I should have made the RAID md partition read-only, not the file system. I kind of expected that something could have gone wrong with the file system I was testing, just did not expect it would fire nukes at the innocent bystanders.Aw, shit. That's just _bizzare_. I have a theory - it appears that last_scanned isn't getting initialized before it's used, so it's going to be all 0s the very first time... which it appears could cause it to slurp up keys from the wrong device (and if that device was bigger than the correct device, that could explain the accesses beyond the end of the device). Currently just a theory though, and I have no clue why it would only be exposed with my patch.Here's an updated patch that has a fix for _that_ theory, and also a new BUG_ON(). Any chance you could test it? Oh - I didn't ask - _do_ you have multiple backing devices attached to the same cache set? Because if you don't, this isn't it at all... -- >8 -- Subject: [PATCH] bcache: Change refill_dirty() to always scan entire disk if necessary Previously, it would only scan the entire disk if it was starting from the very start of the disk - i.e. if the previous scan got to the end. This was broken by refill_full_stripes(), which updates last_scanned so that refill_dirty was never triggering the searched_from_start path. But if we change refill_dirty() to always scan the entire disk if necessary, regardless of what last_scanned was, the code gets cleaner and we fix that bug too. Signed-off-by: Kent Overstreet <redacted> --- drivers/md/bcache/writeback.c | 37 ++++++++++++++++++++++++++++++------- 1 file changed, 30 insertions(+), 7 deletions(-)diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c index cdde0f32f0..d383024247 100644 --- a/drivers/md/bcache/writeback.c +++ b/drivers/md/bcache/writeback.c@@ -310,6 +310,10 @@ void bcache_dev_sectors_dirty_add(struct cache_set *c, unsigned inode, static bool dirty_pred(struct keybuf *buf, struct bkey *k) { + struct cached_dev *dc = container_of(buf, struct cached_dev, writeback_keys); + + BUG_ON(KEY_INODE(k) != dc->disk.id); + return KEY_DIRTY(k); }@@ -359,11 +363,24 @@ next: } } +/* + * Returns true if we scanned the entire disk + */ static bool refill_dirty(struct cached_dev *dc) { struct keybuf *buf = &dc->writeback_keys; + struct bkey start = KEY(dc->disk.id, 0, 0); struct bkey end = KEY(dc->disk.id, MAX_KEY_OFFSET, 0); - bool searched_from_start = false; + struct bkey start_pos; + + /* + * make sure keybuf pos is inside the range for this disk - at bringup + * we might not be attached yet so this disk's inode nr isn't + * initialized then + */ + if (bkey_cmp(&buf->last_scanned, &start) < 0 || + bkey_cmp(&buf->last_scanned, &end) > 0) + buf->last_scanned = start; if (dc->partial_stripes_expensive) { refill_full_stripes(dc);@@ -371,14 +388,20 @@ static bool refill_dirty(struct cached_dev *dc) return false; } - if (bkey_cmp(&buf->last_scanned, &end) >= 0) { - buf->last_scanned = KEY(dc->disk.id, 0, 0); - searched_from_start = true; - } - + start_pos = buf->last_scanned; bch_refill_keybuf(dc->disk.c, buf, &end, dirty_pred); - return bkey_cmp(&buf->last_scanned, &end) >= 0 && searched_from_start; + if (bkey_cmp(&buf->last_scanned, &end) < 0) + return false; + + /* + * If we get to the end start scanning again from the beginning, and + * only scan up to where we initially started scanning from: + */ + buf->last_scanned = start; + bch_refill_keybuf(dc->disk.c, buf, &start_pos, dirty_pred); + + return bkey_cmp(&buf->last_scanned, &start_pos) >= 0; } static void bch_writeback(struct cached_dev *dc) --2.5.1
-- Denis