Re: dup vs raid1 in single disk

From: Alejandro R. Mosteo <hidden>
Date: 2017-01-21 16:00:58

Thanks Austin and Roman for the interesting discussion.

Alex.

On 19/01/17 21:02, Austin S. Hemmelgarn wrote:

On 2017-01-19 13:23, Roman Mamedov wrote:

quoted

On Thu, 19 Jan 2017 17:39:37 +0100
"Alejandro R. Mosteo" [off-list ref] wrote:

quoted

I was wondering, from a point of view of data safety, if there is any
difference between using dup or making a raid1 from two partitions in
the same disk. This is thinking on having some protection against the
typical aging HDD that starts to have bad sectors.

RAID1 will write slower than DUP, as any optimization to make RAID1
devices work in parallel will cause a total performance disaster for
you: you will start trying to write to both partitions at the same
time, turning all linear writes into random ones, which are about two
orders of magnitude slower than linear on spinning hard drives. DUP
shouldn't have this issue, but it will still be half as fast as single,
since you are writing everything twice.

As of right now, there will actually be near zero impact on write
performance (or at least, it's way less than the theoretical 50%)
because there really isn't any optimization to speak of in the
multi-device code.  That will hopefully change over time, but it's not
likely to do so any time soon since nobody appears to be working on
multi-device write performance.

quoted

You could consider DUP data for when a disk is already known to be
getting bad sectors from time to time -- but then it's a fringe
exercise to try and keep using such a disk in the first place. Yeah,
with DUP data and DUP metadata you can likely get some more life out
of such a disk as throwaway storage space for non-essential data, at
half capacity, but is it worth the effort, when it's likely to start
failing progressively worse over time?

In all other cases the performance and storage space penalty of DUP
within a single device are way too great (and gained redundancy is too
low) compared to a proper system of single profile data + backups, or
a RAID5/6 system (not Btrfs-based) + backups.

That really depends on your usage.  In my case, I run DUP data on 
single disks regularly.  I still do backups of course, but the 
performance is worth far less for me (especially in the cases where 
I'm using NVMe SSD's which have performance measured in thousands of 
MB/s for both reads and writes) than the ability to recover from 
transient data corruption without needing to go to a backup.

As long as /home and any other write heavy directories are on a 
separate partition, I would actually advocate using DUP data on your 
root filesystem if you can afford the space simply because it's a 
whole lot easier to recover other data if the root filesystem still 
works.  Most of the root filesystem except some stuff under /var 
follows a WORM access pattern, and even the stuff that doesn't in /var 
is usually not performance critical, so the write performance penalty 
won't have anywhere near as much impact on how well the system runs as 
you might think.

There's also the fact that you're writing more metadata than data most 
of the time unless you're dealing with really big files, and metadata 
is already DUP mode (unless you are using an SSD), so the performance 
hit isn't 50%, it's actually a bit more than half the ratio of data 
writes to metadata writes.
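
To put rough numbers on that, here is a minimal sketch of one way to model
the penalty. It is not the exact formula from the paragraph above: it
assumes the device is purely bandwidth-bound, that metadata is already
written twice, and that switching data from single to DUP simply doubles
the data bytes. The function name and the sample workloads are made up for
illustration.

```python
def dup_data_slowdown(data_bytes: float, metadata_bytes: float) -> float:
    """Fraction of write throughput lost when data goes from single to DUP.

    Assumes metadata is already DUP and that write time is proportional
    to the number of bytes that physically hit the device.
    """
    before = data_bytes + 2 * metadata_bytes      # single data, DUP metadata
    after = 2 * data_bytes + 2 * metadata_bytes   # DUP data, DUP metadata
    if after == 0:
        return 0.0
    return 1 - before / after


# Hypothetical workloads: (data bytes, metadata bytes) in arbitrary units.
only_data = dup_data_slowdown(1, 0)        # huge files, negligible metadata
balanced = dup_data_slowdown(1, 1)         # equal data and metadata
metadata_heavy = dup_data_slowdown(1, 3)   # many small files

for name, loss in [("only data", only_data),
                   ("balanced", balanced),
                   ("metadata heavy", metadata_heavy)]:
    print(f"{name:15s} {loss:.1%} throughput lost")
```

Under this model the 50% figure is only reached when metadata writes are
negligible; the more metadata-heavy the workload, the smaller the hit.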

quoted

On a related note, I see this caveat about dup in the manpage:

"For example, a SSD drive can remap the blocks internally to a single
copy thus deduplicating them. This negates the purpose of increased
redunancy (sic) and just wastes space"

That ability is vastly overestimated in the man page. There is no
miracle content-addressable storage system working at 500 MB/sec
speeds all within a little cheap controller on SSDs. Likely most of
what it can do, is just compress simple stuff, such as runs of zeroes
or other repeating byte sequences.

Most of those that do in-line compression don't implement it in 
firmware, they implement it in hardware, and even DEFLATE can get 500 
MB/second speeds if properly implemented in hardware.  The firmware 
may control how the hardware works, but it's usually hardware doing 
heavy lifting in that case, and getting a good ASIC made that can hit 
the required performance point for a reasonable compression algorithm 
like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI 
work.

quoted

And the DUP mode is still useful on SSDs, for cases when one copy of
the DUP gets corrupted in-flight due to a bad controller or RAM or
cable, you could then restore that block from its good-CRC DUP copy.
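
The recovery idea described here can be sketched in a few lines. This is an
illustration only, not how Btrfs is implemented: zlib.crc32 stands in for
the filesystem's checksum, the 4 KiB block size is an assumption, and
read_with_dup is a hypothetical helper.

```python
import zlib

BLOCK_SIZE = 4096  # assumption: 4 KiB filesystem block


def checksum(block: bytes) -> int:
    # Stand-in checksum; the real filesystem uses its own algorithm.
    return zlib.crc32(block)


def read_with_dup(copies, expected_crc):
    """Return the first copy whose checksum matches, or None if all are bad."""
    for copy in copies:
        if checksum(copy) == expected_crc:
            return copy
    return None


original = bytes(range(256)) * (BLOCK_SIZE // 256)
stored_crc = checksum(original)

# Simulate one copy being corrupted in flight: flip a single bit.
damaged = bytearray(original)
damaged[100] ^= 0x01
damaged = bytes(damaged)

# One bad copy and one good copy: the good one is returned.
recovered = read_with_dup([damaged, original], stored_crc)

# Both copies bad: nothing can be recovered from this device alone.
both_bad = read_with_dup([damaged, damaged], stored_crc)

print("recovered from second copy:", recovered == original)
print("recoverable when both bad: ", both_bad is not None)
```

The second case is the one the reply below is concerned with: if the
corruption happens before either copy is written, both copies carry it.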

The only window of time during which bad RAM could result in only one 
copy of a block being bad is after the first copy is written but 
before the second is, which is usually an insanely small amount of 
time.  As far as the cabling, the window for errors resulting in a 
single bad copy of a block is pretty much the same as for RAM, and if 
they're persistently bad, you're more likely to lose data for other 
reasons.

That said, I do still feel that DUP mode has value on SSD's.  The 
primary arguments against it are:
1. It wears out the SSD faster.
2. The blocks are likely to end up in the same erase block, and 
therefore there will be no benefit.

The first argument is accurate, but not usually an issue for most 
people.  Average life expectancy for a decent SSD is well over 10 
years, which is more than twice the usual life expectancy for a 
consumer hard drive.  Putting it in further perspective, the 575GB 
SSD's have been running essentially 24/7 for the past year and a half 
(13112 hours powered on now), and have seen just short of 25.7TB of 
writes over that time.  This equates to roughly 2GB/hour, which is 
well within typical desktop usage.  It also means they've seen more 
than 44.5 times their total capacity in writes.  Despite this, the 
wear-out indicators all show that I can still expect at least 9 years 
more of run-time on these.  Normalizing that, that means I'm likely to 
see between 8 and 12 years of life on these.  Equivalent stats for the 
HDD's I used to use (NAS rated Seagate drives) gave me a roughly 3-5 
year life expectancy, less than half that of the SSD.  In both cases 
however, you're talking well beyond the typical life expectancy of 
anything short of a server or a tightly embedded system, and worrying 
about a 4-year versus 8-year life expectancy on your storage device is 
kind of pointless when you need to upgrade the rest of the system in 3 
years.
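
The arithmetic behind those figures can be checked with a short script. The
three input numbers are the ones quoted above; the only assumption is that
they use decimal units (1 TB = 1000 GB).

```python
# Figures quoted in the paragraph above.
capacity_gb = 575      # SSD capacity
hours_on = 13112       # power-on hours
written_tb = 25.7      # total writes reported by the drive

# Assumption: decimal units, 1 TB = 1000 GB.
written_gb = written_tb * 1000

gb_per_hour = written_gb / hours_on
capacity_multiple = written_gb / capacity_gb
years_on = hours_on / (24 * 365)

print(f"write rate:       {gb_per_hour:.2f} GB/hour")
print(f"capacity written: {capacity_multiple:.1f}x")
print(f"powered on:       {years_on:.2f} years")
```

This gives just under 2 GB/hour, a bit more than 44.5 times the capacity,
and about a year and a half of power-on time, matching the numbers in the
text.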

As far as the second argument against it, that one is partially 
correct, but ignores an important factor that many people who don't do 
hardware design (and some who do) don't often consider. The close 
temporal proximity of the writes for each copy is likely to mean they 
end up in the same erase block on the SSD (especially if the SSD has a 
large write cache).  However, that doesn't mean that one getting 
corrupted due to device failure is guaranteed to corrupt the other.  
The reason for this is exactly the same reason that single word errors 
in RAM are exponentially more common than losing a whole chip or the 
whole memory module: The primary error source is environmental noise 
(EMI, cosmic rays, quantum interference, background radiation, etc), 
not system failure.  In other words, you're far more likely to lose a 
single cell (which is usually not more than a single byte in the MLC 
flash that gets used in most modern SSD's) in the erase block than the 
whole erase block.  In that event, you obviously have only got 
corruption in the particular filesystem block that that particular 
cell was storing data for.
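
As a rough illustration of that point, the sketch below estimates how
likely it is that both copies of one filesystem block are damaged when a
few cells fail inside a shared erase block. Everything about the model is
an assumption: failures are independent, each one damages exactly one
filesystem block chosen uniformly at random, and the erase block holds 512
filesystem blocks (for example 2 MiB / 4 KiB). Real flash geometry and
failure behaviour vary.

```python
def p_both_copies_hit(cell_failures: int, blocks_per_erase_block: int) -> float:
    """Probability that both copies of one block are damaged.

    Inclusion-exclusion over the two copies:
    P(both hit) = 1 - P(A missed) - P(B missed) + P(A and B both missed).
    """
    b = blocks_per_erase_block
    n = cell_failures
    missed_one = (1 - 1 / b) ** n    # a given copy is never hit
    missed_both = (1 - 2 / b) ** n   # neither copy is ever hit
    return 1 - 2 * missed_one + missed_both


BLOCKS = 512  # hypothetical: 2 MiB erase block / 4 KiB filesystem block

one_failure = p_both_copies_hit(1, BLOCKS)
two_failures = p_both_copies_hit(2, BLOCKS)

for n in (1, 2, 10):
    print(f"{n:2d} cell failure(s): P(both copies bad) = "
          f"{p_both_copies_hit(n, BLOCKS):.2e}")
```

With a single failed cell the probability is zero by construction, since
one cell can only belong to one copy; it stays small for a handful of
independent failures, which is the case the paragraph above describes.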

There's also a third argument for not using DUP on SSD's however:
The SSD already does most of the data integrity work itself.
This is only true of good SSD's, but many do have some degree of 
built-in erasure coding in the firmware which can handle losing large 
chunks of an erase block and still return the data safely. This is 
part of the reason that you almost never see nice power-of-two sizes 
for flash storage despite flash chips themselves being made that way 
(the other part is the spare blocks). Depending on the degree of 
protection provided by this erasure coding, it can actually cancel out 
my argument against argument 2.  In all practicality though, that 
requires you to actually trust the SSD manufacturer to have 
implemented things properly for it to be a valid counter-argument, and 
most people who would care enough about data integrity to use BTRFS 
for that reason are not likely to trust the storage device that much.
-- 
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
