Thread: 18 messages, 5 authors, 2014-08-06

Re: On URE and RAID rebuild - again!

From: NeilBrown <hidden>
Date: 2014-08-04 23:29:51

On Tue, 05 Aug 2014 00:44:04 +0200 Gionatan Danti [off-list ref] wrote:
On 2014-08-04 20:40 Mikael Abrahamsson wrote:
Why do you think that's wrong? 10^-14 is what the vendor guarantees. I
have had drives with worse performance (after a couple of months I had
several UNC sectors without reading much).

Your claim about the article being wrong is the same as saying that
the risk reported of getting into a car accident is wrong because
you've driven that amount of kilometers but haven't been in an
accident yet.

This is statistics, marketing and warranty, not guaranteed behavior.
Yes, I understand this. However, the linked article (and many others) 
state:
"If you have a 2TB drive, you write 2TB to it, and then you fully read 
that, just over 6 times, then you will run into one read error, 
theoretically speaking."
This statement is wrong, and doesn't even make any sense.  It displays a deep
misunderstanding of probability (the same deep misunderstanding that leads
people to buy lottery tickets).
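
To put a number on that objection, here is a minimal sketch. It assumes the
quoted 10^-14 figure is an independent per-bit failure probability (a
modelling assumption, not something the vendor spec states) and that
1 TB = 10^12 bytes. The function names are hypothetical. Under that model,
six full reads of a 2 TB drive give an expected error count close to one,
but the chance of seeing at least one error is only around 60%, not a
certainty.

```python
import math

URE_PER_BIT = 1e-14  # assumed: independent per-bit read failure probability


def expected_ures(bytes_read, p_bit=URE_PER_BIT):
    """Expected number of unrecoverable read errors over bytes_read bytes."""
    return bytes_read * 8 * p_bit


def p_at_least_one_ure(bytes_read, p_bit=URE_PER_BIT):
    """Probability of one or more UREs, i.e. 1 - (1 - p_bit)**bits.

    log1p/expm1 avoid the precision loss of computing 1 - 1e-14 directly.
    """
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-p_bit))


six_reads_of_2tb = 6 * 2 * 10**12  # bytes
print("expected UREs:", expected_ures(six_reads_of_2tb))
print("P(at least one URE):", p_at_least_one_ure(six_reads_of_2tb))
```

An expectation near one is not the same as "you will run into one read
error"; even under this idealised model a substantial share of such runs
would complete without any error.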
I read my 500 GB drive over _60_ times, reading 3x more total data than 
stated above.

I started the entire discussion to learn how UREs are calculated, trying 
to understand if they are expressed as a probability ("a 1 in 10^14 
chance that we cannot read a sector") or a statistical record ("we found 
that 1 in 10^14 is not readable").
Probabilities are often calculated by examining a statistical record - the
two concepts are not separate.
There is probably some theoretical analysis, some statistical analysis, some
marketing and maybe even some actuarial analysis that goes into the quoted
figure.  I remember when CPU speed was measured in "MIPS".
This stood for 
   Meaningless Indicators of Performance for Salesmen

URE rate numbers are probably equally trustworthy.
If defined as a probability, I am very lucky: if my math is OK, I should 
have had only about a 5% chance of reading about 40 TB of data without 
an error (my math is: (1-(1/10^14))^(3*(10^14))). If, on the other hand, 
UREs are defined as statistical evidence (like MTBF), environment and test 
conditions (e.g. duty cycle, read/write distribution, etc.) are absolutely 
critical to understanding what this parameter really means for us.
The probability number doesn't tell you much at all about your drive.
Your drive probably works much better than the quoted rate, but could be much
worse.
The quoted number might say something useful about a collection of 10,000
drives, but if you can afford those, you can probably afford a competent
statistician to explain the details too.
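
As a rough check of the expression quoted above, here is a minimal sketch
that evaluates (1-(1/10^14))^(3*(10^14)) and then applies the result to a
fleet. It assumes independent per-bit errors at exactly the quoted rate,
which is the very assumption being questioned in this thread; the function
name and the fleet size are illustrative only. The expression comes out
close to e^-3, which is about 5%.

```python
import math


def p_no_ure(bits_read, p_bit=1e-14):
    """Probability of reading bits_read bits with no URE at all.

    Assumes independent per-bit errors with probability p_bit.
    Computed as exp(bits * log(1 - p_bit)) to stay numerically stable.
    """
    return math.exp(bits_read * math.log1p(-p_bit))


bits = 3 * 10**14            # about 37.5 TB, the "about 40 TB" above
p_clean = p_no_ure(bits)
print(f"P(no URE over {bits:.0e} bits) = {p_clean:.4f}")  # close to e**-3

drives = 10_000              # hypothetical fleet, each drive reading `bits`
print("drives expected to see no URE:", round(drives * p_clean))
```

For a single drive this says very little, since that drive's real error
rate is unknown. Averaged over thousands of drives the figure becomes
something that could at least be compared against observed error counts.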

I'm under the impression (and maybe I'm wrong, as usual :)) that UREs mainly 
depend on incomplete writes and/or unstable sectors. If this is the 
case, maybe the published URE values are related to the entire HDD 
warranty. In other words, they should be read as "in normal conditions, 
with typical loads, our HDD will exhibit about 1/10^14 unrecoverable 
errors during the entire disk lifespan".
I'm not an electro-magnetic engineer, but I would guess that UREs are caused
by some combination of:
 - irregularities in the physical media
 - imperfections in positioning of the write head
 - fluctuations in temperature and pressure which could
   affect precise performance of resistors and capacitors etc.

and probably various quantum effects that I know nothing about.

Maybe most UREs come from a speck of dust that was in the wrong place at the
wrong time.

I think a better summary would be:
  in normal conditions and typical loads, a collection of 10^14 drives will
  exhibit errors somewhere in the collection on a regular basis.
Is it reasonable? Or am I horribly wrong?
Regards.
NeilBrown
