
Has anyone seen an HDD with bit error rate lower than 10^-15 ?

I've looked at the latest WD Red Pros and WD Ultrastars and they all have a bit error rate of 10^-15, even though they're like 10TB = 10^13 bytes * 8 =~ 10^14 bits big...

so like, 10% chance a RAID rebuild will fail? That's huge...
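In case anyone wants to sanity-check that figure, here's a rough back-of-the-envelope in Python, assuming the spec means one unrecoverable error per 10^15 bits read and that errors are independent (both simplifications):

```python
# Rough check of the ~10% claim.
# Assumptions: the spec means 1 unrecoverable read error per 1e15
# bits read, errors are independent, and the whole drive gets read
# once during a rebuild.
drive_bits = 10e12 * 8          # 10 TB = ~8e13 bits
ure_per_bit = 1e-15             # unrecoverable read errors per bit read

expected_ures = drive_bits * ure_per_bit
print(expected_ures)            # ~0.08, i.e. roughly a 1-in-12 chance of
                                # hitting at least one URE on a full read
```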

@wolf480pl Well this is probably why stuff like ZFS has checksums.

@lanodan are they error-correcting codes, or just error-detecting?

@lanodan well, those can only turn silent data corruption into an unrecoverable read error.

An unrecoverable read error is exactly the case I'm already worried about (HDDs have built-in per-sector checksums).

If you have a fully operational RAID1 and you hit a URE, you just read that sector from the other disk and overwrite the faulty one, hoping the HDD reallocates it.

But that requires (1) stumbling upon the URE while the array is still non-degraded, and (2) the HDD actually reallocating the sector.
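Just to make the mechanism concrete (md and ZFS do this for you automatically during a scrub/repair), here's a minimal sketch of that fix-up, assuming two mirrored block devices and an already-known bad sector; the device paths, sector size, and sector number are made up for illustration:

```python
import os

SECTOR = 4096                    # assumed physical sector size
BAD_LBA = 123456                 # hypothetical sector that returned a URE
GOOD_DEV, BAD_DEV = "/dev/sdb", "/dev/sdc"   # hypothetical mirror members

# Read the sector from the healthy mirror member...
fd = os.open(GOOD_DEV, os.O_RDONLY)
data = os.pread(fd, SECTOR, BAD_LBA * SECTOR)
os.close(fd)

# ...and rewrite it on the drive that reported the URE.  The write
# gives the drive a chance to reallocate the sector from its spare
# pool; if the write itself fails, the drive should be marked failed
# and replaced.
fd = os.open(BAD_DEV, os.O_WRONLY)
os.pwrite(fd, data, BAD_LBA * SECTOR)
os.close(fd)
```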

@lanodan If a bad sector develops somewhere you're not reading, neither the HDD's nor ZFS's checksums will tell you about it.
A patrol read could probably help, but idk.

Then at some later time you hit a URE on something you do read, get a write error when trying to fix it, mark the drive as failed, and replace it, right? But if the earlier bad sector you never detected happens to be on the other drive, you'll only find it during the rebuild, when you have just one copy of the data left.

@lanodan 3-disk RAID1, RAID6, etc. should be a solution for that though.

@wolf480pl Yeah, at some point it's just fighting the forces of chaos.

@lanodan well, when rebuilding onto a 3rd drive from 2 "good" ones, the rebuild only fails when the same sector number is bad on both "good" drives, which is a quadratically smaller probability. So for a 3x10TB RAID1 you'd get a ~3x10^-13 chance of a failed rebuild (arithmetic in the sketch below). Which sounds pretty damn good to me. But then you're only using 1/3 of your raw capacity, which sucks.

I wonder what the numbers are for 4-drive RAID6
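For reference, the arithmetic behind that ~3x10^-13 for the 3-way RAID1, assuming 512-byte sectors and independent per-bit errors:

```python
# Chance that a 3-way RAID1 rebuild fails, i.e. that the *same*
# sector is unreadable on both remaining "good" 10 TB drives.
# Assumptions: 512-byte sectors, independent per-bit errors,
# URE rate of 1e-15 per bit.
ure_per_bit = 1e-15
p_sector = 512 * 8 * ure_per_bit        # ~4e-12 per sector per drive
sectors = 10e12 / 512                   # ~2e10 sectors on a 10 TB drive

p_rebuild_fail = sectors * p_sector**2  # same sector bad on both drives
print(p_rebuild_fail)                   # ~3e-13
```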

@lanodan ok, for a 4-drive RAID6 it's just a 3x higher chance, because out of the 3 remaining sectors, 2 need to fail for the rebuild to fail.
So a ~10^-12 chance of a failed rebuild, and you're using half of the raw capacity.
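Same back-of-the-envelope for that case: after one of the 4 drives dies, 3 remain, and a stripe is lost only if 2 of those 3 sectors are unreadable, which can happen in C(3,2) = 3 ways (same assumptions as above):

```python
from math import comb

ure_per_bit = 1e-15
p_sector = 512 * 8 * ure_per_bit        # ~4e-12, as before
stripes = 10e12 / 512                   # one sector per stripe per drive

# After one drive fails, 3 drives remain; a stripe becomes
# unrecoverable if any 2 of its 3 remaining sectors fail.
p_rebuild_fail = stripes * comb(3, 2) * p_sector**2
print(p_rebuild_fail)                   # ~1e-12
```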

@wolf480pl @lanodan ZFS provides periodic scrubs for that case, going over all written data and verifying the checksums to correct the errors on the data that has not been read recently (which would get auto-corrected then).

@mbernabe @lanodan so a patrol read? Yeah, I said earlier in the thread that it'd probably help.
