It is 2010 and RAID5 still works …

Some years ago (2007, 2008), when I cared a little more about things like RAID and RAID recovery, I read an article on ZDNet by Robin Harris that made the case that disk capacity increases, coupled with an almost invariant URE (Unrecoverable Read Error) rate, meant that RAID5 would be dead in 2009. A follow-on article appeared recently, also by Robin Harris, that extends the same logic and claims that RAID6 will stop working in 2019.

The crux of the argument is this. As disk drives have become larger and larger (approximately doubling in capacity every two years), the URE rate has not improved at the same pace. The URE rate measures how frequently an Unrecoverable Read Error occurs and is typically expressed in errors per bit read. For example, a URE rate of 1E-14 (10^-14) implies that, statistically, an unrecoverable read error occurs once in every 1E14 bits read (1E14 bits = 1.25E13 bytes, or approximately 12.5 TB).
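
To make that arithmetic concrete, here is a quick Python sketch of the conversion; the 1E-14 rate is just the example figure from the paragraph above, and the rest is unit conversion.

```python
# Convert a URE rate of 1E-14 errors per bit read into the expected
# volume of data read per unrecoverable read error.
ure_rate = 1e-14                     # errors per bit read
bits_per_ure = 1 / ure_rate          # ~1E14 bits read per error, statistically
bytes_per_ure = bits_per_ure / 8     # 1.25E13 bytes
terabytes_per_ure = bytes_per_ure / 1e12

print(f"Roughly one URE per {terabytes_per_ure:.1f} TB read")   # ~12.5 TB
```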

Further, Robin argues: a RAID array (RAID5 or RAID6) is running normally when a drive suffers a catastrophic failure that prompts a reconstruction from parity. In that scenario, it is perfectly conceivable that while reading the (N-1) surviving drives (data and parity) to rebuild the failed drive, a single URE occurs. That URE would render the RAID volume failed.

The argument is that as disk capacities grow while the URE rate does not improve at the same pace, the probability of a RAID5 rebuild failure increases over time. Statistically, he shows that by 2009 disk capacities would have grown enough that RAID5 no longer makes sense for any array of meaningful size.
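
A rough back-of-the-envelope model makes the trend visible. The sketch below is my own simplification, not Robin's exact calculation: it treats every bit read during the rebuild as an independent trial at a constant URE rate, ignores SMART, scrubbing, and sector remapping, and uses an array size I picked purely for illustration.

```python
# Back-of-the-envelope: probability that a RAID5 rebuild completes without
# hitting a URE, assuming independent bit errors at a constant URE rate.
def rebuild_success_probability(drive_tb, surviving_drives, ure_rate=1e-14):
    bits_to_read = drive_tb * 1e12 * 8 * surviving_drives
    return (1 - ure_rate) ** bits_to_read

# Example: a 7-drive RAID5 set, so 6 surviving drives are read end to end.
for size_tb in (0.5, 1, 2, 4):
    p = rebuild_success_probability(size_tb, surviving_drives=6)
    print(f"{size_tb} TB drives: {p:.0%} chance of a clean rebuild")
```

With these illustrative numbers, the clean-rebuild probability drops from roughly 79% with 0.5 TB drives to under 40% with 2 TB drives, which is the heart of the "RAID5 is dead" argument.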

So, in 2007 he wrote:

RAID 5 protects against a single disk failure. You can recover all your data if a single disk breaks. The problem: once a disk breaks, there is another increasingly common failure lurking. And in 2009 it is highly certain it will find you.

and in 2009, he wrote:

SATA RAID 6 will stop being reliable sooner unless drive vendors get their game on. More good news: one of them already has.

The logic proposed is accurate but, IMHO, incomplete. One important aspect the analysis fails to account for is something RAID vendors have already been doing for many years now.

[Image of an early disk drive, courtesy of http://www.computer-history.info]

When disk drives looked like the one pictured above, the predominant failure mode was catastrophic failure: drives either worked or they didn't. At some level, that reflected the fact that the Drive Permanent Failure (DPF) frequency was significantly higher than the URE frequency, so the only failure mode anyone observed was catastrophic failure.

As drives got bigger, and certainly in 1988 when Patterson and others first proposed the notion of RAID, it made perfect sense to wait for a DPF and then begin drive reconstruction. The probability of a URE was so low (given drive capacities) that all you had to worry about was the rebuild time and the degraded performance during the rebuild (as I/Os might have to be satisfied through block reconstruction).

But that isn't how most RAID controllers today deal with drive UREs and drive failures. On the contrary, for some time now, RAID controllers (at least the recent ones I've read about) have used better methods to decide when to perform the rebuild.

[Image: a 5400 RPM SATA drive]

Consider this alternative, which I know to be used by at least a couple of array vendors. When a drive in a RAID volume reports a URE, the array controller increments a counter and satisfies the I/O by rebuilding the block from parity. It then rewrites the block on the disk that reported the URE (potentially with a verify); if the sector is bad, the drive's microcode will remap it and all will be well.

When the counter exceeds some threshold, and while the disk that reported the URE is still in a usable condition, the RAID controller begins the RAID5 recovery. Robin is correct that RAID recovery after a DPF will become less and less useful as drive capacities grow. But with better integration of SMART and significant improvements in the predictability of drive failures, the frequency of RAID5 and RAID6 reconstruction failures is dramatically lower than the referenced articles predict, because these reconstructions are triggered by UREs and not DPFs.
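
A minimal sketch of that kind of policy is below. The names, the threshold, and the stub functions are all hypothetical, invented here for illustration; real controller firmware is far more involved and vendor-specific.

```python
# Hypothetical sketch of a URE-threshold rebuild policy; not any vendor's firmware.

URE_THRESHOLD = 10            # illustrative value; real controllers expose tunables

class Drive:
    def __init__(self, name):
        self.name = name
        self.ure_count = 0
        self.rebuild_started = False

def reconstruct_from_parity(lba):
    """Stub: rebuild the block from the surviving drives plus parity."""
    return b"\x00" * 512

def rewrite_and_verify(drive, lba, data):
    """Stub: rewrite the block in place; the drive's microcode remaps a bad sector."""
    pass

def start_preemptive_rebuild(drive):
    """Stub: start building a spare by copying from the still-readable DYING drive."""
    print(f"Preemptive rebuild started, copying from {drive.name}")

def on_unrecoverable_read(drive, lba):
    """Handle a URE reported by a member drive of a RAID5 volume."""
    data = reconstruct_from_parity(lba)        # satisfy the host I/O from parity
    rewrite_and_verify(drive, lba, data)       # repair the suspect sector in place
    drive.ure_count += 1
    if drive.ure_count >= URE_THRESHOLD and not drive.rebuild_started:
        drive.rebuild_started = True
        start_preemptive_rebuild(drive)        # rebuild before a DPF, not after one
    return data

# Example: the tenth URE on the same drive triggers the preemptive rebuild.
d = Drive("drive-3")
for lba in range(10):
    on_unrecoverable_read(d, lba)
```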

Look at the specifications for the RAID controller you use.

When is RAID recovery initiated? Upon the occurrence of an Unrecoverable Read Error (URE) or upon the occurrence of a Drive Permanent Failure (DPF)?

Several people have proposed that ZFS with multiple copies is the way to go. While that addresses the issue, I submit to you that it is at the wrong level of the stack. Mirroring at the block level, with the option to have multiple mirrors, is the correct (IMHO) solution. Disk block error recovery should not be handled in the file system.

9 thoughts on “It is 2010 and RAID5 still works …”

  1. What we are hoping is that a rebuild is not affected by a previously unknown URE on another drive in the RAID set. The larger the drives are, the more likely you are to see a URE during the rebuild of the first (failed) drive.

    So there’s the problem, right? If you are already rebuilding a drive, and there is a URE on another drive, how can the data for that sector be rebuilt? It can’t.

    (NOTE: the data could be rebuilt if the original failed drive is still in the array, and has not experienced a DPF. If that drive is still online, AND the original data can be mined from that drive’s sector, then the unanticipated URE on the other drive would not stop the rebuild. This is total speculation on my part, and I don’t know if any array manufacturers write their firmware this way).

    As drive sizes continue to grow and the URE rate stays the same, we are getting closer to the day when we are guaranteed to see a failure during the rebuild. Even though we aren't there today (i.e., RAID 5 is not totally broken), we are close enough in specific instances: large RAID 5 arrays built on big SATA drives.

    Many customers are using SATA arrays for second- and third-tier storage and when they go SATA, they want big drives. My customers are chomping (champing?) at the bit for 2TB drives now, and will be demanding 3 and 4 TB drives in the coming 12-18 months.

    For them, in those situations, RAID 5 is broken now. If there is a high probability (anything over, say, 5% IMHO) that the rebuild will fail, then the "redundancy" of the data is failing and RAID 5 is broken. Of course, I pulled that 5% out of thin air; you may disagree. But it appears that there is a higher than 5% chance, right now, today, that a RAID 5 rebuild on a large array with large disks will fail.

    Thankfully, as I said, these large SATA implementations are on tier 2 or 3 storage, or are part of a disk-to-disk-to-tape mechanism, meaning it's not a primary copy of the data anyhow.

    RAID 5 is broken, just not for all circumstances. Today’s admins need to be aware of the consequences of scenarios that can lead to a failed rebuild. Most admins have never even questioned RAID 5’s viability, making the situation a little worse.

    My question would be, what about RAID 50? Does that help or hurt?

  2. I don't get why this is a remediation plan. At the point where the SMART recovery begins, why are we in any better shape than after a DPF? Is there some ability to still use the disk during that recovery in the event of a DPF during the recovery? And even if that were true (I don't know that it is), isn't yet another failure becoming more and more probable? What am I missing here?

    In terms of the RAID50 proposal: inefficient cost, diminishing returns, and the MTBNDMA (Mean Time Between New Disks Made Available) make it, methinks, a kludge. YMMV.

    1. At issue is the fact that a double fault is catastrophic with RAID5.

      Assume that a drive is DEAD. That means it will be failed, and no further I/Os can be sent to it. In that scenario, an unrecoverable read error from any of the N-1 surviving drives is a catastrophic failure of the volume.

      On the other hand, assume that a drive is DYING. Using some mechanism (SMART, a threshold, a crystal ball), the firmware detects that the drive is DYING. Recovery can be initiated and a new drive built with the contents of the DYING drive. During that recovery, the only cause of a catastrophic failure is an unrecoverable read error on two drives ON THE SAME BLOCK. I'm no betting man, but that seems to be an event with a probability of TinyNumber ^ 2.

      There's no doubt that when a drive is DEAD, a single unrecoverable read error causes the volume to be lost. The trick is to reduce the likelihood of a drive going DEAD unexpectedly.

      I submit to you that in the old days, drive failures were unpredictable. Today, drive failures are largely predictable and therefore can be handled through preemptive action(s).

      Sound reasonable?

      1. @Amrith: excellent response!

        Do we know if RAID controller manufacturers handle the problem as you described above regarding the DYING drives? That is, do we know that when a drive is predicted to die, the RAID rebuild is initiated by pulling/copying data from the dying drive (as opposed to rebuilding all of the data from the parity on the other drives)?

        Thanks for your input — much appreciated on my end!

        A

        1. Anthony,

          I don’t work for a drive/storage system vendor. I recollect that the ones that I used in previous jobs claimed to do this kind of predictive recovery and I checked with the local tech folks who confirmed that they do. I could not find it anywhere in their documentation beyond the fact that they allowed “advanced users” to configure thresholds.

          -amrith

  3. OK, I get it now. Entirely reasonable, unless I was in charge, in which case the probability of a catastrophic failure (an unrecoverable read error on two drives on the same block) approaches 100%.

  4. There are exactly zero RAID controller manufacturers that will initiate a RAID rebuild based on SMART data coming from the drive that indicates a pending failure. It was tried back in the 1990s, and it was a complete failure. The problem is that the actuator duty cycle increases (n-1)-fold during a RAID-5 (or 6) rebuild because of all the extra seeks, reads, and writes associated with the rebuild activity. This causes a dramatic increase in HDA temperature and a resulting risk of catastrophic failure of the "marginal" drive, which is especially true for crappy desktop-class disks that are not built to handle more than a 20% actuator duty cycle for any extended period of time. Imagine you are driving along and your engine starts failing: you are blowing oil and antifreeze out the tailpipe. In response, you step on the gas in the hope of getting to the service station before the engine totally blows up. By stepping on the gas (of course) you wind up blowing the engine immediately, when you could almost certainly have made it to the service station had you gone slowly instead.

    Hopefully you’ve got the picture.

    In addition, the RAID-5 rebuild process itself is hard on all the drives in the array, not just the “failing” drive, and UREs will increase with drive temperature across the entire array — further increasing the chances of a catastrophic failure.

    Finally, with 2, 3, and soon 4 TB drives, the amount of time required to complete a RAID-5 rebuild can now stretch into DAYS! Desktop-class disks weren't built to withstand that kind of abuse even for a few hours…

    Given the ratio of URE probability to disk capacity, RAID-5 is dead; there's no getting around it. Best to just get used to the idea and get on with life…

    Disclaimer: Yes, I do work for a RAID manufacturer, and yes, I have (in the past) designed RAID hardware and software.

  5. Your article seems to address a failure mode that Robin, and all of the opponents of RAID 5 that I know, already take for granted. A URE has no impact until a drive has failed. UREs encountered during the resilver operation are the only ones that we are discussing, and this is why RAID 1 and RAID 10 are not affected.

    So if you have a controller that rebuilds based on UREs without a failed drive, you are in a category of risk far beyond what Robin has even considered.
