I just picked up a neat blog post by Martin Glassborow (@storagebod) discussing a double disk failure in a RAID 5 array which (apparently) caused lost data.
RAID stands for Redundant Array of Independent Disks, and the number refers to the type of protection offered. RAID 5 means that the group carries one extra disk's worth of recovery information, so the data can survive a single disk failure.
Modern disks fail about once every 137 years on average (a 0.72993% chance of any single drive failing per year), so RAID 5 is usually OK. But there is a window between the failure occurring and the failed disk being replaced and the protection restored (TTR, Time to Repair), and during that window our data is at risk.
That is what happened in the case in Martin’s blog: the second drive failed during the TTR, before the protection could be recovered. The chances of a second drive failing in a single 8 drive RAID group during the 24 hours following the first failure are remote (I think I am right in saying it is about 0.014%, or roughly one chance in 7,144, assuming the seven surviving drives fail independently at the rate above).
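For anyone who wants to check the arithmetic, here is a rough back-of-envelope sketch in Python. It assumes the 137 year figure quoted above, that drives fail independently at a constant rate, and a 24 hour TTR; the function names are mine and purely illustrative. Under those assumptions the chance of a second failure comes out at about 0.00014 as a probability, which is roughly one chance in 7,144 for each 24 hour window.

```python
# Back-of-envelope estimate only. Assumes drives fail independently at a
# constant rate, which a bad batch of disks would certainly violate.
MTBF_YEARS = 137           # average time between failures quoted in the post
DRIVES_IN_GROUP = 8        # size of the RAID 5 group
TTR_HOURS = 24             # time to repair after the first failure
HOURS_PER_YEAR = 365 * 24


def annual_failure_rate(mtbf_years):
    """Chance that a single drive fails in any one year."""
    return 1.0 / mtbf_years


def second_failure_probability(drives, mtbf_years, ttr_hours):
    """Chance that at least one surviving drive fails within the TTR window."""
    survivors = drives - 1
    per_drive = annual_failure_rate(mtbf_years) * ttr_hours / HOURS_PER_YEAR
    return 1.0 - (1.0 - per_drive) ** survivors


afr = annual_failure_rate(MTBF_YEARS)
p = second_failure_probability(DRIVES_IN_GROUP, MTBF_YEARS, TTR_HOURS)
print(f"annual failure rate per drive: {afr:.5%}")
print(f"second failure within {TTR_HOURS}h: {p:.4%} (about 1 in {1 / p:,.0f})")
```

Real disks do not fail independently (same batch, same age, same vibration and heat), so treat the result as an optimistic lower bound rather than a prediction.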
The chance of catastrophic failure escalates with the TTR, the length of time it takes to replace the disk and have it recover. Some designs use a hot spare disk that can be brought into play straight away, and then we just have 8 hours or so of rebuild time (or longer on large slow disks such as 1 or 2TB SATA). Other designs use RAID 6, which has two data protection disks and can recover from two disk failures in the same RAID group with no data loss. Lose three disks on RAID 6 (that would be most clumsy and very unlucky) and we lose data.
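To get a feel for how the risk grows with the TTR, here is a minimal sketch under the same simplifying assumptions as before: independent drives, a constant failure rate based on the 137 year figure, and an 8 drive group. The `loss_probability` function and the example TTR values are my own illustration, and the RAID 6 column is only a crude binomial estimate of two further drives failing inside the same window.

```python
from math import comb

MTBF_YEARS = 137           # average time between failures quoted in the post
HOURS_PER_YEAR = 365 * 24


def loss_probability(drives, ttr_hours, parity_drives=1):
    """Chance that enough surviving drives fail during the TTR to lose data.

    After the first failure, data is lost once `parity_drives` more drives
    fail before the repair completes (1 for RAID 5, 2 for RAID 6).
    """
    survivors = drives - 1
    # Per-drive chance of failing inside the window (constant-rate assumption).
    p = ttr_hours / (MTBF_YEARS * HOURS_PER_YEAR)
    # Probability that fewer than `parity_drives` survivors fail (binomial).
    safe = sum(
        comb(survivors, k) * p**k * (1 - p) ** (survivors - k)
        for k in range(parity_drives)
    )
    return 1.0 - safe


scenarios = [
    ("8h rebuild onto hot spare", 8),
    ("24h replacement", 24),
    ("1 week unnoticed", 7 * 24),
    ("30 days unnoticed", 30 * 24),
]
for label, hours in scenarios:
    raid5 = loss_probability(8, hours, parity_drives=1)
    raid6 = loss_probability(8, hours, parity_drives=2)
    print(f"{label:28s} RAID 5: {raid5:.4%}   RAID 6: {raid6:.7%}")
```

The absolute numbers matter less than the shape: in this model the RAID 5 risk grows roughly in proportion to the TTR, so a failure that sits unnoticed for a month is about thirty times more exposed than one fixed within a day.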
We didn’t have RAID 6 in this case and the second drive did fail and it caused data loss. Even though the disks were (apparently) part of a bad batch, my operational experience tells me that there is another issue at play here.
I sense that this double drive failure probably means that the guys using the array didn’t notice that the first disk had failed, so it was never replaced and recovery never got the chance to happen. So the array was running without protection for a longish while and then the second disk failed, ouch!
This sort of thing is not unusual in a development or unstructured environment, or even when systems management is misconfigured. I have seen the same thing happen with dual power supplies in servers and network switches, and also with network links. If a piece of infrastructure is not being proactively maintained it WILL eventually fail; it’s just a matter of maths.
The sort of stupid problems that cause this are: dial home equipment that is misconfigured, problems with the phone line, switched off modem, misconfigured instrumentation and firewalls blocking SNMP traffic. Of course there is always the administrator being on holiday, weekends, public holidays and change of staff. Catastrophic failure is almost always caused by something really simple and stupid.
The single biggest weakness in storage is the storage administrator. If we rely on humans, they will sometimes screw up. Perhaps systems that recover from drive failure automatically, without human intervention, such as DataDirect Networks’ WOS and IBM’s XIV, are likely to be more reliable in the real world.