This article on HA and FT at ReadWriteWeb caught my eye. I used to work at Stratus a while ago, and it is not often that I hear their name these days. Stratus’ Fault Tolerant systems achieve their impressive uptime through hardware redundancy.
In very simple terms, if the probability that some component or sub-system fails is p, then the probability that two redundant copies fail at the same time is a much smaller p * p, provided the two failures are independent of each other.
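To put a number on that, here is a minimal Python sketch. It assumes the redundant components fail independently of one another, which is the simplification behind p * p; the function name and the example value of p are made up for illustration and are not Stratus figures.

```python
def simultaneous_failure_probability(p: float, copies: int = 2) -> float:
    """Probability that all `copies` redundant components are down at once.

    Assumes each copy fails independently with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p ** copies


# Hypothetical example: a component that is down 0.1% of the time.
p = 0.001
print(f"single component: {p:.6f}")
print(f"duplexed pair:    {simultaneous_failure_probability(p):.9f}")
```

Real failures are not always independent (a shared power feed or a software bug can take out both copies), so treat p * p as a best case rather than a promise.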
When I was at Stratus, we used to guarantee “five nines”, or an uptime of 99.999% on systems that ran credit card networks, banking systems, air traffic control systems, and so on. Systems where the cost of downtime could be measured either in hundreds of thousands or millions of dollars an hour, or in human lives potentially lost.
Before I worked at Stratus, I worked for a Stratus customer. My first experience with Fault Tolerance was when I received a box in the mail with a note that said something to the effect that a CPU board had failed in one of our systems about a month earlier, so would we please pop that board out and put this replacement board in its place.
We hadn’t even realized it; the system had been chugging along just fine!
So what does uptime % translate to in terms of hours and minutes?
99% uptime: 3.65 days of downtime per year
99.9% uptime: 8.76 hours of downtime per year
99.99% uptime: 52.56 minutes of downtime per year
99.999% uptime: 5.256 minutes of downtime per year
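The figures above are just the unavailable fraction multiplied by the length of a year. A small Python sketch of the arithmetic, assuming a 365-day year (the function and constant names are my own):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_seconds_per_year(uptime_percent: float) -> float:
    """Seconds of downtime per year implied by a given uptime percentage."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * SECONDS_PER_YEAR


for pct in (99.0, 99.9, 99.99, 99.999):
    seconds = downtime_seconds_per_year(pct)
    print(f"{pct}% uptime: {seconds / 60:.2f} minutes "
          f"({seconds / 3600:.2f} hours) of downtime per year")
```

Any other uptime percentage can be passed to the same function to see what it allows in a year.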
Stratus claims that across its customer base of 8000 servers the uptime is 99.9998%.
99.9998% uptime: 63 seconds of downtime per year.
Now, that’s pretty awesome!
And when I flew into Schiphol Airport, saw containers being loaded onto ships in Singapore, used my American Express Credit Card, logged into AOL, or looked at my 401(k) on Fidelity, I felt pretty darn proud of it!