Just how reliable is my Data Center?
April 30th, 2008 | by Steve O'Donnell
The consequences of data center failure can be catastrophic for your business, so unsurprisingly you will have very high expectations of availability and reliability. Unfortunately, for many of us those expectations will not be met by reality. Very often the system that is our data center facility is significantly less reliable than we might wish, and less resilient than we have been told to expect.
The first thing to look at is the general approach engineers have taken to building highly resilient components in our data centers, and the second is how those components are linked together to create a resilient end-to-end system.
Resilience Approaches. Typically, high availability is achieved by providing additional redundant components or systems that can take over if a primary system fails. An example is a server with dual power supplies and dual power cords, where one supply can carry the load if the other supply or its power feed fails.
Sometimes resilience is provided on an N+1, N+2 or N+N basis. In the case of N+1, we supply a single redundant unit to back up a set of on-load units. For example, if we need three 1MW generator sets, we equip the site with four sets to cover the case where one won't start or is out for maintenance.
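The unit counts implied by these schemes can be sketched in a few lines of Python. This is an illustration only: the function name `units_required` and the way the scheme is passed as a string are assumptions made for the example, not part of any standard tool.

```python
import math

def units_required(load_kw, unit_kw, scheme):
    """Return the number of units to install for a redundancy scheme.

    scheme is one of "N", "N+1", "N+2" or "N+N" (labels as used in the text).
    The function and its parameters are hypothetical, for illustration.
    """
    # N is the number of on-load units needed to carry the full load.
    n = math.ceil(load_kw / unit_kw)
    # Spare units added on top of N by each scheme.
    spare = {"N": 0, "N+1": 1, "N+2": 2, "N+N": n}[scheme]
    return n + spare

# The generator example from the text: a 3MW load on 1MW sets with N+1.
print(units_required(3000, 1000, "N+1"))  # 4
```

The same 3MW load on an N+N basis would need six sets, which is why N+N is usually reserved for the components whose failure hurts most.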
Application Resilience. With smart application code or underlying clustering software, simple applications can be made resilient and able to withstand an outage. Often DNS or smart network hardware can provide resilience for Web-based applications that do not need to maintain application state between transfers.
Server Resilience. Server equipment is often equipped with multiple power supplies and multiple power cords (mostly two, but sometimes more on large systems). These protect against a single power stream failing if, and only if, each power cord is connected to a separate power stream. If the cords are connected to power streams that share a piece of infrastructure, that shared component is a single point of failure for the server.
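One way to check the "separate power stream" condition is to list every upstream component feeding each cord and look for anything the cords have in common. The sketch below assumes the server keeps running as long as any one cord is live; the component names and the `shared_components` helper are hypothetical.

```python
def shared_components(cord_paths):
    """Return the upstream components that every power cord depends on.

    cord_paths is a list with one collection of component names per cord.
    A non-empty result means the server has a single point of failure.
    """
    paths = [set(p) for p in cord_paths]
    if len(paths) < 2:
        # A single-corded server depends on everything upstream of it.
        return set().union(*paths)
    return set.intersection(*paths)

# Hypothetical servers: both look dual fed at the cabinet.
well_fed = [{"strip-A", "PDU-1", "UPS-A"}, {"strip-B", "PDU-2", "UPS-B"}]
badly_fed = [{"strip-A", "PDU-1", "UPS-A"}, {"strip-B", "PDU-2", "UPS-A"}]

print(shared_components(well_fed))   # set()
print(shared_components(badly_fed))  # {'UPS-A'}
```

The second server is plugged into two strips and two PDUs, yet both streams hang off the same UPS, so the dual cords buy nothing if that UPS fails.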
Sometimes servers have additional resilience capability, including dual network connections with a form of IP multi-pathing that rides through a single network outage. Almost all server disk subsystems are equipped with disks connected in a redundant way to provide resilience in case of a disk failure.
Cabinet Resilience. Cabinets in data centers are almost always fitted with dual power strips so that dual-fed servers can easily be connected to two power feeds and keep operating through a single power stream failure. Note that neither of a server's power supplies should be loaded above 50% of its capacity, because a single power failure would push the remaining supply above 100% and cause a cascade failure. Single-fed servers are superficially a cheap alternative, but they can neither withstand a power failure nor stay online while planned maintenance is carried out.
PDU Resilience. PDUs need to be configured so that at least two PDUs supply each cabinet. This means that properly connected servers in the cabinets can survive a single PDU failure. Equally, in a paired arrangement neither PDU should ever be loaded above 50% of its capacity: if one unit fails, the other must support both loads, and anything above 50% would cause a cascade failure.
UPS Resilience. Uninterruptible Power Supplies are usually connected in an N+1 or N+N configuration. Care must be taken that no unit is loaded so heavily that taking on the additional load from a failed UPS would overload it and trigger a cascade failure. For example, three 500kW units connected in N+1 could support a total load of 1MW: about 333kW each in normal operation and 500kW each if one unit failed. An N+N design for the same load would require four 500kW UPS units, each configured to support no more than 250kW.
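The arithmetic behind these figures is easy to check. The sketch below assumes the load is shared equally across the surviving units, and for the N+N case it assumes that a failure takes out one whole side, that is, two of the four units. The function names are hypothetical.

```python
def per_unit_load(total_load_kw, units, failed=0):
    """Load carried by each surviving unit, assuming equal sharing."""
    surviving = units - failed
    if surviving <= 0:
        raise ValueError("no surviving units to carry the load")
    return total_load_kw / surviving

def survives(total_load_kw, unit_kw, units, failed):
    """True if the surviving units stay at or below their rated capacity."""
    return per_unit_load(total_load_kw, units, failed) <= unit_kw

# N+1 from the text: three 500kW units carrying 1MW.
print(round(per_unit_load(1000, 3)))    # 333 kW each in normal operation
print(per_unit_load(1000, 3, failed=1)) # 500.0 kW each with one unit lost
print(survives(1000, 500, 3, failed=1)) # True

# N+N from the text: four 500kW units carrying the same 1MW.
print(per_unit_load(1000, 4))           # 250.0 kW each in normal operation
print(survives(1000, 500, 4, failed=2)) # True

# Load creep: the same N+1 set at 1.2MW no longer survives a failure.
print(survives(1200, 500, 3, failed=1)) # False
```

The last line shows how thin the margin is: an extra 200kW of load leaves each unit at only 400kW in normal operation, which looks comfortable until one unit drops out.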
Engine Resilience. Engine design needs to follow the same rules, with excess capacity to provide resilience through a failure as well as the capability to perform maintenance without taking the data center offline.
In the real world, with equipment being moved into and out of data center halls continuously, it is almost impossible to maintain the proper power balance across each of these components, and many sites drift into a situation where a single power component failure will trigger a cascade failure. Most sites recalculate every few weeks or months and then, in a panic, reconfigure and rebalance their power feeds when an imbalance is detected.
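A recalculation of this kind does not have to be a manual exercise. As a minimal sketch, assuming each cabinet is fed by a pair of identical PDUs and that measured loads are available, the check is simply whether the combined load of the pair still fits on one unit. The cabinet names, figures and the `cascade_risks` helper are all hypothetical.

```python
def cascade_risks(pairs):
    """Return (name, combined_kw) for every pair that one unit cannot carry.

    pairs is a list of dicts with keys name, capacity_kw, load_a_kw and
    load_b_kw, where capacity_kw is the rating of each unit in the pair.
    """
    risks = []
    for p in pairs:
        # If either unit fails, the survivor must carry both loads.
        combined = p["load_a_kw"] + p["load_b_kw"]
        if combined > p["capacity_kw"]:
            risks.append((p["name"], combined))
    return risks

# Hypothetical readings for two cabinets on 20kW PDU pairs.
readings = [
    {"name": "cab-01", "capacity_kw": 20.0, "load_a_kw": 8.0, "load_b_kw": 9.0},
    {"name": "cab-02", "capacity_kw": 20.0, "load_a_kw": 12.0, "load_b_kw": 10.0},
]

print(cascade_risks(readings))  # [('cab-02', 22.0)]
```

Run against live readings on every equipment move rather than every few months, a check like this would flag the drift before a failure exposes it.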

















