The Hot Aisle
Fresh Thinking on IT Operations for 100,000 Industry Executives

The consequences of data center failure can be catastrophic for your business, so unsurprisingly you will have very high expectations about availability and reliability. Unfortunately for many of us, those expectations will not be met by reality. Very often the data center facility, taken as a whole system, is significantly less reliable than we would wish and less resilient than we have been told to expect.

The first thing we should look at is the general approach that engineers have taken to building highly resilient components in our data centers, and then how these components are linked together to create a resilient end-to-end system.

Resilience Approaches. Typically high availability is achieved by providing additional redundant components or systems that can take over if a primary system fails. An example would be a server with dual power supplies and dual power cords, where one supply carries the load if the other supply or its power feed fails.

Sometimes resilience is provided on an N+1, N+2 or N+N basis. In the case of N+1, we supply a single redundant unit to back up a set of on-load units. For example, we might need three 1MW generator sets but equip the site with four, to deal with a situation where one won't start or is in maintenance. N+2 adds two spare units, and N+N duplicates the whole set.
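The sizing arithmetic behind these schemes can be sketched in a few lines of Python. This is only an illustration: the function names `units_required` and `units_required_n_plus_n` are made up for this example, and it assumes all units have the same rating.

```python
import math


def units_required(load_kw: float, unit_kw: float, spares: int = 1) -> int:
    """Units to install for N+spares: N on-load units plus the spare units.

    Assumes identical units; N is the smallest count that carries the load.
    """
    n = math.ceil(load_kw / unit_kw)
    return n + spares


def units_required_n_plus_n(load_kw: float, unit_kw: float) -> int:
    """Units to install for N+N: a full duplicate of the on-load set."""
    n = math.ceil(load_kw / unit_kw)
    return 2 * n


# The generator example from the text: a 3MW load served by 1MW sets.
print(units_required(3000, 1000, spares=1))   # N+1
print(units_required(3000, 1000, spares=2))   # N+2
print(units_required_n_plus_n(3000, 1000))    # N+N
```

For the 3MW example this gives four sets for N+1, five for N+2 and six for N+N.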

Application Resilience. By providing smart application code or underlying clustering software, simple applications can be made resilient and able to withstand an outage. Often DNS or smart network hardware can provide resilience for Web-based applications that do not need to maintain application state between requests.

Server Resilience. Server equipment is often fitted with multiple power supplies and multiple power cords (mostly two, but sometimes more on large systems). These protect against a single power stream failing if, and only if, each power cord is connected to a separate power stream. If the cords are connected to power streams that share a piece of infrastructure, that shared component is a single point of failure for the server.
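One way to check for shared infrastructure is to list every upstream component on each cord's power path and look for anything that appears on all of them. The sketch below assumes you can describe each path as a list of component names; the names used (`strip-A`, `PDU-1`, `switchboard-1` and so on) and the function `single_points_of_failure` are hypothetical, chosen only for this example.

```python
def single_points_of_failure(paths):
    """Return the components that appear on every power path.

    paths: one list of upstream component names per power cord.
    Assumes the server keeps running as long as any one cord is live,
    so only a component common to all paths can take it down alone.
    """
    if not paths:
        return set()
    return set.intersection(*(set(p) for p in paths))


# Hypothetical server: the two cords look independent at cabinet level
# but meet again at a shared switchboard further upstream.
cord_a = ["strip-A", "PDU-1", "UPS-A", "switchboard-1"]
cord_b = ["strip-B", "PDU-2", "UPS-B", "switchboard-1"]

print(sorted(single_points_of_failure([cord_a, cord_b])))
```

An empty result means the cords share nothing that was listed. It says nothing about components left out of the lists, so the check is only as good as the path data.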

Sometimes servers have additional resilience capability, including dual network connections with a form of IP multi-pathing that rides through a single network outage. Almost all server disk subsystems are equipped with disks connected in a redundant way to provide resilience in case of a disk failure.

Cabinet Resilience. Cabinets in data centers are almost always fitted with dual power strips so that dual-fed servers can easily be connected to two power feeds and keep operating through a single power stream failure. Note that neither power supply in a dual-fed server should be loaded above 50% of its capacity, because a single power failure would then take the remaining supply above 100% and cause a cascade failure. Single-fed servers are superficially a cheap alternative, but they can neither withstand a power failure nor stay online while planned maintenance is carried out on their feed.

PDU Resilience. PDUs need to be configured so that at least two PDUs supply each cabinet. This means that properly connected servers in the cabinets can survive a single PDU failure. Equally, PDUs must never be loaded above 50% of capacity: if one unit fails, both loads fall on the surviving unit, and overloading it causes a cascade failure.
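The 50% rule for a redundant pair, whether two power supplies or two PDUs, comes down to one question: can the survivor carry both loads? A minimal sketch of that check follows, assuming both units have the same capacity. The function name `pair_survives_failure` and the figures are illustrative only.

```python
def pair_survives_failure(load_a_kw: float, load_b_kw: float,
                          capacity_kw: float) -> bool:
    """True if either unit of a redundant pair can carry both loads alone.

    Assumes both units have the same capacity and that all load on the
    failed unit transfers to the survivor.
    """
    return load_a_kw + load_b_kw <= capacity_kw


# Two hypothetical 10kW PDUs feeding the same cabinets.
print(pair_survives_failure(4.5, 4.5, 10.0))  # both at 45%
print(pair_survives_failure(6.0, 6.0, 10.0))  # both at 60%
```

Keeping each unit at or below 50% guarantees the combined load fits on one unit. Dual-fed servers normally split their draw roughly evenly, which is why the rule is usually stated per unit.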

UPS Resilience. Uninterruptible Power Supplies are usually connected in an N+1 or N+N configuration. Care must be taken that no unit is loaded beyond the point where picking up the additional load from a failed UPS would overload it and trigger a cascade failure. For example, three 500kW units connected in N+1 could support a total load of 1MW: 333kW each in normal operation and 500kW each if one unit failed. To provide an N+N design with the same load would require four UPS units configured to support no more than 250kW each.
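The UPS figures above can be reproduced with a short calculation. This is a sketch under simple assumptions: identical units, load shared evenly across the units still running, and helper names (`max_safe_load_kw`, `per_unit_load_kw`) invented for the example.

```python
def max_safe_load_kw(units: int, unit_kw: float, redundant: int) -> float:
    """Largest total load that still fits after `redundant` units fail."""
    return (units - redundant) * unit_kw


def per_unit_load_kw(total_kw: float, units: int, failed: int = 0) -> float:
    """Load on each running unit, assuming the load is shared evenly."""
    return total_kw / (units - failed)


# N+1: three 500kW units, one of them redundant.
total = max_safe_load_kw(units=3, unit_kw=500, redundant=1)
print(total)                                         # total supportable load
print(round(per_unit_load_kw(total, 3), 1))          # each unit, normal running
print(per_unit_load_kw(total, 3, failed=1))          # each unit, one failed

# N+N: four 500kW units, two of them redundant, same total load.
print(max_safe_load_kw(units=4, unit_kw=500, redundant=2))
print(per_unit_load_kw(total, 4))                    # each unit, normal running
```

The key number is the first one: the total load must be capped at what the surviving units can carry, not at the sum of all installed capacity.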


Engine Resilience. Engine design needs to follow the same rules, with excess capacity to provide resilience through a failure as well as the ability to perform maintenance without taking our data center offline.

In the real world, with equipment being moved into and out of data center halls continuously, it is almost impossible to maintain the proper power balance across each of these components, and many sites drift into a situation where a cascade failure will occur when a single power component fails. Most sites recalculate only every few weeks or months and then, in a panic, reconfigure and rebalance their power feeds when an imbalance is detected.
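Because the balance drifts every time equipment moves, the recalculation is worth automating from whatever inventory the site keeps. The sketch below is one possible approach, not a description of any particular tool. It assumes a hypothetical inventory of dual-fed servers, each recorded as (name, first PDU, second PDU, watts), and assumes each server draws evenly from both feeds.

```python
from collections import defaultdict


def audit_pdu_failures(servers, capacity_w):
    """List (failed_pdu, overloaded_pdu, load_w) for every single PDU
    failure that would push a surviving PDU over its capacity.

    servers: iterable of (name, pdu_a, pdu_b, watts) for dual-fed servers.
    capacity_w: dict mapping each PDU name to its capacity in watts.
    Assumes each server splits its draw evenly across its two PDUs.
    """
    normal = defaultdict(float)                          # load per PDU today
    moved = defaultdict(lambda: defaultdict(float))      # moved[failed][survivor]
    for _name, pdu_a, pdu_b, watts in servers:
        half = watts / 2
        normal[pdu_a] += half
        normal[pdu_b] += half
        moved[pdu_a][pdu_b] += half   # if pdu_a fails, pdu_b picks this up
        moved[pdu_b][pdu_a] += half   # and the other way round
    problems = []
    for failed, transfers in moved.items():
        for survivor, extra in transfers.items():
            load = normal[survivor] + extra
            if load > capacity_w[survivor]:
                problems.append((failed, survivor, load))
    return sorted(problems)


# Hypothetical inventory: PDU-2 has drifted to 75% load in normal running.
servers = [
    ("web1", "PDU-1", "PDU-2", 4000),
    ("db1", "PDU-1", "PDU-2", 5000),
    ("app1", "PDU-2", "PDU-3", 6000),
]
capacity = {"PDU-1": 10000, "PDU-2": 10000, "PDU-3": 10000}

for failed, survivor, load in audit_pdu_failures(servers, capacity):
    print(f"If {failed} fails, {survivor} carries {load:.0f}W")
```

Run on every equipment move rather than every few months, a check like this flags the drift before a real failure exposes it.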


There Is 1 Response So Far. »

  1. [...] Data Center? teve O’Donnell, former Global Head of Data Centres at British Telecom has a new blog post on the subject of Data Center reliability where he explains “why your Data Center will fail eventually and you will be affected”. [...]
