We all worry about failures in our data centers, some types of failure more than others. Losing a server is bad; losing a storage array is extremely bad! How, then, might we categorize a full power failure in a data center? Catastrophic certainly, even careless perhaps?
I can hear the questions already:
How could I have a full power outage? I have resilience. I have spare power equipment configured to take over seamlessly if I lose a power stream!
Ha ha, it can’t happen to me. I thought about this and spent myself out of trouble: I bought a Tier 4 site at huge expense so that I can be certain that catastrophic power failure cannot occur. I had engineers prove that we could survive any failure scenario.
Well, here is the rub: if you have not taken extreme care when adding and removing each piece of equipment, your nice expensive data center is about as reliable as your home!
Data Centers usually fail because of people screwing them up.
Let's take a real-life example. We have two PDUs supporting a row of racks. Because we did not pay attention when adding one new server, one of the PDUs is running at 51% of capacity and the other at 50%. A technical failure in one of the power streams feeding one of our PDUs shuts it down. We have been smart and all of our equipment is dual attached to both PDUs, so everything should be OK. Unfortunately not. The surviving PDU now has to supply 101% of its capacity, so it fails too and the whole row of servers shuts down! Ouch, we might even get some smoke and a big bang!
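The arithmetic in that example is easy to automate. Here is a minimal Python sketch of the check, assuming both PDUs have the same rated capacity and that every dual-attached device moves its full draw onto the surviving PDU. The `Pdu` class and the `survives_partner_loss` function are names I have made up for illustration; they are not taken from any product.

```python
from dataclasses import dataclass


@dataclass
class Pdu:
    name: str
    load_pct: float  # current load as a percentage of rated capacity


def survives_partner_loss(a: Pdu, b: Pdu, limit_pct: float = 100.0) -> bool:
    """Return True if one PDU alone could carry the pair's combined load.

    Assumption: both PDUs have the same rated capacity, so their
    percentage loads can simply be added together.
    """
    return a.load_pct + b.load_pct <= limit_pct


# The figures from the example above: one new server tipped PDU-A to 51%.
pdu_a = Pdu("PDU-A", 51.0)
pdu_b = Pdu("PDU-B", 50.0)

combined = pdu_a.load_pct + pdu_b.load_pct
print(f"Load on the surviving PDU after a failure: {combined:.0f}%")
if survives_partner_loss(pdu_a, pdu_b):
    print("Safe: either PDU can carry the row alone")
else:
    print("Overloaded: losing one PDU takes the whole row down")
```

The rule of thumb that falls out of this is that a dual-fed pair is only resilient while the two loads together stay at or below 100% of one unit, which in practice means keeping each side under 50%.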
The same thing can happen to UPS equipment - those of us in the UK might remember when Level 3 dropped their Goswell Road Data Center in 2006.
I have written about Reflector before with regard to how it can be used to manage data center moves and migrations, but I wanted to tell you about a new feature that my friend Martin Williams of Glasshouse Technologies has fitted to Reflector:
Real Time Power Stream Monitoring and Management. It is unique and absolutely brilliant - a fantastic use of the capabilities of modern data center M&E equipment to report on loadings and throughput. In my opinion, now that it is available and proven, we had better all start using it. I understand that there is huge interest from the M&E equipment manufacturers who want to get integrated into Reflector.
The screenshot above shows part of what Martin demonstrated to me: a real-time display of the status of every key piece of M&E plant in my data center. Real time because it was taking feeds from real live PDUs, like the Mardix iPDU in the picture below, and from other plant, and pulling them together into my dashboard.
That in itself is just amazing, but the add-on bit was incredible. Martin showed me how he had built the software to do impact analysis, for example the impact of a simulated failure. What if this UPS tripped out? How would the load reconfigure? Are there any devices that are at risk? What happens if two devices fail? Because it does it all with pictures it is completely idiot-proof, and it gave me a better vision of one of my data centers in 5 minutes than if I had spent days with a calculator, spreadsheet and wiring diagrams.
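To give a feel for what this kind of what-if analysis involves, here is a rough Python sketch of the calculator-and-spreadsheet version. It is not how Reflector works internally, which I have not seen. The inventory, the capacities and the function names are all invented for illustration, and the sketch assumes each load splits evenly across whichever of its sources are still alive.

```python
from itertools import combinations

# Hypothetical inventory: rated capacity of each power source in kW.
CAPACITY_KW = {"UPS-A": 100.0, "UPS-B": 100.0, "UPS-C": 100.0}

# Hypothetical loads: (name, draw in kW, the two sources it is attached to).
LOADS = [
    ("row-1", 60.0, ("UPS-A", "UPS-B")),
    ("row-2", 50.0, ("UPS-B", "UPS-C")),
    ("row-3", 40.0, ("UPS-A", "UPS-C")),
]


def simulate(failed):
    """Return (kW carried by each surviving source, loads left with no source)."""
    per_source = {s: 0.0 for s in CAPACITY_KW if s not in failed}
    dark = []
    for name, kw, sources in LOADS:
        alive = [s for s in sources if s not in failed]
        if not alive:
            dark.append(name)  # every feed to this load has failed
            continue
        for s in alive:
            per_source[s] += kw / len(alive)  # assumed even split
    return per_source, dark


def at_risk(failed):
    """Return (sources pushed over capacity, loads that lose all power)."""
    per_source, dark = simulate(failed)
    overloaded = sorted(s for s, kw in per_source.items() if kw > CAPACITY_KW[s])
    return overloaded, dark


# Walk through every single and double failure and report the impact.
for count in (1, 2):
    for failed in combinations(CAPACITY_KW, count):
        overloaded, dark = at_risk(failed)
        print(f"fail {', '.join(failed)}: overloaded={overloaded} dark={dark}")
```

Note that this only takes one step: it flags a source that would be overloaded but does not go on to trip it and work out what falls over next, which is exactly the cascade that bit us in the PDU example.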
How long did it take to put the data in to do the analysis? About 10 minutes and it was only as long as that because I had to go and pull the data out of a paper file. Staggeringly good.
With this technology in place it becomes simple to keep your Tier 4 data center protected and safe in real time. Without it, it might be a good idea to keep an eye on the job adverts! Glasshouse are really showing thought leadership in this space. Keep it up, Martin.





















