Fresh Thinking on IT Operations for 100,000 Industry Executives

Back in 2008, Steve O’Donnell wrote an article here on The Hot Aisle explaining one of the challenges he set his team during his time at BT, the difficult task of getting Asset Management right.

To summarise, Steve kicked off an audit of the whole estate, and where owners couldn’t be found for kit on the floor, the hard line was taken of switching it off. Some developers and engineers got annoyed when their precious server was threatened with shutdown, but once the reason was explained, there was a surge of people updating the CMDB to make sure that nothing they needed was left unaccounted for.

It soon became apparent that much of this kit was no longer in use, which enabled BT to switch off 10% of its server estate, saving roughly $7M in electricity costs alone.

Job done?  Absolutely not.

Over the coming days, I will be blogging about the power that real knowledge of your Data Centre estate can bring, the issues it will help eliminate, and the tools that I have developed to harness this data and provide automated management reports to data centre managers, strategic data centre planners and space management boards alike.

First up, Power Outages and Load Balancing.

As the demand for Data Centre space increased, BT faced the difficult issue of power outages. PDUs were regularly tripping, causing a failover to other PDUs. It became apparent that we faced the risk of cascade failure, where a single PDU tripping out could swamp others and cause a data centre to fail.
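To make the cascade risk concrete, here is a minimal sketch of how such a failure could be simulated. It is an illustration only, not something that was run at BT: the function name, the dictionary inputs and the rule that a tripped PDU's load is split equally across its surviving failover partners are all assumptions.

```python
def simulate_cascade(capacity_kw, load_kw, failover, first_trip):
    """Return the PDUs that trip, in order, if `first_trip` fails.

    capacity_kw: PDU name -> rated capacity in kW
    load_kw:     PDU name -> current load in kW
    failover:    PDU name -> list of PDUs that pick up its load

    Assumption: a tripped PDU's load is split equally across its
    failover partners that have not yet tripped; load with no
    surviving partner is simply lost.
    """
    load = dict(load_kw)  # work on a copy so the caller's data is untouched
    tripped = []
    pending = [first_trip]
    while pending:
        pdu = pending.pop(0)
        if pdu in tripped:
            continue
        tripped.append(pdu)
        survivors = [p for p in failover.get(pdu, []) if p not in tripped]
        for p in survivors:
            load[p] += load[pdu] / len(survivors)
        load[pdu] = 0.0
        # Any partner pushed past its capacity will trip in turn.
        for p in survivors:
            if load[p] > capacity_kw[p] and p not in pending:
                pending.append(p)
    return tripped
```

Running this for each PDU in turn would show which single trips stay contained and which would take out the whole floor.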

However, we realised that it wasn’t simply a case of “That’s it, we’ve used all our PDU capacity, we need to invest in new ones!”

Over the years, the loading of PDUs hadn’t always been done methodically or fully thought through. Someone would guess that PDU1 was running at about 30%, so let’s attach this new server to that, and to be on the safe side, let’s dual feed it from PDU3. Often the PDU attachment was never recorded for a server.

When power demand started getting high, problems were encountered. Load was not evenly balanced across the PDUs, and what’s more, there were no records to identify where the balance needed addressing.

The simple question of “Where are my business critical apps?” could easily be answered following the clean up and continued management of the CMDB, but the question of “Are these apps running on equipment which is resilient to power failures, dual fed, on evenly loaded PDUs?” could not!

There was a gap in our knowledge, and reliable power feeds were at risk because of it!

I spoke to Steve about this and said that if we had a record of all PDUs within a site, their capacities, and the kit that they were feeding, I could provide the following reports to quickly tackle this problem and give the guys targets for where to begin.

1) The load on each PDU within a data centre, including kW used, kW remaining, % loaded and % free

2) A list of all single, dual and triple fed equipment and the load on each PDU feeding that equipment

3) A list of all single fed equipment hosting business critical applications, and of all dual and triple fed equipment hosting non-critical development environments
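The three reports above could be produced along the following lines. This is a minimal sketch, not the code of the tool itself: the `Pdu` and `Device` records, their field names, and the assumption that a device's load is shared equally across its feeds are all illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Pdu:
    name: str
    capacity_kw: float


@dataclass(frozen=True)
class Device:
    name: str
    load_kw: float           # estimated draw of the device
    feeds: Tuple[str, ...]   # names of the PDUs feeding it
    critical: bool           # hosts a business critical application?


def pdu_load_report(pdus, devices):
    """Report 1: kW used, kW remaining, % loaded and % free per PDU."""
    used = defaultdict(float)
    for d in devices:
        for feed in d.feeds:
            # Assumption: load is shared equally across all feeds.
            used[feed] += d.load_kw / len(d.feeds)
    report = {}
    for p in pdus:
        pct = 100.0 * used[p.name] / p.capacity_kw
        report[p.name] = {
            "kw_used": round(used[p.name], 2),
            "kw_remaining": round(p.capacity_kw - used[p.name], 2),
            "pct_loaded": round(pct, 1),
            "pct_free": round(100.0 - pct, 1),
        }
    return report


def feed_report(devices, load_report):
    """Report 2: devices grouped by feed count, with each feeding PDU's % load."""
    grouped = defaultdict(list)
    for d in devices:
        loads = {f: load_report[f]["pct_loaded"] for f in d.feeds}
        grouped[len(d.feeds)].append((d.name, loads))
    return dict(grouped)


def mismatch_report(devices):
    """Report 3: critical kit on one feed, and non-critical kit on two or more."""
    at_risk = [d.name for d in devices if d.critical and len(d.feeds) == 1]
    over_fed = [d.name for d in devices if not d.critical and len(d.feeds) >= 2]
    return at_risk, over_fed
```

The first list from `mismatch_report` is where to start: critical kit that a single PDU trip would take down. The second list shows where resilient feeds could be freed up for it.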

How were we able to answer these questions? Most of the data was already there! An audit of the estate provided the location of kit, the kit models, and the applications that ran on them. Knowing the kit model meant that we could integrate 3rd party data telling us the theoretical power utilisation of that kit (which could be factored down to provide more accurate, real world figures).
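As a rough illustration of that factoring down, the sketch below scales a model's theoretical draw by a derating factor. The model names, the wattages and the 0.6 factor are made-up placeholders, not figures from BT or from any vendor data.

```python
# Hypothetical theoretical (nameplate) draw per kit model, in kW.
NAMEPLATE_KW = {"model-a": 0.75, "model-b": 1.2}

# Assumed fraction of the theoretical figure drawn in practice.
DERATING_FACTOR = 0.6


def estimated_load_kw(model, nameplate=NAMEPLATE_KW, factor=DERATING_FACTOR):
    """Estimate real world draw by factoring down the theoretical figure."""
    if not 0 < factor <= 1:
        raise ValueError("derating factor must be in (0, 1]")
    return nameplate[model] * factor


def rack_load_kw(models, nameplate=NAMEPLATE_KW, factor=DERATING_FACTOR):
    """Sum the estimated draw of every piece of kit in a rack."""
    return sum(estimated_load_kw(m, nameplate, factor) for m in models)
```

In practice the factor would be tuned against measured readings where they exist, since a single figure will not suit every model.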

Once these reports were available, we could go about phase 1: resolving the load balancing issues and deciding where we might need to invest in added PDU capacity.

So the audit began. I ran a report listing the equipment, rack by rack, and the M&E guys went about collecting the data and feeding it back to me.

In the meantime, I developed the tool which would digest this data and return the reports as promised. The tool was web based and securely accessed over the Intranet, with access managed and given only to those who needed the information. With that, phase 1 of the Data Centre Power Tool was complete.

This phase of the tool was literally knocked up overnight. We had the correct processes in place for gathering the data, and I had the skills to manage the team and communicate exactly what was needed and, more importantly, why I was asking for it. I used my knowledge of data centre infrastructure alongside development skills to begin building a powerful and invaluable management tool. And of course, I had Steve to call upon should we run into any stumbling blocks!

The load balancing was soon addressed, some new PDUs were purchased and installed, and the whole operation began to run a lot more smoothly. Processes were put in place to record and maintain the PDU linkages to kit inside the CMDB, and the Load Balancing tool was left within BT for continued use.

Getting the data right within the CMDB, our collective understanding of data centre infrastructure and the development skills on hand helped us solve, within weeks, a problem that could have been very expensive and very embarrassing. Integrating this tool with the client CMDB meant that this sort of issue should never arise again within BT.

Later, I will blog about how I developed this tool into a powerful strategic planning system which was used by both the M&E Infrastructure Team and the Space Management Board to aid the process of planning “where to place equipment” in our data centres.

There Are 6 Responses So Far.

  1. […] we have an article on preventing cascading failures in the data center. Balancing power loads is absolutely critical to data center reliability, especially when 3 phase […]

  2. […] This morning I was reading about preventing cascading failures in the data center via power load balancing. I’ve written previously about not allowing data center failures to […]

  3. Nice article Pete. Very interesting and look forward to the next one..

  4. […] This post was mentioned on Twitter by Steve O'Donnell and vburke, Martin Williams. Martin Williams said: RT @stephenodonnell: Knowledge is Power […]

  5. 1974 I was responsible for the Northern end of NORAD/SAC’s 4300 mile microwave link.
    I didn’t need anybody to tell me that power outages were a Bad Thing.

    2010 … and people go on at length, sounding


  6. Hi Ben, I think the idea behind this isn’t to sound sophisticated, but to put out there that with the correct business processes and tooling, the risk of power outages can be managed and reduced. We know that power outages are bad. Managing that risk is a different matter.

    Past experience has shown that despite how obvious the management of this may sound, it isn’t always implemented successfully. The article is just my experience of how we’ve successfully reduced this risk, increased reliability, retained customer satisfaction, and ultimately saved money.
