How Efficient is my Data Center?

May 15th, 2008

A key management principle is that you can’t manage what you can’t measure. So if we want to manage the energy efficiency of our data center estate, then we had better be in a position to baseline where we are today so that we can identify the value of the changes that we implement.

The problem with measuring anything is deciding what to measure. Measuring data center efficiency is particularly difficult: where do you start and what do you measure? A good starting point is looking at what others do in this space.

A common measure is PUE - Power Usage Effectiveness, a simple measure of the total power delivered to the facility divided by the IT Equipment Power. So how, then, do we measure or calculate PUE?

To get the Total Power Delivered to the Facility, if we have instrumentation in our data center, we could measure the instantaneous power entering the building through the high voltage feed (often this feed has other uses including ancillary buildings, offices and other non-data center assets which will need to be excluded). Alternatively we could look at an average and use the utility electricity meter to count the number of MW hours over a fixed period (dividing the MW hour value by the number of hours we measure over gives us an average power value).
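The averaging approach is simple arithmetic, and a minimal sketch may help. The function name and the meter reading below are made up for illustration; they are not from any real site.

```python
def average_power_kw(energy_mwh: float, hours: float) -> float:
    """Average power in kW from metered energy (MWh) over a period (hours)."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    # 1 MWh = 1000 kWh, and kWh divided by hours gives kW.
    return energy_mwh * 1000.0 / hours


# Hypothetical reading: 720 MWh metered over a 30 day month (720 hours).
print(average_power_kw(720.0, 30 * 24))  # 1000.0 kW, i.e. an average of 1 MW
```

Remember that any non-data center load on the same meter still has to be subtracted before the figure is used for PUE.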

To get to the IT Equipment Power value we have two approaches. We could use the nameplate values (the vendor supplied power rating, usually attached to the equipment on a label) for each piece of equipment and simply add them up, or we could (if we are lucky enough to have them) use our iPDU devices to measure the total power delivered to the racks. Both of these approaches have weaknesses. In the former case, we will find that nameplate values are very poor measures of the real power consumed, as that will often depend on the number of accessories attached or the loading of the device (e.g. a network switch). Getting the values from the PDU is much better but will tend to be too large, as rack mounted fans and other non IT Equipment load will often be hanging off the PDU feeds.

An alternative simple measure to PUE is DCE - Data Center Efficiency, the reciprocal of PUE, i.e. IT Equipment Power divided by Total Power delivered to the facility. (Another term for the same measure is DCiE - Data Center infrastructure Efficiency.)
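As a rough sketch of the two ratios described above, the following shows PUE and its reciprocal side by side. The function names and the sample figures are illustrative assumptions only.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def dcie(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Reciprocal of PUE: IT equipment power / total facility power."""
    if total_facility_kw <= 0:
        raise ValueError("total facility power must be positive")
    return it_equipment_kw / total_facility_kw


# Hypothetical site: 1000 kW enters the facility, 500 kW reaches the IT equipment.
print(pue(1000.0, 500.0))   # 2.0
print(dcie(1000.0, 500.0))  # 0.5, i.e. 50%
```

Both inputs must be measured over the same period, otherwise the ratio is meaningless.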

The main weakness with these measures is that they focus exclusively on the building M&E effectiveness and fail to address efficiency in terms of computing, network and storage. In other words a good PUE does not mean that you have an effective data center if the servers are all running at 2% utilization and the storage is mostly empty. Nevertheless highly efficient M&E plant is a great place to start.

A fuller measure that is sensitive to more of the initiatives we can take to become energy efficient is CPE - Compute Power Efficiency. This is IT Equipment Utilization divided by PUE.

A facility with IT Equipment running at 5% utilization can have a much lower CPE than another running at 15% IT Equipment Utilization, even if the first has the better PUE. This is a much more realistic measure of what has been achieved. Plainly equipment running at 100% utilization with a PUE of 1 is the target that we will never get to, but nevertheless it should be an aspiration.
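To make the comparison concrete, here is a small sketch of CPE as defined above (IT Equipment Utilization divided by PUE). The two sites and their figures are invented for illustration.

```python
def cpe(it_utilization: float, pue: float) -> float:
    """Compute Power Efficiency: IT utilization (0..1) divided by PUE (>= 1)."""
    if not 0.0 <= it_utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1")
    return it_utilization / pue


# Hypothetical site A: good PUE of 1.5, but servers only 5% utilized.
# Hypothetical site B: poorer PUE of 2.0, but servers 15% utilized.
site_a = cpe(0.05, 1.5)
site_b = cpe(0.15, 2.0)
print(f"site A CPE: {site_a:.3f}")  # roughly 0.033
print(f"site B CPE: {site_b:.3f}")  # roughly 0.075
```

In this made-up case site B comes out more than twice as effective as site A despite the worse PUE. The unreachable ideal of 100% utilization at a PUE of 1 would give a CPE of 1.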

Even CPE is a rather one dimensional measure as it focusses on servers. There are many other areas of efficiency that we could bring into the equation:

  • Network Utilization
  • Storage Utilization
  • Data Center Utilization

Where we need to get to is a meaningful measure that is possible to calculate repeatedly (better still in real time) that we can use to compare our performance with other Data Center operators. Is it too fanciful to imagine the CPE measure being adopted by government to regulate data center operators in an energy constrained world?

How to Avoid Swimming in your Data Center

May 11th, 2008

Disaster

You might think that your data center is under the spotlight, constantly monitored, secure and well designed, but what if I told you that typically the second most common cause of catastrophic failure (after electrical) in your Data Center is water leaks?

Would your data center survive this?

Water can come from three main sources:

  • Leaks inside the Data Center from refrigeration equipment that uses water for cooling (CRAC and Water Cooled Cabinets)
  • Leaks from inside the building, WC and kitchen facilities, heating and cooling plant for offices
  • Water from outside the building such as flood water from rivers and storm drains, ground water, rain and damaged water mains.

We need to approach water leaks in three ways:

  • Initial survey to ensure that the location of our data center is not in a flood plain and that the site is well protected from external water sources. Confirmation that the fabric of the building is well enough designed and properly maintained to prevent the ingress of rain (even in extreme storm conditions). Check how sources of water inside the building are routed (check hot and cold water storage tanks, pipe runs, waste pipes, WCs, as well as water based fire suppression systems in the office space). Office space above a data center is almost always dangerous from a water ingress perspective. Look for opportunities for stupidity, like overflowing wash hand basins and WC units. Where would the water go?
  • Protection to ensure that any water entering the data center does not have the opportunity to build up and cause a problem. If you are really worried install drains and a sump pump under the plenum floor. Ensure the floor space is sealed and all cable routes through partition walls are stopped up to be air and water tight. This is essential for air handling efficiency and fire suppression also.
  • Monitoring is critical in a data center - we absolutely need to be able to detect water under the plenum floor. Most water detection systems use a cable run under the floor that triggers an alarm if it comes into contact with water. In a large site this can be extremely difficult to manage, as the water leak can be anywhere along the length of the cable run!

There is a very competent article in Automated Buildings that covers the different types of monitoring available.

Proactive Maintenance prevents Pyrotechnics

May 11th, 2008

A PDU Exploding

Way, way back in the stone age (whilst I was a student) I worked in the Edinburgh University Computer Labs keeping all of the hardware going. I had got a report of a DEC PDP 9 (drop me a mail if you know what one of these is) that kept restarting every few days all by itself. It would power down randomly and then just as strangely power itself back up again all for no apparent reason. One of the biggest pains was that when it restarted it needed an operator to program in the bootstrap loader in Octal on the front panel keys. (I have really made myself feel old now).

Well I tried everything I could to track down this problem, focussing on the power supply (that was attached to the back door of the equipment). In total frustration after hours of leaning into the computer with a voltmeter and oscilloscope, I straightened up, turned around and started to walk away for a coffee, and at that second the power supply exploded. The door flew off its hinges, clouds of white smoke and sparks gushed from the cabinet and the fire suppression system kicked in, sounding the alarm that it was about to release gas into the data center. I got out of there in a hurry.

I learned a few lessons that day, not all of them relating to running the 100 yard dash in record time. I learned that electricity is dangerous and data center equipment can explode if stressed and that there can be no visible evidence of what is going wrong until it is too late. You would think that a more sensible person would have changed career!

The root cause of the demise of the PDP 9 (see picture below - and no the guy with the nerdy haircut is not me) was a bad electrical connection to a set of very large electrolytic capacitors. These always have a tendency to explode if the terminals are reversed.

So what possible bearing does this have on modern data centers? Have a look at the two pictures taken at different times of the same circuit fuses below:

Before Repair

After Repair

They kind of look the same (mostly), except the top picture shows that the clips holding the fuses are hot (about 300 degrees Celsius) whilst the picture below shows the fuses after the clips have been repaired and replaced (these are still warm at 150 degrees Celsius but within design specs).

Want to know what happens if you don’t do the maintenance? Look at the picture at the top of this article: it shows a PDU exploding. Regular inspection of all electrical joints carrying heavy currents must be conducted with thermal cameras to avoid problems such as this. When buying equipment, accessibility of these joints to thermal inspection is critically important.

Enough said?

Terminal 5 London Heathrow

May 10th, 2008

Terminal 5 is Cool

For the record I used the new terminal 5 at Heathrow last week on my way to Los Angeles, and it was a pleasure. Compared to old Heathrow it was open, clean, organized and totally functional. I actually enjoyed the experience. Rosemarie loved the shops, there are lots to choose from. We used Baggage Handling and it all worked perfectly and much faster than Terminal 4 ever was. Keep it up BA, it is worth the effort.

How do I know if my Data Center is about to fall over?

May 9th, 2008

Emergency Off

We all worry about failures in our data centers, some types of failure more than others. Losing a server is bad, losing a storage array extremely bad! How then might we categorize a full power failure in a data centre? Catastrophic certainly, even careless perhaps?

I can hear the questions already:

How could I have a full power outage? I have resilience, I have spare power equipment configured to take over seamlessly if I lose a power stream!

Ha ha, it can’t happen to me, I thought about this and spent myself out of trouble, I bought a Tier 4 site at huge expense so that I can be certain that catastrophic power failure cannot occur. I had Engineers prove that we could survive any failure scenario.

Well here is the rub, if you have not taken extreme care when adding and removing each piece of equipment, your nice expensive data center is about as reliable as your home!

Data Centers usually fail because of people screwing them up.

Let’s take a real life example. We have two PDUs supporting a row of racks and, because we have not paid attention when adding one new server, one of the PDUs is running at 51% capacity and the other at 50%. We have a technical failure in one of the power streams feeding one of our PDUs and it shuts down. We have been smart and all of our equipment is dual attached to both PDUs, so everything should be OK. Unfortunately not. Because we now need to supply 101% of capacity through one PDU it will fail and the whole row of servers will shut down! Ouch, we might even get some smoke and a big bang!
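The check that would have caught this is trivial to automate. The sketch below assumes two identical PDUs, that every device is dual attached, and that the whole load of a failed PDU lands on the survivor; the function name is my own invention.

```python
def survives_single_pdu_failure(load_a_pct: float, load_b_pct: float,
                                capacity_pct: float = 100.0) -> bool:
    """True if either PDU could carry the combined load of both.

    Loads are percentages of one PDU's rated capacity. Assumes the
    full load of the failed unit transfers to the surviving unit.
    """
    return load_a_pct + load_b_pct <= capacity_pct


# The example from the text: 51% + 50% = 101% on the survivor.
print(survives_single_pdu_failure(51.0, 50.0))  # False
# Had the new server not been added carelessly:
print(survives_single_pdu_failure(50.0, 49.0))  # True
```

Running a check like this on every equipment move is cheap; finding out the hard way is not.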

The same thing can happen to UPS equipment - those of us in the UK might remember when Level 3 dropped their Goswell Road Data Center in 2006.

I have written about Reflector before with regard to how it can be used to manage data center moves and migrations but I wanted to tell you about a new feature that my friend, Martin Williams of Glasshouse Technologies has fitted to Reflector:

Real Time Power Stream Monitoring and Management. It is unique and absolutely brilliant - a fantastic use of the capabilities of modern data center M&E equipment to report on loadings and throughput. In my opinion, now that it is available and proven, we had better all start using it. I understand that there is a huge interest from the M&E equipment manufacturers who want to get integrated into Reflector.

Power Stream Criticality Diagram

The screenshot above shows part of what Martin demonstrated to me, a real time display of the status of every key piece of M&E plant in my data center. Real time because it was taking feeds from real live PDUs like the Mardix iPDU in the picture below, and other plant and pulling together my dashboard.

That in itself is just amazing but the add on bit was incredible. Martin showed me how he had built the software to do impact analysis. For example, the impact of a simulated failure, what if this UPS tripped out? How would the load reconfigure? Are there any devices that are at risk? What happens if two devices fail? Because it does it all with pictures it is completely idiot proof and gave me a better vision of one of my data centers in 5 minutes than if I had spent days with a calculator, spreadsheet and wiring diagrams.

How long did it take to put the data in to do the analysis? About 10 minutes and it was only as long as that because I had to go and pull the data out of a paper file. Staggeringly good.

With this technology in place it becomes simple to keep your tier 4 data center protected and safe in real time. Without it, it might be a good idea to keep an eye on the job adverts! Glasshouse are really showing thought leadership in this space, keep it up Martin.

Just how reliable is my Data Center?

April 30th, 2008

The consequences of data center failure can be pretty catastrophic for your business, so unsurprisingly, you will have very high expectations about availability and reliability. Unfortunately for many of us, our expectations will not be met by reality. Very often the system that is our data center facility is significantly less reliable than we might wish and than the level of resilience we have been told to expect.

The first thing we should look at is the general approach that Engineers have taken to building highly resilient components in our data centers, and then secondly how these components are linked together to create a resilient end to end system.

Resilience Approaches. Typically high availability is driven by providing additional redundant components or systems that can take over if a primary system fails. An example of this would be dual power cords on Servers with dual power supplies that can take over if a power supply or power feed fails.

Sometimes resilience is provided on an N+1, N+2 or N+N basis. In the case of N+1, we supply a single redundant unit to back up a set of on-load units. An example might be that we need three 1 MW Generator sets but we equip the site with four sets to deal with a situation where one won’t start or is in maintenance.

Application Resilience. By providing smart application code or underlying clustering software, simple applications can be made resilient and able to withstand an outage. Often DNS or smart network hardware can provide resilience for Web based applications that do not need to maintain application state between transfers.

Server Resilience. Often server equipment is equipped with multiple power supplies and multiple power cords (mostly two but sometimes more on large systems). These provide protection from a single power stream failing - if and only if - each power cord is connected to a separate power stream. If connected to a power stream that has a piece of shared infrastructure, the power system has a single point of failure for that server.

Sometimes servers have additional resilience capability including dual network connections with a form of IP multi-pathing that is oblivious to a single network outage. Almost all server disk subsystems are equipped with disks connected in a redundant way to provide resilience in case of a disk failure.

Cabinet Resilience. Cabinets in data centers are almost always fitted with dual power strips so that dual power fed servers can be easily connected to dual power feeds to maintain operation across a single power stream failure. Note that it is important that neither server power supply is loaded above 50% capacity, as a single power failure would take the remaining power supply above 100% and cause a cascade failure. Single power fed servers are superficially a cheap alternative but they can neither withstand a power failure nor enable planned maintenance to be carried out without being taken offline.

PDU Resilience. PDUs need to be configured so that at least two PDUs supply each cabinet. This means that properly connected servers in the cabinets can survive a single PDU failure. Equally, PDUs must never be loaded above 50% capacity, as this would cause a cascade failure if one unit fails, causing both loads to be supported by one unit.

UPS Resilience. Uninterruptible Power Supplies are usually connected in an N+1 or N+N configuration. Care must be taken to ensure that each unit is not loaded beyond the point where any unit taking on the additional loading caused by a UPS failure would suffer an overload and subsequent cascade failure. For example, three 500 kW units connected in N+1 could support a total load of 1 MW, or 333 kW each in normal operation and 500 kW each if one unit failed. To provide an N+N design with the same load would require four UPS units configured to support no more than 250 kW each.
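The UPS arithmetic above generalizes easily. This is a minimal sketch that assumes identical units sharing the load evenly; the function name and parameters are hypothetical.

```python
def max_normal_load_per_unit_kw(unit_rating_kw: float, total_units: int,
                                redundant_units: int) -> float:
    """Highest normal load per unit that still survives the loss of
    `redundant_units` units without overloading the survivors.

    Assumes identical units that share the load evenly.
    """
    survivors = total_units - redundant_units
    if survivors <= 0:
        raise ValueError("at least one unit must remain on load")
    # The survivors, each at full rating, bound the total supportable load.
    supportable_total_kw = unit_rating_kw * survivors
    return supportable_total_kw / total_units


# N+1: three 500 kW units, one redundant -> 1 MW total, about 333 kW each.
print(round(max_normal_load_per_unit_kw(500.0, 3, 1)))  # 333
# N+N: four 500 kW units, two redundant -> 1 MW total, 250 kW each.
print(max_normal_load_per_unit_kw(500.0, 4, 2))  # 250.0
```

The same function applies to the 50% rule for a pair of PDUs: two units with one redundant gives half the rating per unit.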

Generator Set

Engine Resilience. Engine design needs to follow the same rules, with excess capacity needed to provide resilience through failure as well as the capability to perform maintenance without taking our data center offline.

In the real world, with equipment being moved into and out of data center halls continuously, it is almost impossible to maintain the proper power balance across each of these components, and many sites drift into a situation where a cascade failure will occur when a single power component fails. Most sites recalculate every few weeks or months and then - in a panic - reconfigure and rebalance their power feeds when an imbalance is detected.

Why is the Cost of Electricity Growing?

April 27th, 2008

Electricity is a traded commodity with a Futures Market, exactly like oil, tea and pork bellies. The wholesale electricity price is driven by analysts’ perceptions of the relationship between supply (how much is readily available and at what cost) and demand (how much is required now and in the future).

The wholesale electricity price is one of the most volatile commodity indices today: 

  • It doubled between January 2003 and December 2005
  • Between 17th and 22nd November 2005 it increased by 16%
  • Fluctuations within a day are frequently as much as 2% to 5%
  • It is much more volatile than stocks and shares on the FTSE.

Analysts’ perceptions are based on their analysis of many factors that influence supply and demand. These factors include:

Changes to prices of related commodities such as oil, gas and uranium. The US Energy Information Administration (EIA) has a great website that tracks electricity prices and the drivers that are pushing them upwards. It is important to note that 80% of the electricity produced in the world comes from three fuels: coal, natural gas and uranium. Between the start of 2000 and the end of 2006, largely due to the increased demand from India and China, the cost of both uranium and natural gas multiplied by 7, whilst coal almost doubled in price.

Short and long term local weather forecasts. Baseline electricity consumption is driven by a number of factors and one of the most significant is the local temperature. High temperatures impact air conditioning and refrigeration usage, whilst low temperatures drive heating usage upwards. During a weekday usage fluctuates wildly, with peaks occurring at 16:30 when children come home from school, during commercial breaks in televised sporting events and popular television programs, etc.

The impact of international events such as natural disasters and international politics. We live in challenging and uncertain times. This means the relationship between these factors changes frequently, which in turn leads to volatility in the wholesale price of electricity. So on a daily basis prices can rise and fall by a considerable amount. Every serious analyst is predicting continuous increases in the price of electricity into the future.

Reducing the energy consumption of our data centers is a commercial imperative as energy costs are becoming the overriding factor influencing the Total Cost of Ownership of IT equipment.

Fire Detection and Suppression

April 26th, 2008

A large open room with hundreds of Kilowatts or even Megawatts of electrical energy being consumed and a set of powerful fans blowing air in sounds like an ideal place to make a great fire! Detecting fire quickly and extinguishing it before it takes hold is a critically important feature of all data centers.

Fire Detection

Early smoke detection is performed by VESDA which works by sucking air into a mesh of pipes connected into a highly sensitive smoke detector. The detector shines a laser light through the collected air and checks for smoke in the airflow by detecting the light scatter that this causes to the laser beam. The detector sends a signal off to a control system that typically sets off an audible alarm and sends an alert to a Building Management System for manual investigation.

Typically data centers have a secondary system that takes more extreme action which may include:

  • Setting off an audible fire alarm
  • Shutting down equipment in the room
  • Triggering the release of fire suppressant

The options for fire suppressant are varied and include:

  • Water sprinklers
  • Fine Water Mist
  • Argonite
  • Inergen
  • Halon Gas (now banned)

Water sprinklers cause damage to expensive computer equipment when discharged but are safe for humans and cheap to install and maintain.

Fine water mist uses the same principle as outside air conditioning, by dropping the temperature at the area of discharge to extinguish fire. By using ultra pure water, damage can be minimized.

Argonite is a very popular 50/50 mix of Nitrogen and Argon, two inert gasses that do not support combustion.

Inergen is also popular and similar to Argonite with the addition of Carbon Dioxide.

Halon is banned in most developed countries because it is a greenhouse gas and damages the ozone layer.

Asset Management is a Green Issue

April 24th, 2008

What do I have in my Data Centre?

During my time at BT I set the Data Centre team the challenge of getting Asset Management right. The team thought I was just being stupid and unreasonable, but buckled down and with a bit of hard work got it close to 100% right, with processes to keep it right. As a direct result of this effort, we were able to switch off 10% of BT’s UK Server estate and save $7M in electricity costs.

So how was it done? We got the assets right, got the rack layout right and then started focussing on who owned the servers and what they were doing. I took a very hard line approach and promised to switch off all servers that did not have an owner and a purpose. There was a lot of brinkmanship, change control items and angst but everyone got the message loud and clear that kit was being switched off and removed from the premises unless it was in active production.

The programme started well, with some good gains in the first few weeks, and then we had an outage for a set of development servers. I guess that there was an expectation that my team would get the blame but it was the owner of the servers who got into trouble. I had taken the trouble to engage our CIO and explain that some of his team were not updating support information for critical applications. He got right behind the initiative and supported us fully. After that it was pretty plain sailing. We just issued batches of change tickets to take out servers without proper owner data: either the data miraculously appeared or the servers got shut down.
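The selection rule behind those change tickets is easy to express once the asset data is in one place. The sketch below is purely illustrative: the record layout, field names and server names are invented and bear no relation to the tooling we actually used.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Server:
    name: str
    owner: str = ""    # hypothetical field: accountable person or team
    purpose: str = ""  # hypothetical field: what the server is used for


def shutdown_candidates(servers: Iterable[Server]) -> List[str]:
    """Names of servers lacking an owner or a purpose in the asset records."""
    return [s.name for s in servers
            if not s.owner.strip() or not s.purpose.strip()]


estate = [
    Server("web01", owner="ecommerce", purpose="production web tier"),
    Server("dev17", owner="", purpose="unknown test box"),
    Server("db03", owner="billing", purpose=""),
]
print(shutdown_candidates(estate))  # ['dev17', 'db03']
```

Each name on the list goes into a change ticket, which gives the real owner one last chance to come forward with the data.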

Proper Asset Management is critical to a well run IT Operations shop. Without the right data you can’t support the business’s applications, can’t manage the costs of providing IT and are always on the back foot.

 

Automatically Generated Rack Layout from Reflector

With proper Asset Management, rack planning, change control, incident management and financial controls are so much simpler to achieve. In the picture above Glasshouse Technologies integrated their Reflector product with Rackwise Rack Management.

Intelligent Cooling in Data Centres

April 23rd, 2008

The basis of intelligent cooling is that it allows us to deliver only the cooling that is required to meet our temperature requirements. By reducing the flow of air down to that which is actually needed (rather than just blowing as fast as we can) it is possible to reduce the energy demand of our cooling subsystems (conventional, free or fresh). We do this by varying the rate of flow or the temperature of input air or both to individual areas of a data centre or to a particular cabinet.

There are largely two approaches to determining what we need to do to drive intelligent cooling in data centres. The first is to measure the temperature of exhaust air in each cabinet; the second is to measure the input power per cabinet. The first involves retrofitting temperature sensors and cabling to bring the results back to a control system, the second involves leveraging intelligent PDUs (iPDUs) that measure and report on (via IP) the power draw from each cabinet and power strip.

During my time at BT I installed intelligent Power Distribution Systems (iPDUs) in the Rochdale and Cardiff sites. These provided us with data on rack level and facility power consumptions (in conjunction with the Supervisory Control And Data Acquisition (SCADA) system). Using output from iPDUs via the SCADA system allowed us to calculate the power consumption by room, rack, and aisle.

The diagram above shows an example of relatively low and relatively high power draw in different zones of the data centre halls. There are two aisles and the addition of power from all adjacent cabinets sums to provide a real time picture of the total energy delivered to the zone. As almost all of the input power is converted to heat we can also see zones of the data centre which are operating at higher or lower power draw. This introduces the potential of controlling our cooling airflow supply according to demand and in real time.
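Summing cabinet power into zones is straightforward once the readings are available. This is a minimal sketch that assumes each reading arrives as a (zone, cabinet, kW) tuple; that layout and the sample values are my own illustration, not the format of any particular iPDU or SCADA system.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Reading = Tuple[str, str, float]  # (zone, cabinet id, power draw in kW)


def power_by_zone(readings: Iterable[Reading]) -> Dict[str, float]:
    """Total power draw per zone, which approximates the heat load per zone."""
    totals: Dict[str, float] = defaultdict(float)
    for zone, _cabinet, kw in readings:
        totals[zone] += kw
    return dict(totals)


# Hypothetical snapshot of three cabinets across two aisles.
snapshot = [
    ("aisle-1", "A01", 3.0),
    ("aisle-1", "A02", 4.5),
    ("aisle-2", "B01", 1.5),
]
print(power_by_zone(snapshot))  # {'aisle-1': 7.5, 'aisle-2': 1.5}
```

Because almost all input power ends up as heat, these totals can stand in for the cooling demand of each zone.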

Active Airflow Control (AAC) is a new concept in data centre cooling. AAC can be used in a macro way using fans and remotely controlled vented plenum floor tiles - Room Scale Adaptive Cooling. It can also be used in servers in a micro way, with each individual server managing its own airflow by speeding up and slowing down internal fans. Intel have a very good movie showing AAC in action on a server.

Floor grilles can open or close dependent on the cooling demand of each zone of the data hall. This is linked to under-floor pressure sensors which, in conjunction with the power data provided by the SCADA system, allows automatic airflow control. This is linked to variable fan speed DFUs which, when the fan speed is reduced, will facilitate a huge reduction in energy consumption.
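The reason a modest drop in fan speed saves so much energy is the fan affinity law: for an idealized fan, power varies with the cube of speed. The sketch below applies that law only; real savings depend on the fan, motor and drive, so treat the result as an upper-bound estimate rather than a measurement.

```python
def fan_power_fraction(speed_fraction: float) -> float:
    """Idealized fan power as a fraction of full-speed power.

    Uses the fan affinity law (power proportional to speed cubed).
    Ignores motor and drive losses, which is an assumption.
    """
    if not 0.0 <= speed_fraction <= 1.0:
        raise ValueError("speed fraction must be between 0 and 1")
    return speed_fraction ** 3


# Running a fan at half speed needs, ideally, one eighth of the power.
print(fan_power_fraction(0.5))  # 0.125
```

So when the power data shows a zone needs only half the airflow, slowing the fans is far more rewarding than it first appears.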

DimDim

April 22nd, 2008

Just found a great new product on the web: DimDim, a freeware collaboration tool that looks and feels really cool. There are a number of critical things that DimDim have done to ensure success. The first is to make the product really lightweight: there are no downloaded Java modules, so it is much easier to make it work in a corporate firewalled environment. The second is that it is freeware for normal use, where DimDim roll out Beta code and take the RAD approach of failing fast. For critical users there is a stable hosted (either by DimDim or the consumer) environment that uses solid code that has been through the free beta consumer filter. A good and proven model that is win - win. The final winning strategy is that it is open source code.

DimDim certainly have some impressive financial muscle behind them, raising risk capital from Nexus India Capital (connected to Red Hat Linux), Index Ventures (invested in Skype and MySQL) and Draper Richards (invested in Hotmail and Skype).

Good luck guys, I hope it works for you.

Sustain IT, London 2008

April 22nd, 2008

Today I spoke at the Sustain IT, Green IT Conference in Central London. The topic I chose was Green IT through Discipline and Hygiene. My conference notes as well as two of the other presentations are available here on my .mac Public Folder. (I love the whole .mac Web experience, like most things Apple it is just so well thought out and implemented, drag, drop, publish - wow.)

One of the very interesting things I am picking up is an almost complete lack of grasp of some of the basic drivers of Data Centre energy usage. There is still a focus on power supply efficiency in servers (a diminishing return, as most vendors have now woken up to the bad press that inefficient power modules cause).

EU Director General Project Officer, Paolo Bertoldi, did get it and was very knowledgeable about the impact of maximum operating temperature on data centre efficiency. Later we had a discussion about the relative merits of DC Power and AC Power in the data center, an area that is still unclear in many minds.

Catriona McAlister from the Market Transformation Programme in DEFRA shared a great presentation on the UK Government’s approach to driving energy reduction targets, with lots of common sense and some great sites.

One of the neatest quotes I heard today was from Gary Hurd, the Technical Infrastructure Manager from the John Lewis Partnership. Apparently the John Lewis light-switch stickers say:

Switch me off, you are burning my bonus

A really cool idea that appeals to the joint drivers of greed and green - brilliant.  Green is Good!

I also heard John Suffolk, CIO HMG, tell us all about what he is doing to drive a green agenda. Like me, John is hands on: let’s just get it done, let’s not burn all the time measuring when there is so much low-hanging fruit. A refreshing approach, and as a UK taxpayer this is all good news. Just do it much, much faster John, and give me my money back please. :-)

Damian Schmidt, CEO of Strato (Germany), gave a great presentation on what his firm have been doing to implement green technology in a huge-scale managed hosting environment. Damian made a very telling point: co-location and green objectives just don’t fit together well. Co-location customers want to place their own hardware in a data centre environment and pay for the space, power and cooling capacity, but supply and manage the hardware themselves. Green approaches always need us to pay attention to detail, to standardize, to plan and manage airflow, to optimize workloads. Co-location just gets in the way.

Dr. Mario Tobias of the BITKOM Executive Board was brilliant. He showed a video of his stand at the Hanover Fair, where they implemented green concepts and showed a DRAMATIC 75% cut in energy costs over the period of the show. This is a real hearts and minds approach, where BITKOM appeal to the pocketbook and the conscience together.

Steve Pickett, Head of IT at Rothschild, was part of a panel session and shared with us his approach to getting things done. A real professional with 9 years as CIO at Rothschild, having come from the Oil & Gas and FMCG sectors. The lesson I took away is that good behavior takes lots of effort to ingrain (Steve used the example of USAF pilots, who need 35 training cycles to bang the message home) and just a few slips (5) to destroy the habit. A key message is that understanding psychology is key to success, and the approach needed to start good behavior may well be different from the approach needed to make it continuous. We all need to be continuous salespeople.


Leadership Top Tips

April 19th, 2008

My little brother Laurie O’Donnell, who is a top man in the Education world (winner of the George Lucas Education Foundation Global Six), was asked to speak to a group recently on the subject of leadership and to identify some top tips for aspiring leaders. Here is what he came up with:

  1. Be a lifelong learner - as soon as you think you know all the answers you can be sure you don’t. Read as much as you can and as widely as you can (not just ‘how to’ stuff but also the classics of literature because they are timeless and deeply rooted in the human condition).
  2. Know what you stand for - what do you believe, what are the values that underpin your work and why are you doing the job in the first place?
  3. Recruit well - never employ someone if you are not sure about them. The best employees don’t need to be motivated, your job as a leader/manager is to make sure you don’t demotivate them (Jim Collins in ‘Good to Great’ talks about getting the right people on the bus and in the right seats on the bus).
  4. Embrace complexity, ambiguity and change - that’s the way the world is so you might as well get used to it, deal with it and where possible manage, control and direct it.
  5. Be as strategic as you possibly can - try to see the big picture, look to the long-term and try to take people with you (taking time to win hearts and minds pays dividends in the long run).
  6. Know your role and what counts as success in your role - if you are a leader then you can’t spend all of your time on management tasks, any more than a shop assistant can spend all the time stocking the shelves and sweeping the floor. They need to serve customers; you need to provide leadership.
  7. Learn how to prioritise - everything you do is important because you are spending your valuable time doing it (and, unless you are self-employed, somebody else’s money, e.g. the taxpayer’s). You need to be able to quickly identify exactly where you need to spend your time and try to deal with tasks before they become urgent, i.e. be proactive rather than reactive (Stephen Covey of the 7 Habits writes well on this). You need to develop a light touch. Land on the task that needs your attention, give it exactly the right amount of attention and then move on to the next task without carrying any baggage.
  8. Try to make your work serious fun but don’t take yourself too seriously - listen a lot and try to smile (even laugh) more.
  9. Be as optimistic as you can be given the context you find yourself in - back to Jim Collins and what he calls the Stockdale Paradox: never lose the belief that you will in the end be successful, but make sure you also confront the ‘brutal facts’ of your current situation and deal with them. Martin Seligman’s ‘Learned Optimism’ is also important here: blind optimism is not a rational position, and you sometimes need to be more pessimistic.

We obviously come from the same gene pool. I could not have written a better list myself.

Why do IT Operations suck?

April 19th, 2008

One of the strangest things about IT people is that they often miss the obvious until it is explained to them (sometimes with the aid of a hammer). We are all so used to following the rules and doing what everyone else does. This particularly applies to IT Operations, also known as the Command Centre or the Operations Bridge. The traditional Command Centre has a very simple function, that is to “log and flog”, to watch for traps coming from instrumentation on the managed estate and log them into the Incident Management workflow system.

 

It is done like this because highly skilled technicians hate that kind of work and are more expensive than typical operators. The skilled guys take the incidents on the workflow system and deal with them in order, fixing a piece of storage, restoring a server, fixing a network connection. Typically the work is organized into first, second and third line support to protect the really good (third level) guys from having to fix systems. The same good guys who design the systems that break! (Is there a lesson here?)

The operator or command centre job reads like a particularly low-skilled occupation, low status and low value. Guess what, it is, and more and more companies are outsourcing or off-shoring this activity. Actually, I believe that it is just wrong-headed to do operations like this, because running a command centre that way is too slow and too expensive with complex distributed systems.

You see, the command centre or operations bridge concept was designed in the days of the mainframe: lots of batch processing, and batch jobs to fix because the input data was wrong. Lots of simple things that operators could do at a low cost and 24×7. When the problem got too hard, it got escalated via the incident management workflow to the smarter guys. That worked just fine. Today systems are so complex and convoluted that this approach breaks down and is slow and cumbersome. If we look at trouble to resolve cycle times we find that 85% of the time is taken up in handoffs as the incident is passed around resolver groups looking for a home.
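
To make the handoff figure concrete, here is a minimal Python sketch of how that share could be worked out from an incident's history. The log format, the group names and the timings are all invented for illustration and do not come from any real workflow system; the numbers are simply chosen so that the result matches the 85% quoted above.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One stretch of an incident's life (hypothetical log format)."""
    group: str        # resolver group holding the ticket
    minutes: float    # how long the segment lasted
    hands_on: bool    # True if someone was actually working the fault


def handoff_share(segments):
    """Return the fraction of total cycle time that was not hands-on work."""
    total = sum(s.minutes for s in segments)
    if total == 0:
        return 0.0
    waiting = sum(s.minutes for s in segments if not s.hands_on)
    return waiting / total


# Illustrative incident: logged, bounced between groups, then fixed.
incident = [
    Segment("command-centre", 30, hands_on=False),  # logged, waiting in queue
    Segment("network", 60, hands_on=False),         # wrong group, reassigned
    Segment("storage", 80, hands_on=False),         # wrong group, reassigned
    Segment("server", 30, hands_on=True),           # the actual fix
]

share = handoff_share(incident)
print(f"Time lost to handoffs: {share:.0%}")  # 170 of 200 minutes
```

In this made-up case the fix itself takes 30 minutes, yet the customer waits 200. Removing the handoffs, rather than speeding up the fix, is where the cycle time is to be won.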

Resolver groups are also a leftover from the past, when systems were simple and incidents could generally be resolved within a technical silo, because it was a mainframe issue or a network issue and it was obvious what the problem was. Today, determining the problem and allocating an incident to the right resolver group on the first attempt is a matter of chance. I have seen an average of three to five handoffs per incident as the norm in today’s IT Operations.

The other thing that 20th Century Operations misses out is the Customer. Read any marketing book from the last 50 years and the first rule is to put the customer at the centre of your business; that is how high growth, successful companies work. We must ask ourselves what the customer wants from IT Operations, and funnily enough it is very simple: they want IT Operations to protect and recover the services that we manage for them. So what do we do? Typically we don’t even understand the end to end service that the customer has been sold. We focus on fixing boxes and recovering networks without any clear understanding of how the customer is affected.

If you believe what I am saying, then it is clear that 20th Century IT Operations is fatally flawed and needs a complete rethink. Multiple levels of support don’t work, log and flog adds no value, and where is the Customer?

BT Sheffield CEMC

So this is where Customer Experience Management comes in. CEM is where IT Operations throws out the rule book and starts again. Starts again with a total focus on Customer Experience and Service. OK, so it sounds a bit trite and Business School speak, but I promise you it works. The boring picture above is of my old boss Al-noor Ramji, the CIO of BT, visiting the Customer Experience Management Centre in Sheffield, UK. He came to see what had happened to radically change incident handling, customer satisfaction and trouble to resolve cycle times. He came to see how BT fixed its Broadband Lead to Provision service, which led directly to BT’s massive growth in broadband subscribers. He came to see how we put customers at the centre of IT Operations.

So what is Customer Experience Management all about and how does one implement CEM in a real business? Actually it is quite simple:

  • First we work out what services we want to support.
  • Then we create tube maps of each end to end service (as experienced by the customer)
  • Then we appoint a Service Operations Manager to own Service Protection and Recovery
  • Then we organize around service:
      1. By up-skilling the operators to understand the service
      2. By ensuring that every part of our business that is involved in delivering the service is integrated
      3. By co-locating the people with the right level of skill to fix service in the same room
  • Then we build instrumentation that tells us if the customer experience of the service is good
  • We put tools exploitation engineers in the CEMC to reduce Problem Management cycle time
  • We also need to drive standardization of the IT Infrastructure (less urgent)
  • We need to drive autonomics to improve compliance, build standards and cycle time

By taking this approach we kill off the problems of 20th Century IT Operations. We reduce trouble to resolve cycle times by co-locating the necessary skills in one place and taking out the handoff time (85% reduction). We focus on service by organizing around key services and mapping those services using instrumentation working along the tube maps. We delight customers by caring about the same things they do: Service Protection and Service Recovery. Service Recovery works better because initial triage, where we work out where to focus our attention, is enabled by understanding which issues are actually impacting customer experience.
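
The tube map and triage ideas can be sketched in a few lines of Python. This is an illustration only: the service names, the component names and the shape of the maps are hypothetical and do not describe BT's actual tooling. Each service is written down as the chain of components a customer's request passes through, and incoming alarms are ordered by how many customer services they touch.

```python
# Hypothetical tube maps: each end to end service is the ordered list of
# components that a customer request travels through.
TUBE_MAPS = {
    "broadband-ordering": ["web-portal", "order-db", "provisioning", "billing"],
    "fault-reporting": ["web-portal", "ticket-db"],
    "email": ["mail-gateway", "mail-store"],
}


def services_affected(component, tube_maps=TUBE_MAPS):
    """Return the sorted names of services whose map includes the component."""
    return sorted(name for name, stops in tube_maps.items() if component in stops)


def triage(alarms, tube_maps=TUBE_MAPS):
    """Order alarmed components by the number of customer services they touch."""
    impact = {c: services_affected(c, tube_maps) for c in alarms}
    # Biggest customer impact first; ties are broken by component name.
    return sorted(impact.items(), key=lambda item: (-len(item[1]), item[0]))


for component, services in triage(["mail-store", "web-portal", "backup-server"]):
    print(component, "->", services or "no mapped customer service")
```

With these made-up maps the web portal alarm comes first because it sits on two services, while the backup server alarm drops to the bottom because no customer journey runs through it. A real implementation would also need live instrumentation along each map, which this sketch leaves out.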
