Recently I gave a talk at the BT annual technology gathering. The setting was a really beautiful estate called The Grove, just north of London in Hertfordshire, England. A couple hundred of BT’s smartest technology managers were in attendance and I was supposed to think of something to hold their interest for an hour. I got to thinking about all the technology and infrastructure BT must have, and how in the world they manage it. I started gathering data. With internal growth, new projects like BT’s 21st Century Network, and acquisitions over the past decade through BT Global Services outsourcing contracts, the company has a lot of IT infrastructure.
I also spent a few hours with some of BT’s brightest architects, who are working to virtualize every layer of their infrastructure: network, storage, database, application, web servers, VoIP, collaboration, ordering, billing, provisioning, monitoring and so on. What’s their biggest problem, I asked. Resoundingly, it was “our customers are still often the ones that tell us stuff is broken.” This was so reminiscent of my time at places like Yahoo!, where we’d have these 7×24 war rooms during key outages and daily conference calls with 30-40 people on the line, all emailing logs and configurations to each other.
As our IT infrastructures become incredibly complex, dynamic, service-oriented, virtualized and mission-critical, we’re confronted with a battle raging in our data centers. And it appears the machines are winning and the humans are losing.
Our biggest problem is figuring out: did something go wrong? Why? Where does truth lie? According to market researcher IDC, more than $140B was spent in 2007 managing the world’s data centers. IT OPEX is growing at 2.5 times the rate of hardware spend, and one-third to one-half of TCO is spent recovering from problems. The cost of availability now dwarfs the purchase and maintenance cost of technology.
So what have we as an IT industry done to address the problem?
We’ve created concepts like ITIL and CMDBs. While there are some good process improvements here for sure, these top-down modeling approaches and pre-determined rules only tell us what we already know. In my experience it is not the things we already know about that bite us in the ass and take our systems down for prolonged periods of time. It’s the multitude of unanticipated and unavoidable dependencies and interactions that take place in a complex system. And it’s impossible to know what set of dependencies and interactions will cause downtime until it occurs. Our infrastructures are just too indeterminate. That’s the point, after all. Tier it, load balance it, virtualize it, so we don’t have to worry about the dependencies and interactions among all the different components. Well, guess what? We do have to care. Because we have to fix it when it goes wrong.
Take the analogy of a complex air traffic control system. Sure, the air traffic controllers feel really great when they arrive at work in the morning. They’ve got their coffee, their flight plans and a good handle on the early morning inbound and outbound traffic.
Then the day gets a bit more challenging. Weather conditions over Chicago back up landings at O’Hare. A baggage handler and mechanic strike slows down JFK departures. A pilot radios that he’s three degrees north over Pennsylvania, but where is he really? Now you need radar. Throw the flight plans out the window. You need to know what’s actually happening now.
So how do we establish the equivalent of radar for a complex IT infrastructure? Component monitoring doesn’t work anymore. If the problem is a single component failure, we already know about it. We’ve already automated the swapping in of a new machine or device. And we can reboot software components automatically. IBM has its own marketing play on this called “Autonomic Computing,” but that too seems to focus only on the simple single-component issues, not the indeterminate chaos that ensues in a real running system. And it seems like more slideware than real solutions.
In my next post I’ll tackle the issue of how we might look at things differently.