$1 billion market cap loss due to service problems. Ouch.

This one’s even worse than taking eBay’s market cap down $1 billion yesterday.
Why do outages last this long? Because it’s too hard to find out where the problem happened.

At 10 p.m. last night, about a full day after the problem started, Skype finally posted that the issue was a problem in their networking code. In the meantime, rumours flew around that they’d been hacked. I bet it took Skype that full day to find that the problem was in the networking code. Why? Because if Skype is anything like every other big IT operation I talk to, dozens of admins spent the day writing and running slow one-off scripts, testing hypotheses against log data, configurations, code, scripts and the like scattered across the thousands of servers that sit behind a service of this scale.

If you work at Skype and I’m wrong about that, please let me know. But I might not believe you.

So how should IT shops avoid this?

  • Capture all logs, configuration changes, script changes and source code revisions in real time in a central place
  • Index it all so you can search it fast
  • Put it behind a useful web interface that makes it easy for everyone in IT to navigate the data
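
To make the first two steps concrete, here is a minimal sketch in Python. It is a toy, not how Splunk or any real product works: the `EventIndex` class, its `capture` and `search` methods, and the sample events are all invented for illustration. A real system would also need to ship events over the network, persist them and cope with far more volume.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class EventIndex:
    """Hypothetical central store: keeps every event plus an inverted index."""

    def __init__(self):
        self.events = []               # (timestamp, host, source, text)
        self.index = defaultdict(set)  # word -> positions in self.events

    def capture(self, timestamp, host, source, text):
        """Store one event and index the words in its host, source and text."""
        position = len(self.events)
        self.events.append((timestamp, host, source, text))
        for word in tokenize(f"{host} {source} {text}"):
            self.index[word].add(position)

    def search(self, query):
        """Return events containing every word in the query, oldest first."""
        words = tokenize(query)
        if not words:
            return []
        hits = set.intersection(*(self.index.get(w, set()) for w in words))
        return [self.events[i] for i in sorted(hits)]


# Made-up events standing in for logs and config changes from many servers.
store = EventIndex()
store.capture("10:02:11", "node17", "syslog", "login failed: connection timeout")
store.capture("10:02:15", "node42", "deploy", "config change: retry limit set to 0")
store.capture("10:03:01", "node42", "syslog", "login failed: connection timeout")

for event in store.search("node42 timeout"):
    print(event)
```

Because every word is indexed as the event arrives, a question like "what happened on node42 involving timeouts?" becomes a set intersection instead of a one-off script run against thousands of machines.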

Funny, sounds a lot like Splunk, right?

