This one’s even worse than taking eBay’s market cap down $1 billion yesterday.
Why do outages last this long? Because it’s too hard to find out where the problem happened.
Skype finally posted at 10 p.m. last night that the issue was a problem in their networking code. That was about a full day after the problem started, and rumours that they’d been hacked flew around in the meantime. I bet it took Skype that full day to find that the problem was with the networking code. Why? Because if Skype is anything like every other big IT operation I talk to, dozens of admins spent the day writing and running slow one-off scripts, testing hypotheses against the log data, configurations, code and scripts scattered across the thousands of servers that must sit behind a service of this scale.
If you work at Skype and I’m wrong about that, please let me know. But I might not believe you.
So how should IT shops avoid this?
- Capture all logs, configuration changes, script changes and source code revisions in real time in a central place
- Index it all so you can search it fast
- Put it behind a useful web interface that makes it easy for everyone in IT to navigate the data
Funny, sounds a lot like Splunk, right?
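To make the first two steps in the list above concrete, here is a minimal sketch in Python. It is a toy, not how Splunk or any real product works: the `Event` and `CentralStore` names, the fields and the sample hosts are all made up for illustration, and everything lives in memory on one machine. The idea is only that every event lands in one place as it arrives and gets indexed by word on the way in, so a search is a lookup instead of a one-off script crawling thousands of servers.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One captured item. All field names here are hypothetical."""
    timestamp: float  # seconds since the epoch
    host: str         # server the event came from
    kind: str         # e.g. "log", "config", "script", "source"
    text: str         # raw line or change description


class CentralStore:
    """Toy in-memory stand-in for a central, indexed event store."""

    def __init__(self):
        self.events = []
        # Inverted index: token -> positions of the events containing it.
        self.index = defaultdict(set)

    def capture(self, event):
        """Store an event and index it immediately."""
        pos = len(self.events)
        self.events.append(event)
        for token in set(re.findall(r"[a-z0-9_]+", event.text.lower())):
            self.index[token].add(pos)
        # Also index metadata so searches can filter on host or kind.
        self.index[f"host={event.host.lower()}"].add(pos)
        self.index[f"kind={event.kind.lower()}"].add(pos)

    def search(self, *terms):
        """Return events matching ALL terms, oldest first."""
        if not terms:
            return []
        postings = [self.index.get(term.lower(), set()) for term in terms]
        hits = set.intersection(*postings)
        return sorted((self.events[i] for i in hits), key=lambda e: e.timestamp)


# Made-up sample data, purely for illustration.
store = CentralStore()
store.capture(Event(100.0, "web01", "log", "login failed: network timeout"))
store.capture(Event(90.0, "web02", "config", "changed network retry limit"))
store.capture(Event(110.0, "web01", "log", "disk usage normal"))

for event in store.search("network"):
    print(event.timestamp, event.host, event.kind, event.text)
```

Searching for `"network"` here pulls up the config change on one host alongside the later failure on another, which is exactly the kind of cross-server connection that is painful to make by hand. The third step, a web interface, is left out; it would sit on top of `search`.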