Complexity and failures in the NYT

I’ve been posting occasionally whenever there’s a huge meltdown of a big service, like the two recent BlackBerry outages. My point is usually that these systems are so complex that their failure modes are unpredictable and hard to track down, which is why PR people are still sputtering days after a big outage while sysadmins frantically dig through logs, configs and system metrics all over the place.

Anyway, looks like the NYT picked up on the same idea. It’s a good article that cites recent outages at United and Skype and ties them into the larger problem of increasing system complexity.

It quotes Andreas Antonopoulos, who’s been one of the analysts who really understands why IT Search is necessary in the face of increasing chaos and change in the datacenter. Here’s a video clip of him talking about this.

