Wireless meltdowns Thursday – shoulda Splunked it!

Nearly everyone at Splunk fell victim to a series of wireless meltdowns yesterday evening – across three different carriers. Cingular was down for 4 hours in the San Francisco Bay Area due to a “software glitch.” Verizon and T-Mobile BlackBerrys were delivering email 6 to 12 hours late.

(The local CBS station picked up on Cingular’s outage. In the humor department, their ad server was showing a Cingular Wireless ad below the story when I looked this morning.)

This is *exactly* the reason smart operations and development teams are picking up Splunk. Why does a software glitch leave a major wireless carrier offline for 4 hours? It’s a guess, but a pretty safe one: sweating sysadmins were copying and emailing logfiles and configurations and running diagnostic commands on hundreds of servers, while the impatient developers who could actually debug things waited for the data to trickle in.

I bet those developers would have found the problem a lot faster if they had real-time search access to all of the production data.

Does anyone have information about this that’s more than a guess? I would love to hear from you in the comments.
