There are quite a few famous names in the history of information technology, from Charles Babbage and Ada Lovelace to Bill Gates, Linus Torvalds and Robert Matthew Van Winkle. If that last name sounds unfamiliar, you may know him better as Vanilla Ice. Not only did his semi-autobiographical ballad “Ice Ice Baby” forever change the world of music, but it also contained a timeless and vital message for IT departments everywhere: “Stop. Collaborate and listen.”
Let’s break it down, shall we?
Stop fighting in the war room! War rooms won’t always be necessary, thanks to advances like artificial intelligence for IT operations (AIOps), which can predict and help prevent the issues that lead to them. For now, they’re still a reality, but they don’t have to be contentious. With the average critical application failure costing approximately $500,000 to $1 million per hour, the war room can be a highly emotional place of finger-pointing and defensiveness. People try to justify their actions and deflect blame when they should be working together to reduce Mean Time to Resolution (MTTR). Once you’ve stopped the fight, the next step is to collaborate.
Collaboration lowers lag times by letting teams find and remediate the problem together. More complex alerts may present as problems with the application when in reality the cause may be the network. (“It’s always the network.”)
Unless you have a network operations center (NOC) commander driving alerts out, they just sit in the alert monitoring tool. Who’s on call? What department? What email alias should you use if you get an out-of-office message? It’s amazing how many companies rely on a list of phone numbers kept in a spreadsheet.
All of these roadblocks can drive up your Mean Time to Acknowledge (MTTA), which can be just as important as MTTR. An AIOps-enabled system lets on-call teams find and fix problems faster with automated incident management routing, collaboration and reviews.
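To make the two metrics concrete, here is a minimal Python sketch of how MTTA and MTTR might be computed from incident timestamps. The records, the field names (`triggered`, `acknowledged`, `resolved`) and the helper functions are all hypothetical, and the sketch assumes both clocks start when the alert fires; some teams start the MTTR clock at acknowledgment instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: when each alert fired, when a human
# acknowledged it, and when the underlying problem was resolved.
INCIDENTS = [
    {"triggered": datetime(2024, 1, 1, 9, 0),
     "acknowledged": datetime(2024, 1, 1, 9, 2),
     "resolved": datetime(2024, 1, 1, 9, 30)},
    {"triggered": datetime(2024, 1, 2, 14, 0),
     "acknowledged": datetime(2024, 1, 2, 14, 4),
     "resolved": datetime(2024, 1, 2, 15, 0)},
    {"triggered": datetime(2024, 1, 3, 22, 0),
     "acknowledged": datetime(2024, 1, 3, 22, 6),
     "resolved": datetime(2024, 1, 3, 23, 30)},
]


def mean_minutes(incidents, start_key, end_key):
    """Average elapsed minutes between two timestamps across all incidents."""
    return mean(
        (incident[end_key] - incident[start_key]).total_seconds() / 60
        for incident in incidents
    )


def mtta_minutes(incidents):
    """Mean Time to Acknowledge: alert fired -> someone says "I'm on it"."""
    return mean_minutes(incidents, "triggered", "acknowledged")


def mttr_minutes(incidents):
    """Mean Time to Resolution: alert fired -> problem resolved (assumed definition)."""
    return mean_minutes(incidents, "triggered", "resolved")


if __name__ == "__main__":
    print(f"MTTA: {mtta_minutes(INCIDENTS):.1f} min")
    print(f"MTTR: {mttr_minutes(INCIDENTS):.1f} min")
```

With these sample records, acknowledgment takes 2, 4 and 6 minutes and resolution takes 30, 60 and 90 minutes, so every minute an alert sits unacknowledged is a minute added to the total outage.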
PSCU, America’s premier payments credit union service organization, selected an AIOps-enabled system built on Splunk Enterprise and VictorOps. The result was a reduction in MTTA from four hours down to two minutes, taking them from “Nah, not my problem” to “I’m on it” in a fraction of the time it had previously taken. Or as Van Winkle would put it, “If there was a problem, yo, I’ll solve it.”
Van Winkle boasted of “cooking MCs like a pound of bacon.” While this may well be an appropriate reaction when encountering a sucker MC, in the aftermath of an outage, it’s vital to listen to your team. The IT labor market is tight. Keeping good people is important. Bad war rooms cause people to leave and look for kinder, gentler war rooms. Technologies that focus on collaboration foster a better experience, just as mobile technology improves the on-call experience.
Conducting post-incident reviews gives you an opportunity to calmly review the facts—like outage details and interactions between people and teams—to better assess how incidents happen, and, even more important, how they’re resolved.
Anything less than the best is a felony
An incident response system designed for developers, DevOps and operations teams can help you reduce outage time and add confidence to your high-speed DevOps delivery and operations. VictorOps takes alerts from your monitoring tools and applies on-call schedules and rules to engage the right teams and people so you can start resolving problems faster.
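The general idea of applying on-call schedules and rules to an incoming alert can be sketched in a few lines of Python. This is only an illustration of the concept, not VictorOps’ actual API or data model: the `Shift` class, the routing table, the team names and the engineers are all made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Shift:
    """One on-call shift (hypothetical model). `end` is exclusive."""
    team: str
    engineer: str
    start: time
    end: time

    def covers(self, moment: time) -> bool:
        if self.start <= self.end:
            return self.start <= moment < self.end
        # Shift wraps past midnight, e.g. 20:00 -> 08:00.
        return moment >= self.start or moment < self.end


# Hypothetical routing rules: alert type -> owning team.
ROUTING_RULES = {"network": "netops", "application": "appdev"}
DEFAULT_TEAM = "noc"  # catch-all for alert types without a rule

# Hypothetical on-call schedule; each team has round-the-clock coverage.
SCHEDULE = [
    Shift("netops", "alice", time(8, 0), time(20, 0)),
    Shift("netops", "bob", time(20, 0), time(8, 0)),
    Shift("appdev", "carol", time(0, 0), time(12, 0)),
    Shift("appdev", "dave", time(12, 0), time(0, 0)),
    Shift("noc", "erin", time(0, 0), time(12, 0)),
    Shift("noc", "frank", time(12, 0), time(0, 0)),
]


def route_alert(alert_type: str, fired_at: datetime) -> Shift:
    """Pick the owning team by rule, then whoever is on call right now."""
    team = ROUTING_RULES.get(alert_type, DEFAULT_TEAM)
    for shift in SCHEDULE:
        if shift.team == team and shift.covers(fired_at.time()):
            return shift
    raise LookupError(f"nobody on call for team {team!r} at {fired_at}")


if __name__ == "__main__":
    shift = route_alert("network", datetime(2024, 1, 1, 3, 30))
    print(f"Paging {shift.engineer} ({shift.team})")
```

Encoding the schedule as data rather than as a spreadsheet of phone numbers is what lets the lookup happen in milliseconds instead of waiting for someone to work out who is on call.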
Once your team is in a “firefight,” VictorOps makes collaborating easier and faster by engaging the right experts and teams over a native mobile app or web interface. Analytics enable your team to provide better retrospectives so you can continuously improve your team’s incident response. Collaboration and analytics drive shorter outages, fewer wasted resources, better use of your team’s “tribal knowledge” and a more empowering, collaborative and enjoyable on-call experience for your team.
You might even say that it helps you keep your composure when it’s time to get loose.