May 09, 2019

No Fighting in the War Room: Two Steps to Reduce Mean Time to Resolution with Modern Monitoring

By George Khoury

You’re at dinner with friends and messages start coming in: your favorite football team is about to win its first game in a very long time. You jump on the mobile app to watch the last five minutes live. Then you get that dreaded error message …

“Service is unavailable, please try again”

It’s just as frustrating on the other end of that message. The call to war has sounded and the generals start assembling. The team meets in the war room, be it physical or virtual, and the clock is ticking.

The service owner is screaming that users can't connect …

The application team can see the service is up and users are connected …

The network dashboards are green …

The infrastructure team says that their services are green, and it's the same message as you go through each team. Just like a watermelon — all green on the outside, but red on the inside.

Businesses have two ways to make money: sell more goods or reduce costs. An incident involving a critical application will impact both of these. IDC reports the average cost of critical application failure is $500,000 to $1 million per hour (DevOps and the Cost of Downtime: Fortune 1000 Best Practice Metrics Quantified, Stephen Elliot, 2014). What should be top of mind is the mean time to resolution (MTTR). What happens time and time again is that each team chases its own "mean time to innocence" instead, working to prove that the fault lies somewhere else.
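To make the two numbers above concrete, here is a minimal sketch of how MTTR and a rough downtime cost range could be worked out. The incident timestamps are made up for illustration, the function names are hypothetical, and the hourly rates are simply the IDC range quoted above rather than figures for any particular business.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


def downtime_cost_range(duration, low_per_hour=500_000, high_per_hour=1_000_000):
    """Rough cost range for an outage, using the IDC per-hour range as defaults."""
    hours = duration.total_seconds() / 3600
    return hours * low_per_hour, hours * high_per_hour


# Hypothetical incident log: (time detected, time resolved)
incidents = [
    (datetime(2019, 5, 1, 20, 0), datetime(2019, 5, 1, 21, 30)),  # 1.5 hours
    (datetime(2019, 5, 3, 9, 0), datetime(2019, 5, 3, 9, 30)),    # 0.5 hours
]

mttr = mean_time_to_resolution(incidents)
low, high = downtime_cost_range(mttr)
print(f"MTTR: {mttr}, estimated cost per incident: ${low:,.0f} to ${high:,.0f}")
```

Every hour shaved off the average resolution time moves both ends of that range, which is why the time spent proving innocence is so expensive.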

So how did we get here?

War rooms originated in the military, and many processes for handling an incident were established many years ago. The challenges we’re experiencing in many organizations are also visible in the military, which has addressed this through a concept called multi-domain battle (MDB). MDB utilizes five domains in a joint coalition effort: air, sea, land, space and cyber. For MDB to work, the military had to do away with domain hogging.

For the non-military folk like me, domain hogging is when whoever owns a domain is expected to handle a crisis there alone. If a crisis occurs on land, the army is considered the owner of that domain and is expected to respond. If a crisis occurs at sea, the navy is viewed as owning that domain, so a ship or sub-surface solution is applied.

Imagine that the army shot down a missile that was fired from a plane, and the plane had been launched from a ship. That single event spans land, air and sea, so no one domain can own it. The military resolved domain hogging through collaboration and visibility. This is similar to the challenge faced by many organizations. If the military can get multiple domains and countries to work together, imagine applying this to IT war rooms.

Reducing mean time to resolution with modern monitoring

A modern IT operations platform enables IT organizations to apply the army’s successful MDB approach. Modern IT operations have four key components that deliver collaboration and visibility:

Investigate: Metrics and logs
Monitor: Visibility into services, apps, containers, IT infrastructure and networks
Analyze: Service insights and event analytics
Act: Collaboration, AIOps, prediction and orchestration

Collaboration

The first step is to break down silos to support collaboration and learning, enabling teams to work across organizational structures, gleaning the information necessary for efficient solutions.

"Alone we can do so little; together we can do so much." – Helen Keller

One of the best ways to enable collaboration is to ensure everyone is working from the same data but is able to view it in their own context. Imagine allowing the network team to access the same data as the application team. Both teams look at the data from their perspective to identify whether an incident is a performance, access or customer experience issue. Perhaps most important, this instantly ensures everyone is on the same page, saving the time and stress of worrying about whose data is right. You’ve successfully replaced stress with data and can work in parallel to find and solve the issue, making your users, your customers and your bottom line happy.

Looking back at the football war room incident

Imagine that a company is providing video streaming for a major sporting event. Five minutes before the end of the game, users start complaining that they can’t connect. The application monitoring team doesn’t see any issues: its servers, CPU, memory and network all look healthy, and the database connection is working and processing requests. Individually, every component looked fine. In this case, the firewall saw an abnormal increase in traffic in the last five minutes of the game, mistook it for a DDoS attack and started to block new connections. With each team looking only at its own component, this took days to troubleshoot, with significant impact to the brand. Viewed as an end-to-end service, it was clear that new connections were being blocked at the firewall.

“Application monitoring is not user availability monitoring”

Prior to the event, the company was using a traditional application monitoring solution that was unable to provide end-to-end availability monitoring of the application from the user's perspective. This is a case of silo monitoring, where each team monitors only its own component. Unfortunately, this approach gives a false sense of availability and limits the ability to collaborate when an issue occurs. With a modern monitoring platform, the application team would have visibility into the firewall and security data and would be able to see the error relating to the application sooner. They could reduce the MTTR by gaining insight into the entire chain and performing a collaborative investigation to pinpoint the issue quickly. The team at Skyscanner, for example, was able to improve its reaction time in both detecting and resolving performance issues, helping keep its services available for its users.
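The difference between silo monitoring and end-to-end monitoring in the streaming scenario can be sketched in a few lines. This is an illustrative sketch only: the `HopSample` class, the `first_failing_hop` function, the 95% threshold and all of the numbers are assumptions made up for this example, not part of any monitoring product's API.

```python
from dataclasses import dataclass


@dataclass
class HopSample:
    name: str
    team_status: str         # what the owning team's own dashboard shows
    new_conn_attempted: int  # new user connections arriving at this hop
    new_conn_accepted: int   # new user connections this hop let through


def first_failing_hop(chain, min_accept_ratio=0.95):
    """Walk the service chain in user-request order and return the first hop
    that accepts too few new connections, or None if every hop is healthy."""
    for hop in chain:
        if hop.new_conn_attempted == 0:
            continue  # no traffic reached this hop, so nothing to judge
        if hop.new_conn_accepted / hop.new_conn_attempted < min_accept_ratio:
            return hop.name
    return None


# Hypothetical samples from the last five minutes of the game
chain = [
    HopSample("firewall", "green", new_conn_attempted=10_000, new_conn_accepted=200),
    HopSample("load balancer", "green", new_conn_attempted=200, new_conn_accepted=200),
    HopSample("application", "green", new_conn_attempted=200, new_conn_accepted=200),
    HopSample("database", "green", new_conn_attempted=200, new_conn_accepted=200),
]

silo_view_all_green = all(hop.team_status == "green" for hop in chain)
culprit = first_failing_hop(chain)
print(f"Silo view all green: {silo_view_all_green}, first failing hop: {culprit}")
```

Each team's own status is green because the traffic that reaches it is served correctly. Only when the hops are lined up as one chain, measured from the user's side, does the drop at the firewall stand out.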

Visibility

Your Answers Are Hiding in the Silos, But Can You Find Them?

The second step is to provide end-to-end visibility of the service.

To identify a problem quickly, teams require a modern monitoring platform with data-driven service intelligence. When an application is down, they can then identify the correct area to focus on.

Earlier in my career, I was asked to review a war-room incident for a company that was three days into an outage of its CRM. This company had a dozen screens and operators looking at the various components and dependencies of the complex CRM system. The phone bridge was open and third-party support was on the call. I could hear the chatter: “Ping server XYZ. What's the response?” I then observed operators entering information into notepads. Triage had determined it was an issue with the application. The third-party support was reluctant to engage unless the customer could identify the fault. With so many moving parts, legacy applications and dependencies, the team was unable to see what was wrong. In this case, there was nothing wrong with the application itself.

Data-driven service intelligence provided the clarity needed to move forward. In this case, the issue was in the database. Three days of wasted troubleshooting at the application level could have been avoided, not to mention the cost to the business.

Stop the fighting in the war rooms through visibility and collaboration

Artificial Intelligence for IT Operations (AIOps) reduces the events that lead to IT war rooms. Maybe one day we won't need them, but for the short term, we still do. With the average critical application failure costing approximately $500,000 to $1 million per hour, decreasing MTTR is key to any outage. Organizations can benefit from a modern IT operations platform to decrease MTTR and move up the IT maturity curve, and away from traditional, siloed IT.

Want to enable collaboration and provide end-to-end visibility across all of your IT services?

Learn more about using a platform approach for IT Operations.

George Khoury

George is Splunk's Product Marketing Manager and technical evangelist in APAC, responsible for communicating Splunk's go-to market strategy in the region. He works closely with customers to help them understand how machine data reveals new insights across application delivery, business analytics, IT operations, IoT, and security and compliance. With nearly 20 years in the IT industry working with large Enterprises, Manufacturing, Government and Banking sectors, George has extensive knowledge of enterprise IT systems.
