Downtime Demystified: A Deep Dive into Common Causes and Fixes

By David Dalling

When it comes to downtime, the focus has mainly been on incidents caused by traditional IT issues, overlooking ones brought on by cybersecurity failures. However, downtime can come from anywhere: According to Splunk’s The Hidden Costs of Downtime report, 56% of incidents are cybersecurity-related, while 44% stem from application or infrastructure issues.

Consequently, the most resilient companies employ successful mitigation strategies that account for application or infrastructure issues and cybersecurity failures. Here, we will explore downtime’s most common culprits and best practices your organization can adopt to mitigate downtime, regardless of origin.

Humans are (mostly) to blame

Despite an abundance of downtime causes, human error is number one. This is true across Security, ITOps, and Engineering. Half of all technology executives surveyed admit that human error — such as misconfiguring software or infrastructure — is "often" or "very often" to blame. And it’s clear why: making even simple mistakes can lead to performance errors that drag systems down or put a company's security at risk.

Not only is human error the most common cause of downtime, but it's also the hardest to detect and fix. Its mean time to detect (MTTD) is 17-18 hours, and its mean time to recover (MTTR) is 67-76 hours. That’s 2-3 days of panic and finger-pointing.
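For teams that want to track these two metrics themselves, here is a minimal Python sketch. The incident records and field names are made up for illustration and do not come from the report. It also assumes MTTR is measured from detection to recovery; some teams measure it from the start of the incident instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log (illustrative data only): when the fault began,
# when someone detected it, and when service was restored.
INCIDENTS = [
    {"started": "2024-03-01T02:00", "detected": "2024-03-01T20:00", "resolved": "2024-03-04T16:00"},
    {"started": "2024-04-10T09:00", "detected": "2024-04-11T01:00", "resolved": "2024-04-13T21:00"},
]


def hours_between(start: str, end: str) -> float:
    """Elapsed hours between two ISO 8601 timestamps."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 3600


def mttd_hours(incidents) -> float:
    """Mean time to detect: average gap between fault start and detection."""
    return mean(hours_between(i["started"], i["detected"]) for i in incidents)


def mttr_hours(incidents) -> float:
    """Mean time to recover: average gap between detection and recovery
    (an assumed definition, see the note above)."""
    return mean(hours_between(i["detected"], i["resolved"]) for i in incidents)


if __name__ == "__main__":
    print(f"MTTD: {mttd_hours(INCIDENTS):.1f} hours")
    print(f"MTTR: {mttr_hours(INCIDENTS):.1f} hours")
```

Tracking both numbers per cause (human error, malware, hardware failure, and so on) makes it easier to see which kinds of incidents take longest to find and fix.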

On the security side, respondents say malware and phishing attacks are also frequent causes, while many say some of the rarest incidents they encounter take longer to find and fix. For example, the detection and recovery times for “zero-day” exploits are likely high because it’s difficult to identify a root cause, and organizations often lack the processes to address it.

Software failures create considerable downtime for ITOps and Engineering teams as organizations adopt modern application development and deployment practices, which are more complex and introduce more points of failure. In addition, 34% of ITOps and Engineering professionals blame hardware failure.

What other factors could cause downtime? 43% of tech respondents admit their dev teams often go outside the approved tech stack to deploy new technologies, which could contribute to more downtime and serious security incidents. Meanwhile, 78% say their organization is willing to accept downtime risk to adopt new technologies. Complexity in application infrastructure and architecture, along with heavy pressure to push innovation out quickly, leads to even more instances of human error.

Cutting down on downtime

Whether an incident stems from a security breach, a network outage, or a software or hardware failure, the best practices below can help your organization mitigate downtime.

Always root out the root cause. 54% of technology executives surveyed admit they sometimes intentionally do not fix the root cause of a downtime incident. This could be for several reasons. For example, they may already plan to decommission the older application responsible for the outage, or the fix itself could have larger impacts or create outages in other areas of the business. Splunk recommends finding and fixing an incident’s root cause as a best practice because it can stop repeat issues by singling out the underlying problem and pointing to a fix. Pro tip: Investing in Observability solutions and integrating and instrumenting your data across your environment (including security/data teams) will make finding and fixing root causes much easier. Getting rid of data silos enables more thorough postmortems that help prevent repeat issues.
Connect your teams and tools. Since downtime can come from almost anywhere, complete visibility across SecOps, ITOps, and Engineering teams is essential. Sharing tools, data, and context will enable easier collaboration and problem-solving across teams. This will help your organization identify and fix the root cause faster so you can get back up and running quicker.
Be proactive. Resilient organizations take the lead in preventing issues. By investing in AI- and ML-driven solutions for pattern recognition, you’re equipping your SecOps, ITOps, and Engineering teams with a proactive and collaborative downtime prevention program. Predictive analytics powered by AI act as a force multiplier, helping to avert issues before they occur. Over half of technology executives surveyed report using generative AI features embedded into existing solutions to address downtime, with 64% claiming significant benefits. The most resilient organizations are more mature in their adoption of generative AI, expanding their use of these features at 4x the rate of the majority of respondents.
Adopt a no-tolerance approach to downtime. Our research underscores that the most resilient organizations experience downtime less frequently, recover faster, and incur fewer overall costs. Why? They grasp the financial impact of downtime more keenly than others. They see the substantial costs and view downtime as unacceptable, investing deliberately in practices and solutions to prevent it.
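To make the pattern-recognition idea behind "Be proactive" concrete, here is a deliberately simple sketch. It is not an AI or ML solution and does not represent any Splunk product; it is a basic rolling z-score check, a statistical stand-in that flags a metric reading when it strays far from its recent baseline. The function name, window size, threshold, and sample latency values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def find_anomalies(values, window=5, threshold=3.0):
    """Return the indexes of readings that deviate from the mean of the
    preceding `window` readings by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # A perfectly flat baseline gives no scale to score against; skip it.
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical response times in milliseconds, with one obvious spike.
    latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101]
    print(find_anomalies(latencies_ms))
```

A check like this can alert a team to unusual behavior before users report an outage. Production tooling typically adds seasonality handling, correlation across signals, and learned baselines that a fixed threshold cannot provide.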

Resilience restores balance
If there’s one lesson to take away from The Hidden Costs of Downtime, it’s that digital resilience is a business imperative. The majority of technology executives surveyed admit that the negative impacts they experience from downtime are unacceptable. There’s just too much at stake, both for companies and customers. By understanding that downtime can come from application, infrastructure, and security issues, and by putting plans in place that address its diverse causes, you can champion a more resilient business.

Read The Hidden Costs of Downtime report for more on how the most resilient organizations set themselves apart from the rest and Splunk’s recommendations for deterring downtime.

David Dalling

David is a subject matter expert with over 20 years of experience in information security and IT operations. He is an accomplished, motivated, and versatile IT professional who has worked in a variety of information technology fields, ranging from hands-on systems development, testing, and management to enterprise-level strategic planning and consultation. David is a man of firsts: he helped get the DHS Enterprise Security Operations program its first-ever ATO, wrote the first-ever common control package for DHS, and received a security engineering award at DHS HQ for developing a metrics program that contributed to DHS’s first-ever perfect scorecard. Building on this experience, David then led the development of the first-ever Managed XDR service to receive its FedRAMP Authorization. David has now taken his love of new challenges to adventure racing, where he purposely gets lost in the woods while trail running, mountain biking, and kayaking for hundreds of miles. As the Global VP for Splunk’s Cyber Strategist team, David helps drive the security strategy for Splunk and its security products.
