Discussions of downtime have mainly focused on incidents caused by traditional IT issues and overlooked those brought on by cybersecurity failures. However, downtime can come from anywhere: according to Splunk’s The Hidden Costs of Downtime report, 56% of incidents are cybersecurity-related, while 44% stem from application or infrastructure issues.
Consequently, the most resilient companies employ mitigation strategies that account for both application or infrastructure issues and cybersecurity failures. Here, we will explore downtime’s most common culprits and best practices your organization can adopt to mitigate downtime, regardless of origin.
Of the many causes of downtime, human error ranks first. This is true across Security, ITOps, and Engineering. Half of all technology executives surveyed admit that human error, such as misconfiguring software or infrastructure, is "often" or "very often" to blame. And it’s clear why: even simple mistakes can lead to performance errors that drag systems down or put a company's security at risk.
Not only is human error the most common cause of downtime, but it's also the hardest to detect and fix. Its mean time to detect (MTTD) is 17-18 hours, and its mean time to recover (MTTR) is 67-76 hours. Recovery alone means roughly three days of panic and finger-pointing.
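To put those hour ranges in perspective, here is a minimal sketch that converts them to days. The 17-18 and 67-76 hour figures come from the text above; the helper names are hypothetical, and the combined total assumes recovery time is counted from the moment of detection, which the text does not state.

```python
# Hour ranges for human-error incidents as quoted above: (low, high).
MTTD_HOURS = (17, 18)  # mean time to detect
MTTR_HOURS = (67, 76)  # mean time to recover

HOURS_PER_DAY = 24


def range_in_days(hour_range):
    """Convert a (low, high) range in hours to days, rounded to one decimal."""
    low, high = hour_range
    return round(low / HOURS_PER_DAY, 1), round(high / HOURS_PER_DAY, 1)


def combined_range(*ranges):
    """Add ranges end to end: lows with lows, highs with highs.

    Assumption (not from the report): recovery starts only once the
    incident is detected, so the two durations do not overlap.
    """
    return sum(r[0] for r in ranges), sum(r[1] for r in ranges)


total_hours = combined_range(MTTD_HOURS, MTTR_HOURS)

print("Recovery alone (days):", range_in_days(MTTR_HOURS))
print("Detection plus recovery (hours):", total_hours)
print("Detection plus recovery (days):", range_in_days(total_hours))
```

Under that assumption, a single human-error incident spans 84 to 94 hours from first failure to full recovery, or about three and a half to four days.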
On the security side, respondents say malware and phishing attacks are also frequent causes, while many say some of the rarest incidents they encounter take longer to find and fix. For example, the detection and recovery times for “zero-day” exploits are likely high because the root cause is difficult to identify and organizations often lack the processes to address such exploits.
Software failures create considerable downtime for ITOps and Engineering teams as organizations adopt modern application development and deployment practices, which are more complex and introduce more points of failure. 34% of ITOps and Engineering professionals also blame hardware failure.
What other factors could cause downtime? 43% of tech respondents admit their dev teams often go outside the approved tech stack to deploy new technologies, which could contribute to more downtime and serious security incidents. Meanwhile, 78% say their organization is willing to accept downtime risk to adopt new technologies. Complexity in the application's infrastructure and architecture, along with heavy demand for pushing innovation out quickly, leads to even more instances of human error.
Whether an incident stems from a security breach, a network outage, or a software or hardware failure, the best practices outlined below can help your organization mitigate downtime.
If there’s one lesson to take away from The Hidden Costs of Downtime, it’s that digital resilience is a business imperative. The majority of technology executives surveyed admit that the negative impacts they experience from downtime are unacceptable. There’s just too much at stake, both for companies and customers. Understanding that downtime can come from application, infrastructure, and security issues, and putting plans in place that address those diverse causes, will help you champion a more resilient business.
Read The Hidden Costs of Downtime report for more on how the most resilient organizations set themselves apart from the rest and Splunk’s recommendations for deterring downtime.