Ensuring Downtime Is Low When the Stakes Are High During Black Friday

CTO Stack November 20, 2024 Jason Conger

Last year’s Black Friday brought $70.9 billion in online sales globally, up 7.5% from 2022. It is the biggest sales event of the year, and the moment when online retailers can least afford a system crash. Resolving a complex IT outage costs organizations millions in lost revenue, productivity, overtime wages, and legal expenses. Brand health is slow to recover too: CMOs report that restoring it takes an average of 60 days, according to Splunk’s The Hidden Costs of Downtime report.

With Black Friday approaching fast, it’s not too late for organizations to take steps to minimize the risk of an outage and bounce back faster when an incident occurs. Beyond quick fixes, there are also long-term strategic initiatives that can increase an organization’s preparedness.

Bake resilience into your organization’s infrastructure

An organization is only as resilient as its infrastructure. Leaders should keep digital resilience in mind when architecting their IT estate, and much of that comes down to redundancy. For example, does the organization have backup systems and a regular data backup schedule? Are there additional data centers, and other redundant hardware, that can be tapped to keep the organization going if a sales spike generates more workloads than normal? Do teams use load balancers to distribute traffic across multiple servers and regions to avoid overloading any one of them? Are systems in place that mimic user behavior to confirm that end-to-end systems are up and responsive?
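
To make the last question concrete, here is a minimal sketch in Python of a check that mimics a shopper's journey. It is an illustration under stated assumptions, not a description of any particular monitoring product: the function names, the 2-second latency budget, and the shop.example.com URLs are all hypothetical. The check walks the steps in order, records status and latency for each, and stops at the first failure, because later steps in a journey depend on earlier ones.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool
    status: int
    latency_s: float


def http_fetch(url: str) -> int:
    """Return the HTTP status code for a GET request (standard library only)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still useful to record.
        return err.code


def run_synthetic_check(
    steps: List[Tuple[str, str]],
    fetch: Callable[[str], int],
    max_latency_s: float = 2.0,  # hypothetical latency budget per step
) -> List[StepResult]:
    """Walk a user journey step by step and stop at the first failing step."""
    results: List[StepResult] = []
    for name, url in steps:
        start = time.monotonic()
        try:
            status = fetch(url)
        except Exception:
            status = 0  # treat network errors and timeouts as a failed step
        latency = time.monotonic() - start
        ok = 200 <= status < 300 and latency <= max_latency_s
        results.append(StepResult(name, ok, status, latency))
        if not ok:
            break
    return results


def journey_is_healthy(results: List[StepResult], expected_steps: int) -> bool:
    """Healthy only if every expected step ran and every one of them passed."""
    return len(results) == expected_steps and all(r.ok for r in results)


# Hypothetical journey; these URLs are placeholders, not real endpoints.
JOURNEY = [
    ("home", "https://shop.example.com/"),
    ("cart", "https://shop.example.com/cart"),
    ("checkout", "https://shop.example.com/checkout"),
]
```

Calling run_synthetic_check(JOURNEY, http_fetch) on a schedule, and alerting when journey_is_healthy returns False, would be one way to use it. The fetch function is passed in as a parameter so the same logic can be exercised against a fake during a drill, without touching production.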

Being resilient entails having the infrastructure to handle whatever arises, whether that is severe weather that takes out a data center or the largest traffic peak a retailer has ever seen.

Develop a culture of resilience

There is also a human component to keeping an organization resilient and its customer experience intact. Creating a culture of resilience across an organization means training teams to understand which business outcomes are affected by downtime and how to respond when outages happen.

Disaster recovery drills and tabletop exercises can improve preparedness and are quick, short-term steps to increase an organization’s readiness. Ensure teams have tested the customer-facing scenarios that may arise and understand how their systems perform when requests are high in volume, are costly to execute, or carry big payloads. Many teams are also using chaos engineering, a practice in which intentional, controlled failures are orchestrated to improve their incident response strategy. This is similar to security organizations hiring ex-hackers to deliberately look for vulnerabilities that could compromise critical systems.
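
As a rough illustration of the chaos engineering idea, the Python sketch below injects controlled failures into a stand-in dependency and measures how often a simple retry policy still succeeds. Everything in it is an assumption made for the example: the names InjectedFault, with_chaos, call_with_retries, and run_experiment are hypothetical, and a real experiment would target actual services under agreed safety limits rather than an in-process function.

```python
import random
from typing import Callable


class InjectedFault(Exception):
    """Raised only by the chaos wrapper, so real errors are never masked."""


def with_chaos(func: Callable, failure_rate: float, rng: random.Random) -> Callable:
    """Wrap func so that a fraction of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("chaos experiment: simulated dependency failure")
        return func(*args, **kwargs)
    return wrapper


def call_with_retries(func: Callable, attempts: int, *args, **kwargs):
    """Call func up to `attempts` times, retrying only on injected faults."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func(*args, **kwargs)
        except InjectedFault as err:
            last_error = err
    raise last_error


def run_experiment(
    func: Callable,
    failure_rate: float,
    attempts: int,
    trials: int,
    seed: int = 0,  # fixed seed keeps the experiment repeatable
) -> float:
    """Return the fraction of trials that succeed despite injected faults."""
    rng = random.Random(seed)
    flaky = with_chaos(func, failure_rate, rng)
    successes = 0
    for _ in range(trials):
        try:
            call_with_retries(flaky, attempts)
            successes += 1
        except InjectedFault:
            pass  # the retry policy was exhausted for this trial
    return successes / trials


def place_order() -> str:
    """Stand-in for a real downstream call (hypothetical)."""
    return "order placed"
```

Comparing run_experiment(place_order, 0.5, attempts=1, trials=2000) with the same call using attempts=3 shows how much a retry policy absorbs when half of all calls fail. The value of the exercise is learning that number in a controlled test rather than during a real incident.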

After an outage, seize the opportunity to optimize

When an outage occurs, the incident response and crisis management teams should be notified immediately, and the pre-established triage process should unfold.

What happens next is arguably even more important. IT leaders should direct their teams to conduct a rigorous postmortem that identifies root causes and lays out detailed next steps to avoid future incidents. Meanwhile, security leaders need their teams to be on the lookout for cybersecurity vulnerabilities that were exposed, and that hackers may have exploited, during the outage. However, system outages are not the only thing organizations need to be mindful of. Throughout this period, leaders need to manage burnout across teams, making sure to rotate staff and collaborate with HR to maintain employee well-being during the high-intensity, post-outage period.

We may also see generative AI play a bigger role in helping organizations recover from outages. According to Splunk’s The Hidden Costs of Downtime report, more than half of organizations in the technology sector have been using generative AI to generate detection summaries, troubleshoot, and remediate after an incident occurs. Of these, 64% report “significant gains” from using AI technology.

After an outage, leaders also have an opportunity to set longer-term goals that build on what happened and minimize the impact of similar incidents in the future. They can review prevention, response, and support procedures, and revise crisis communication, incident response, and business continuity plans.

