Microservices have delivered on the promise of freeing individual teams to work on services independently of one another. This flexibility has allowed for quicker and more agile application development. However, the explosion of services has added complexity to debugging efforts. Managing complex interactions between tens or hundreds of microservices turns on-call incident resolution into an absolute nightmare, not to mention the risk of a bad customer experience.
As our environments and SLA (service-level agreement) expectations evolve, our processes and observability tools must evolve with them. Rote tasks such as code rollbacks and auto-scaling of compute resources can now be automated away for increased responsiveness, customer satisfaction and SLA compliance. As a result, developers and operators can focus on more critical tasks while also increasing their productivity. As an industry, we need to eliminate situations where on-call staff are interrupted from their current work or woken up at 3am to handle a task that could easily have been automated.
"Toil is manual, repetitive, automatable, tactical work that scales linearly and is the main source of concern for SREs (Site Reliability Engineers). 59% believe there is too much toil in their organization and not enough has been automated to reduce that toil."
— 2019 State of SRE Report, Catchpoint
Introducing Splunk Infrastructure Monitoring and Amazon EventBridge Integration
We are excited to be an official launch partner of Amazon EventBridge. This new integration makes it simple to leverage the real-time problem detection capabilities of Splunk Infrastructure Monitoring to automate remediation actions. DevOps and SRE teams can now fully realize the promise of programmable infrastructure through closed-loop automation.
Amazon EventBridge is a serverless event bus that connects applications together, delivering a stream of real-time data from AWS resources, SaaS applications, and your own applications. With this new integration, joint customers of Splunk Infrastructure Monitoring and AWS will be able to operate their infrastructure and applications with continuous closed-loop automation to improve responsiveness, SLA compliance, customer experience, and the overall productivity of DevOps and SRE organizations.
Traditional Incident Response is a Slow Manual Process
Historically, there have been three significant barriers to proactive closed-loop event response in IT Automation:
- The latency of traditional observability tools is high, taking minutes and sometimes hours to get results from a query.
- Basic infrastructure metrics (CPU, Memory, etc.) combined with static alerts do not catch anomalies and outliers in complex distributed systems.
- Routing and handling information has traditionally been a security and technical nightmare.
Traditional incident response has been reactive rather than proactive. The common solution has been a one-off script or a large, general purpose tool that nobody wants to maintain. You either page someone or try to uncover the bug yourself.
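To make the second barrier concrete, here is a minimal sketch of why a static threshold can miss a misbehaving host that a simple peer comparison would catch. It uses the median absolute deviation (MAD), a generic robust statistic chosen here for illustration; it is not a description of how Splunk Infrastructure Monitoring detects outliers. The CPU figures, the threshold, and the function names are all made up for the example.

```python
from statistics import median


def static_alerts(values, threshold):
    """Return the indices of values that exceed a fixed threshold."""
    return [i for i, v in enumerate(values) if v > threshold]


def mad_outliers(values, k=3.0):
    """Return the indices of values that sit far from the group median.

    Distance is measured in units of the median absolute deviation (MAD),
    so the cutoff adapts to how tightly the group is clustered.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All typical values are identical; anything different is an outlier.
        return [i for i, v in enumerate(values) if v != med]
    return [i for i, v in enumerate(values) if abs(v - med) / mad > k]


# Hypothetical CPU utilization (%) for six hosts serving the same traffic.
cpu = [22.0, 24.0, 23.0, 21.0, 25.0, 61.0]

print(static_alerts(cpu, threshold=80.0))  # [] : no host crosses 80%
print(mad_outliers(cpu))                   # [5] : the last host is far from its peers
```

The host at 61% never trips the 80% alert, yet it is running at more than twice the load of its peers, which is the kind of signal a static rule on a basic metric tends to miss.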
Closed-Loop Automation is Critical for Microservices
With Splunk Infrastructure Monitoring and Amazon EventBridge, the benefits are indeed a breakthrough. Where traditional observability tools took minutes, if not hours, to return actionable results, the real-time streaming analytics architecture of Splunk Infrastructure Monitoring processes even the most sophisticated alert detectors within seconds. The SignalFlow™ analytics engine, at the core of the Splunk Infrastructure Monitoring observability platform, is uniquely equipped to address the complexity and volume of data that modern environments impose. Thanks to patented data science, SignalFlow™ creates a dynamic view of the environment and identifies true outlier conditions in seconds.
Amazon EventBridge completes the picture, providing an advanced, scalable dispatch system capable of routing and handling events at any scale. Your systems can now react in seconds, so that incident response is underway before a human could even realize something needs attention. Once the incident has been remediated, you can start to troubleshoot. Splunk Infrastructure Monitoring's Outlier Analyzer™ enables rapid root-cause analysis: by analyzing every single transaction across your microservices and correlating across your application code and infrastructure, it isolates your problem with one click.
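As an illustration of what the automated half of that loop could look like, the sketch below shows an AWS Lambda-style handler that maps an incoming alert event to a remediation action. The `lambda_handler(event, context)` signature and the top-level `detail` key are standard for Lambda and EventBridge; everything inside `detail` (the `detector` and `service` fields), the detector names, and the remediation functions are hypothetical placeholders, since the actual payload sent by the integration is not described here.

```python
from typing import Callable, Dict


def scale_out(detail: dict) -> str:
    # Placeholder: a real implementation might call an auto-scaling API here.
    return f"scale-out requested for {detail.get('service', 'unknown')}"


def roll_back(detail: dict) -> str:
    # Placeholder: a real implementation might trigger a deployment rollback.
    return f"rollback requested for {detail.get('service', 'unknown')}"


# Hypothetical mapping from detector name to remediation action.
REMEDIATIONS: Dict[str, Callable[[dict], str]] = {
    "high-latency": scale_out,
    "error-rate-after-deploy": roll_back,
}


def lambda_handler(event: dict, context: object = None) -> dict:
    """Pick a remediation based on the (assumed) detector name in the event."""
    detail = event.get("detail", {})
    action = REMEDIATIONS.get(detail.get("detector", ""))
    if action is None:
        # Unknown alerts are left for a human rather than guessed at.
        return {"handled": False, "reason": "no remediation registered"}
    return {"handled": True, "result": action(detail)}
```

Keeping the mapping in a plain dictionary means a new automated response is a one-line addition, and any alert without a registered action falls through to the normal paging path.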
Setting up the Splunk Infrastructure Monitoring Integration with EventBridge is as easy as 1, 2, 3:
- Select Amazon EventBridge from the integrations page in Splunk Infrastructure Monitoring and enter your Amazon organization ID and AWS Region in the configuration.
- Associate the Event Source in AWS.
- Configure a detector in Splunk Infrastructure Monitoring and set your Amazon EventBridge integration as the target of the detector.
That’s all you need to do to get events flowing into AWS!
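Once events are arriving, a rule on the associated event bus decides which of them reach a target such as a Lambda function. The sketch below builds an event pattern that matches on a source prefix, plus the keyword arguments for the `put_rule` call on boto3's EventBridge client. It assumes partner events carry a `source` beginning with `aws.partner/`; the rule and bus names are placeholders, so check the source name shown in your own AWS console before relying on it.

```python
import json


def partner_event_pattern(source_prefix: str = "aws.partner/") -> str:
    """Build an EventBridge event pattern that matches on a source prefix."""
    pattern = {"source": [{"prefix": source_prefix}]}
    return json.dumps(pattern)


def put_rule_kwargs(rule_name: str, event_bus_name: str) -> dict:
    """Assemble arguments for the EventBridge PutRule API."""
    return {
        "Name": rule_name,
        "EventBusName": event_bus_name,
        "EventPattern": partner_event_pattern(),
        "State": "ENABLED",
    }


# Placeholder names; substitute the event bus created when you
# associated the event source in AWS.
kwargs = put_rule_kwargs("route-monitoring-alerts", "my-partner-event-bus")

# With boto3 installed and AWS credentials configured, the rule could be
# created like this:
#   import boto3
#   boto3.client("events").put_rule(**kwargs)
```

A remediation function would then be attached to the rule as a target.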
Learn more about Splunk Infrastructure Monitoring and get a 14-day free trial.