If you’re working with microservices in a large distributed environment, you’ve probably got your monitoring and logging on lock, and you may even be lucky enough to have properly instrumented APM (distributed tracing) for consumer calls. But, did you know you’re likely still facing an observability gap?
How many incidents have you worked that required hours of sleuthing only to end with a single team needing to roll back a deployment? It’s more common than you may think!
As mentioned in the Splunk Blog post “Jenkins, OpenTelemetry, Observability,” unless you’re leveraging Events and Alerting to their utmost potential, you’re likely missing a crucial element of your software lifecycle and CI/CD processes. Event-based CI/CD data can mean the difference between minutes-long MTTD/MTTR and hours!
With our new Azure DevOps integrations for sending Events to Splunk Observability and for Alert-based Release Gating, your organization can start integrating CI/CD context into your monitoring practices!
Working an Incident and Leveraging Deployment Events and Alerts
Take a moment to ask yourself:
- “Do I know when my software is being deployed?”
- “How many clicks do I have to invest to know when a deployment happened?”
- “Do I know when upstream services my software depends on are being deployed?”
- “How do I know when those services’ deployments are impacting my software?”
Events in Splunk Observability are highly visible, easily overlaid on dashboards with event markers / lines, and are essentially “free real estate” with no specific associated charges. Imagine how much more context you can include on your dashboard with Event Markers for deployment start, deployment success, deployment failure, etc. Not just for your own services, but also upstream services you depend on!
Figure 1-1. Quickly view CI/CD events from Azure DevOps overlaid on your dashboard charts
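For teams that want to see roughly what a deployment event looks like outside of the Azure DevOps integration, here is a minimal Python sketch. It assumes the custom event ingest endpoint (`https://ingest.<realm>.signalfx.com/v2/event`) and the `X-SF-Token` header; confirm both against the current Splunk Observability API documentation before relying on them. The `deployment_<stage>` event type naming, the `service` dimension, and the `version` property are hypothetical choices made for illustration, not something the integration prescribes.

```python
import json
import time
import urllib.request


def build_deployment_event(service, stage, version, timestamp_ms=None):
    """Build one custom event describing a deployment stage.

    The field names follow the custom event format as understood here
    (an assumption to verify); the naming scheme is illustrative only.
    """
    if timestamp_ms is None:
        # Event timestamps are expressed in milliseconds since the epoch.
        timestamp_ms = int(time.time() * 1000)
    return {
        "category": "USER_DEFINED",
        "eventType": f"deployment_{stage}",   # e.g. deployment_start
        "dimensions": {"service": service},   # hypothetical dimension name
        "properties": {"version": version},   # hypothetical property name
        "timestamp": timestamp_ms,
    }


def send_events(events, realm, token):
    """POST a list of events to the ingest endpoint and return the HTTP status.

    The URL and header are assumptions based on the public ingest API.
    """
    request = urllib.request.Request(
        url=f"https://ingest.{realm}.signalfx.com/v2/event",
        data=json.dumps(events).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-SF-Token": token},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A pipeline step could call `send_events([build_deployment_event("checkout", "start", "1.4.2")], realm, token)` at deployment start and again with `"success"` or `"failure"` at the end, giving dashboards one marker per stage.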
How Can my Entire Tech Organization Benefit from CI/CD Events?
Successful monitoring (and by association incident management) is all about context and communication! By helping teams to quickly establish if a deployment has impacted their service’s performance, or another service’s performance, your software teams can decrease Mean Time To Detect (MTTD) and Mean Time To Recovery (MTTR).
- IT Operations / Support Analysts: As a member (or Head) of the IT Operations team, I want up to date build information on important services, to quickly notify Software teams if recent deployments are causing or related to a service interruption.
- DevOps / SRE: As a member (or Head) of a DevOps or SRE team, I need to ensure services are healthy, and if not, quickly track down the cause. The ability to provide stakeholders with visibility into issues caused by deployment of applications and infrastructure will help them improve their software development and deployment practices.
- Software Developers: As a member (or Head) of a Software Development team, I want to know if a recent deployment of my own software or upstream services is causing an issue, before customers are impacted and without jumping between different tools and UIs.
Less jumping between multiple tools and increased context in a single UI is the name of the game! But, there are further benefits to be had with another of our Azure DevOps integrations.
Try out the Splunk Observability Events integration from the Microsoft Marketplace!
Trouble at the Gates! Alert!
As mentioned above, there is great value in creating alerts and/or events related to software deployments. Knowing when your service or upstream services have been deployed is a vital signal for Development, SRE, CI/CD, and DevOps teams. But the next logical step is to be proactive and start gating your releases based on Splunk Observability Alerts.
For example, the configuration shown in Figure 1-2 below has three steps. First, it sends an event to Splunk Observability on pipeline start so that the deployment is marked in Observability. Next, it checks our service’s alert to make sure it isn’t firing. Finally, it checks the upstream service’s alert to make sure it isn’t firing either, before moving through the rest of the release process.
Figure 1-2. Set up deployment gates based on Splunk Observability Alerts
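The three-step flow described above can be sketched in a few lines of Python. This is an illustration of the gating logic only, not how the Azure DevOps integration is implemented. All names here (`evaluate_gate`, `run_release`, the alert identifiers) are hypothetical, and `is_firing` is a placeholder for whatever lookup your team uses to ask Splunk Observability whether an alert is currently active.

```python
from typing import Callable, Iterable, List, Tuple


def evaluate_gate(own_alerts: Iterable[str],
                  upstream_alerts: Iterable[str],
                  is_firing: Callable[[str], bool]) -> Tuple[bool, List[str]]:
    """Return (gate_passed, firing_alerts) for the given alert identifiers.

    is_firing is a caller-supplied lookup (hypothetical here) that reports
    whether a single alert is currently active.
    """
    # Check our own service's alerts first, then the upstream alerts.
    candidates = list(own_alerts) + list(upstream_alerts)
    blocked = [alert for alert in candidates if is_firing(alert)]
    return (len(blocked) == 0, blocked)


def run_release(send_start_event: Callable[[], None],
                own_alerts: Iterable[str],
                upstream_alerts: Iterable[str],
                is_firing: Callable[[str], bool],
                deploy: Callable[[], None]) -> Tuple[bool, List[str]]:
    """Mark the pipeline start, evaluate the gate, and deploy only if it passes."""
    send_start_event()  # step 1: mark the deployment in Observability
    passed, blocked = evaluate_gate(own_alerts, upstream_alerts, is_firing)  # steps 2 and 3
    if passed:
        deploy()  # continue through the rest of the release process
    return (passed, blocked)
```

Returning the list of firing alerts alongside the pass/fail result lets the pipeline log exactly which alert held the release back, which is the kind of context an on-call engineer would want.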
Gating releases based on alerting for your own services is a natural place for teams to start. But, this may be counterproductive during incidents involving your service. The last thing you need during a stressful incident is something preventing you from deploying a fix. In these cases gates in Azure DevOps pipelines can be easily ignored with a single click in the interface.
More useful in most cases is gating your deployment or release based on the health of upstream services that influence the health of your own software. Below you’ll find a list of examples that may make your release gating practices more effective:
- If your company’s edge router is having issues, you’re going to have a hard time determining whether a given deployment is healthy. Consider holding off on deploying until those issues and associated alerts have settled down and consumer responses are flowing again.
- If you depend on another team’s service for correct and timely responses, you may consider a release gate based on the health of their service. The last thing you want to do is to compound an incident already in progress by deploying more changes into the environment.
- Dependencies in the cloud, fully managed or otherwise, such as DynamoDB or BigQuery can also face outages. Gating on the health of these dependencies will prevent deploying changes when circumstances outside of your control make establishing the health of a deployment more difficult.
- Alerts based on Splunk APM, Splunk Synthetics, or Splunk Real User Monitoring (RUM) metrics will give you a release gate focused more closely on a specific web property or user experience. For more advanced organizations this can function as a final check directly related to how customers interact with your software.
Release Gates are a prime tool for protecting the integrity of your overall software environment. The ability to prevent further changes during ongoing incidents will help protect the availability KPIs of your service. Proactively preventing errant deployments during times of trouble will also likely make you some friends with your co-workers in incident management!
Try out the Splunk Observability Alert Gate integration from the Microsoft Marketplace!
Onward To The Future State!
As noted, events and alerts in Splunk Observability have no specific cost associated with their ingest, storage, or usage. This sort of “free real estate” goes untapped by most organizations but can provide enormous value to software teams, incident management, and CI/CD processes.
Events marking deployments, releases, and even infrastructure changes can provide much needed context to your monitoring.
Alerts, commonly used to indicate service health and notify on-call resources, can spread some of that contextual awareness of your software environment to your deployment pipelines.
But, alerts and events need not be constrained to internal factors. Alerting and event reporting based on other Splunk Observability products can help provide even more untapped context to your organization.
- Synthetic Testing: “Is the AWS Status page showing US-West-2 is down? When did that start?”
- Real User Monitoring (RUM): “It looks like mobile users are seeing a slow and steady increase of latency in the EU that starts around the time we changed our CDN configuration last Tuesday…”
- APM: “Users started having intermittent issues finishing checkout after last month’s AMI changes.”
Start looking without, in addition to within, to get a more detailed understanding of how a given change to a SaaS service, a vendor’s configuration, or an external process may impact your software!
Excited about context? Interested in getting more monitoring information in a single pane of glass and generally pushing your DevOps (or DevSecOps) Magic™ to the limit? Check out Splunk Observability!
You can sign up to start a free trial of the Splunk Observability Cloud suite of products today!
This blog post was authored by Jeremy Hicks, Observability Field Solutions Engineer at Splunk with special thanks to: Doug Erkkila, Adam Schalock, Todd DeCapua, and Joel Schoenberg at Splunk.