SignalFlows to SLOs

How are you tracking the long-term operational health of your micro and macro services? Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are prized (but sometimes “aspirational”) metrics for DevOps teams and ITOps analysts. Today we’ll see how to use SignalFlow to put together SLO Error Budget tracking (or easily spin up the same with Terraform)!

Depending on your organization, SLOs may take many forms:

  • Application teams measuring the success/failure rates of various services.
  • Teams setting aspirational goals for services (“If we track above SLO for multiple periods we can build trust and start more work on new features!”).
  • A common framework for DevOps, Operations, and Management to understand long-term service health.
  • Downtime and/or error budgets. By budgeting a certain amount of downtime or “error minutes” for each service, teams can better prioritize work on operational health vs. new features.

If you’ve read the Google SRE handbook (and who hasn’t?), you’ll be familiar with SLOs and SLIs. But if you haven’t, try this quote on for size.

 

"SREs’ core responsibilities aren’t merely to automate “all the things” and hold the pager. Their day-to-day tasks and projects are driven by SLOs: ensuring that SLOs are defended in the short term and that they can be maintained in the medium to long term. One could even claim that without SLOs, there is no need for SREs." - Google SRE Handbook Chapter 2

 

As stated, SLx (SLO/SLI/SLA) is most often concerned with trends over time. Terms such as “Error Minutes”, “Monthly Budget”, “Availability per quarter”, and the like are common in discussions of SLx. So how do we use our charts, which track trends over time, and our alerts, which notify us of events in time, to create an SLO? We can use the features of SignalFlow!

Figure 1-1. Example Error Budget using Alert Minutes

 

SignalFlow? Here We Go…

So you want to create an SLO, specifically based on the concept of “Downtime Minutes”. What ingredients would you need to cook that up?

  1. “We need a way to mark times when our service traffic is dipping below 99% success rate”
  2. “We need a way to convert #1 to minute(s) long increments or Downtime Minutes” 
  3. “Once we have Downtime Minutes we need to track those on a Monthly/Quarterly period against a budgeted number of Downtime Minutes”

From the list above, #1 sounds like some kind of alert on “success rate”. #2 means a way to force alerts to fire every minute while in an alerting state. And #3 would take the count of those minute-long alerts and compare it against a constant number (the budgeted number of Downtime Minutes).

Fortunately, SignalFlow gives us the ability to create these sorts of minute-long alerts and a way to track the number of alerts in a given cyclical period (week/month/quarter/etc.).
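
Before turning to SignalFlow itself, the three ingredients can be illustrated with a minimal Python sketch over a list of per-minute request counts. This is only an illustration of the idea: the function names (`success_rate`, `downtime_minutes`), the sample data, and the handling of minutes with no traffic are assumptions made for this example, not part of SignalFlow or Splunk Observability.

```python
from typing import Optional, Sequence, Tuple


def success_rate(successes: Optional[float], total: float) -> Optional[float]:
    """Return the success rate in percent, or None when there was no traffic."""
    if not total:
        return None  # assumption: a minute without requests is not judged
    # A missing success count is treated as zero successes.
    return 100.0 * (successes or 0) / total


def downtime_minutes(
    per_minute: Sequence[Tuple[Optional[float], float]],
    threshold: float = 99.0,
) -> int:
    """Count the minutes whose success rate is below the threshold."""
    count = 0
    for successes, total in per_minute:
        rate = success_rate(successes, total)
        if rate is not None and rate < threshold:
            count += 1  # ingredient #2: one breaching minute = one Downtime Minute
    return count


# Hypothetical (successes, total) samples, one pair per minute.
samples = [(100, 100), (97, 100), (None, 50), (995, 1000), (0, 0)]
budget = 438  # budgeted Downtime Minutes for the period

used = downtime_minutes(samples)
print(used, budget - used)  # ingredient #3: compare usage against the budget
```

Here two of the five minutes fall below 99%, so two Downtime Minutes are charged against the budget.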

 

Alerts and Increments

SignalFlow allows us to track alerts as a time series: the `alerts()` function counts the number of alerts during a period of time.

This requires:

  1. An alert that only ever goes off for a minute at a time
    1. During long stretches of time where the alert would be triggered, have it turn on/off every minute
    2. Likely use a rate for this alert (error rate is most common)
  2. A charting method that can munge alert data and use stats functions on it
    1. This includes a way to reset the “count” monthly/quarterly/weekly/etc. (more on this later)

Alert SignalFlow Example:

# Narrow the APM span data down to the 'adservice' service
filter_ = filter('sf_environment', '*') and filter('sf_service', 'adservice') and filter('sf_kind', 'SERVER', 'CONSUMER') and (not filter('sf_dimensionalized', '*')) and (not filter('sf_serviceMesh', '*'))
# A: successful requests only; B: all requests
A = data('spans.count', filter=filter_ and filter('sf_error', 'false'), rollup='rate').sum().publish(label='Success', enable=False)
B = data('spans.count', filter=filter_, rollup='rate').sum().publish(label='All Traffic', enable=False)
# C: success rate as a percentage (a missing A is treated as 0)
C = combine(100*((A if A is not None else 0)/B)).publish(label='Success Rate %')
# Any constant below 100 makes the off condition below always true
constant = const(30)
# Fire after 40s below 98%; in 'split' mode the off condition then clears the alert after 10s
detect(when(C < 98, duration("40s")), off=when(constant < 100, duration("10s")), mode='split').publish('Success Ratio Detector')

 

What are we doing in this Alert?

  1. Define a filter: We’re using APM data, so this filter mostly just narrows things down to a specific service, “adservice”.
  2. Create our “Success Rate” (the inverse of error rate) so we can alert if it drops below a threshold.
    1. A gets ONLY successful requests
    2. B gets ALL requests
    3. C computes the Success Rate as a percentage
  3. Make our detector for Downtime Minutes (minutes below the Success Rate threshold).
    1. Make a constant value (could be anything below 100). We use this to turn the detector off once it has been on for 10 seconds.
    2. Detector triggers when Success Rate (C) is below 98% for 40 seconds
      1. Use the off= setting along with the mode='split' setting to reset the detector if two conditions are met:
        1. Detector is currently triggered
        2. The constant has been less than 100 for 10 seconds
Essentially, we have made an alert that fires roughly once a minute for as long as the metric is breaching the alert threshold.
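
To get a feel for that fire/clear rhythm, here is a toy Python model of the detector loop. It is a sketch under stated assumptions, not a description of how SignalFlow evaluates detectors: it assumes one evaluation every `step` seconds and that each duration timer restarts from zero after every state change. The function `count_alerts` and its parameters are hypothetical names for this example.

```python
def count_alerts(breaching, step=10, on_after=40, off_after=10):
    """Count alerts fired by an idealized fire/clear loop.

    breaching: one boolean per evaluation tick of `step` seconds.
    Assumption: each timer restarts from zero after every state change.
    """
    fired = 0
    alerting = False
    held = 0  # seconds the currently watched condition has held
    for is_breaching in breaching:
        if alerting:
            # The clear condition (constant < 100) is always true.
            held += step
            if held >= off_after:
                alerting, held = False, 0
        else:
            # The trigger condition must hold without interruption.
            held = held + step if is_breaching else 0
            if held >= on_after:
                fired += 1
                alerting, held = True, 0
    return fired


ten_minutes = [True] * 60  # a sustained breach, sampled every 10 seconds
print(count_alerts(ten_minutes))               # 40s + 10s cycle
print(count_alerts(ten_minutes, on_after=50))  # 50s + 10s cycle
```

In this toy model a sustained breach fires once every `on_after + off_after` seconds, so the 40s/10s settings give a 50-second cycle (12 alerts in ten minutes) and 50s/10s gives exactly one per minute. The real cadence also depends on data resolution and evaluation delay, which are not modelled here, so treat each alert as approximately one Downtime Minute rather than an exact one.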

Charts and Cycles

Next, we need to create a chart that tracks these Alert Minutes. Additionally, we’d like to have that chart reset monthly so we can know if our Error Budget for the month has been used up.

Error Budget Charting Examples:

## Chart based on detector firing
AM = alerts(detector_name='THIS IS MY DETECTOR NAME').count().publish(label='AM', enable=False)
alert_stream = (AM).sum().publish(label="alert_stream")
downtime = alert_stream.sum(cycle='month', partial_values=True).fill().publish(label="Downtime Minutes")
## A 99% uptime target leaves roughly 438 minutes of downtime per month
budgeted_minutes = const(438)
Total = (budgeted_minutes - downtime).fill().publish(label="Available Budget")

What are we doing here?

  1. Define a metric stream from the Detector we created before.
    1. Take that alert stream and sum it. We want all the minutes!
  2. Create your “downtime minutes” stream 
    1. Sum the summed alert stream with a cycle= of month/week/etc and allow partial_values just in case.
    2. Use fill() to fill any empty data points with the last value
  3. Create a constant for the number of minutes in our “error budget” or “downtime budget”
  4. Create Total Available Budget
    1. Subtract downtime stream from budgeted minutes constant value
    2. Use fill() to fill any empty data points with the last value
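
The two numeric ideas in this chart, where the 438-minute constant comes from and how a monthly cycle resets the running total, can be sketched in plain Python. This is an illustration only: `budget_minutes` and `remaining_budget` are hypothetical helpers, the average month length of 365.25 / 12 days is an assumption, and the reset logic only loosely mimics what `sum(cycle='month', partial_values=True)` does in the chart (`fill()` is not modelled).

```python
from datetime import datetime, timedelta


def budget_minutes(target_pct, days_in_period=365.25 / 12):
    """Downtime minutes allowed in a period for a given availability target."""
    return days_in_period * 24 * 60 * (100.0 - target_pct) / 100.0


def remaining_budget(alert_times, budget):
    """Yield (timestamp, downtime_so_far, budget_left) per alert minute.

    The running total restarts at the start of each calendar month.
    """
    current_month = None
    downtime = 0
    for ts in sorted(alert_times):
        month = (ts.year, ts.month)
        if month != current_month:
            current_month, downtime = month, 0  # new cycle: reset the count
        downtime += 1  # each alert counts as one Downtime Minute
        yield ts, downtime, budget - downtime


budget = round(budget_minutes(99.0))  # 1% of an average month

# Four hypothetical alert minutes that straddle a month boundary.
start = datetime(2024, 1, 31, 23, 58)
alerts = [start + timedelta(minutes=i) for i in range(4)]

for ts, used, left in remaining_budget(alerts, budget):
    print(ts.isoformat(), used, left)
```

The first two alerts are charged to January and the last two to February, so the downtime count climbs to 2, then starts again from 1 once the new month begins.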

 

Figure 1-2. Example charting of Alert Minutes

 

Better Together, Observing Forever!

SignalFlow is an incredibly powerful tool! Some of its advanced features can lead you to really interesting discoveries!

As seen in the examples above, we can use SignalFlow to create Alerts and Charts that work together to give us a new way to view our SLO/SLx concerns. For more detailed examples and deeper SignalFlow usage, check out the SignalFlow repo on GitHub.

Next Steps

Splunk Observability provides you with nearly endless possibilities! Think of this article as a jumping off point for using SignalFlow in more advanced ways in Splunk Observability.

To easily get these types of SLO / Error Budget tracking functions in Splunk Observability using Terraform, check out the Observability-Content-Contrib repo on GitHub! If you’re doing something cool with Splunk Observability, please consider contributing your own dashboards or detectors to this repo.

If you haven’t checked out Splunk Observability, you can sign up to start a free trial of the Splunk Observability Cloud suite of products today!


This blog post was authored by Jeremy Hicks, Observability Field Solutions Engineer at Splunk with special thanks to: Bill Grant and Joseph Ross at Splunk.

Jeremy Hicks is an observability evangelist and SRE veteran from multiple Fortune 500 E-commerce companies. His enthusiasm for monitoring, resiliency engineering, cloud, and DevOps practices provides a unique perspective on the observability landscape.