Cyclical Statistical Forecasts and Anomalies - Part 3

Platform March 20, 2018 Splunk

Splunk customers are awesome and often come up with interesting new methods for building analytical workflows in Splunk.

Splunk customer Michael Fisher presented a fantastic technique for his .conf2016 presentation—"Building a Crystal Ball: Forecasting Future Values for Multi-Cyclic Time Series Metrics in Splunk"—using techniques he dubbed “Cloning” and “Time Travel.” It's pretty compelling stuff! I ran across Michael's work months after he presented when my attempts to build the same workflow were running into scale problems; I've used his techniques ever since.

In this example, I want to forecast the future and create interesting anomalies for alerts as the future becomes the present, but I also want to smooth my data and add business rules. I want to use data from the past exactly as we did in the earlier posts, but this time I also want data from around my keys in the past; for example, I want the data from the 30 minutes before and after 12:15pm on Thursdays so I can smooth those behaviors. We are going to use this same technique in further Machine Learning Toolkit (MLTK) examples for creating interesting features, a key requirement for any machine learning solution.

To make this even more complicated, I don’t want to assume the behavior of my data is normal; that is, I don’t want to assume a bell curve of behavior. Toufic Boubez, our VP of Machine Learning and Incubation Engineering, presented on this topic at .conf2016 as well in his session, "A (VERY) Brief Introduction to Machine Learning for ITOA," explaining why that’s important.

In addition, I want a repeatable workflow made of macros so I can reuse the whole workflow again and again for different forecast periods and confidence levels or threshold multipliers.

That’s a big list of requirements, but it can be done! Let’s break it down.

Pro Tip: When using the timechart command before this macro, remember cont=f if you are trying to preserve your events with nulls; or if you want to use multiple entities through the same macro, use bin and stats like:

..

| bin _time span=10m

| stats count by EntityField, _time

Snippet #1: I am going to clone data so I can smooth/average out around a specific time slot. In our previous forecasting examples we only used values from the past that were directly linked to our target time period—for 1pm on a Monday we only used the values from 1pm on all past Mondays. What if I want to use values around 1pm on Monday in the past to smooth my forecasting? I am going to need to clone the data from around the target time in the past.
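To make the cloning idea concrete outside of Splunk, here is a minimal Python sketch of what Snippet #1 does to a single past event: it copies the event forward by a whole number of weeks and fans it out across the slots 30 minutes either side. The function name clone_shifts and its parameters are my own for illustration, and plain timedelta arithmetic is an assumption; it ignores daylight-saving changes that relative_time may handle differently.

```python
from datetime import datetime, timedelta

def clone_shifts(event_time, weeks_back, step_minutes=10, window_minutes=30):
    """Return the future time slots that one past event is cloned to.

    Hypothetical helper: mirrors shift strings such as "+2w-30m" ... "+2w+30m".
    """
    offsets = range(-window_minutes, window_minutes + 1, step_minutes)
    return [event_time + timedelta(weeks=weeks_back, minutes=m) for m in offsets]

# A 10-minute bucket from a Thursday two weeks before the forecast day.
past_event = datetime(2018, 3, 1, 12, 10)
clones = clone_shifts(past_event, weeks_back=2)
for slot in clones:
    print(slot.isoformat())
```

Each past bucket lands on seven future slots, so every future slot ends up collecting seven neighboring buckets from each back week, which is what gives the smoothing.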

Snippet #2: Create any custom fields, like upper and lower, and add any individual outlier removal rules, like |eval count=if(count>100,100,count). Do not use |outlier, as it doesn’t have a by clause.

Snippet #3: As multiple events have the same time field (not _time, we have a new field called time), and all the values in that field have future time values (relative_time, just like our last examples), we can now aggregate any custom field to the future time point.
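Snippets #2 and #3 together amount to measuring the spread above and below the average separately. The following Python sketch shows that arithmetic for the values that share one future time slot; split_spread is a hypothetical name, and the sample numbers are made up purely to show the asymmetry. It uses the sample standard deviation, which is what the stdev function in stats computes.

```python
from statistics import mean, stdev

def split_spread(values):
    """Return (pred, ustdev, lstdev) for the values cloned onto one time slot."""
    pred = mean(values)
    # Values below the average are pulled up to it, so only upside spread remains.
    upper = [max(v, pred) for v in values]
    # Values above the average are pulled down to it, so only downside spread remains.
    lower = [min(v, pred) for v in values]
    return pred, stdev(upper), stdev(lower)

# Illustrative counts only: four ordinary buckets and one spike.
pred, ustdev, lstdev = split_spread([10, 12, 11, 13, 40])
print(pred, ustdev, lstdev)
```

With one spike in the data the upper deviation comes out much larger than the lower one, so the resulting band can be wide above the average and narrow below it.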

Snippet #4: Business rules and, most importantly, Chebyshev's inequality, which assumes nothing about the future distribution. Explicitly, this is where we are creating a statistical forecast without specifying whether the future will be a normal curve or not. Note the example uses 90% confidence; if you lower your confidence level, the bands will become tighter around the “average.”

You can replace this step with any of the examples from the MLTK outlier detection assistant as well, if you want median absolute deviation, for example. This is also where you would apply a business rule such as giving the lowerBound field a minimum value of, say, 1 if lowerBound<1.
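For the Chebyshev step in Snippet #4, the multiplier comes from the inequality itself: at most 1/k² of any distribution lies k or more standard deviations from the mean, so setting 1/k² equal to 1 minus the confidence gives k = sqrt(1/(1-confidence)). A small Python sketch of that arithmetic follows; the function names and the sample inputs are mine, not part of the macro.

```python
import math

def chebyshev_multiplier(confidence_pct):
    """Number of standard deviations needed for the given confidence (in percent)."""
    if not 0 <= confidence_pct < 100:
        raise ValueError("confidence_pct must be in [0, 100)")
    return math.sqrt(1 / (1 - confidence_pct / 100))

def bounds(pred, lstdev, ustdev, confidence_pct):
    """Apply the same multiplier to the lower and upper deviations."""
    k = chebyshev_multiplier(confidence_pct)
    return pred - lstdev * k, pred + ustdev * k

for confidence in (80, 90, 95):
    print(confidence, round(chebyshev_multiplier(confidence), 3))

# Illustrative inputs only.
lower_bound, upper_bound = bounds(pred=100, lstdev=5, ustdev=10, confidence_pct=90)
print(lower_bound, upper_bound)
```

At 90% the multiplier is the square root of 10, a little over 3.16, and at 80% it drops to the square root of 5, about 2.24, which is why lowering the confidence tightens the bands.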

Pro Tip: Make sure to persist your NULLS if that is part of your workflow.

Snippet #5: We are replacing _time with time, making our time travel complete, exactly like we did in the last post for forecasting a single day into the future.

Snippet #6: We just want to clean up with a timechart in case we want a new aggregation, and finally we remove all the empty events that have no results. We are now ready to save this search as a macro, and then call the macro followed by a collect command, maybe even with a map command if we have to.

Macro version:

….

|timechart count span=10m

Snippet #1:

eval w=case(
(_time>relative_time(now(), "$reltime$@d-5w-30m") AND _time<=relative_time(now(), "$reltime$@d-5w+$days$d+30m")), 5,
(_time>relative_time(now(), "$reltime$@d-4w-30m") AND _time<=relative_time(now(), "$reltime$@d-4w+$days$d+30m")), 4,
(_time>relative_time(now(), "$reltime$@d-3w-30m") AND _time<=relative_time(now(), "$reltime$@d-3w+$days$d+30m")), 3,
(_time>relative_time(now(), "$reltime$@d-2w-30m") AND _time<=relative_time(now(), "$reltime$@d-2w+$days$d+30m")), 2,
(_time>relative_time(now(), "$reltime$@d-1w-30m") AND _time<=relative_time(now(), "$reltime$@d-1w+$days$d+30m")), 1)
| eval shift=case(isnotnull(w),"+"+w+"w-30m,+"+w+"w-20m,+"+w+"w-10m,+"+w+"w-0m,+"+w+"w+10m,+"+w+"w+20m,+"+w+"w+30m,")
| where isnotnull(shift)
| makemv delim="," shift
| mvexpand shift
| eval time=relative_time(_time,shift)

Snippet #2:

| eventstats avg($val$) AS pred by time

| eval upper=if($val$>pred,$val$,pred)

| eval lower=if($val$<pred,$val$,pred)

Snippet #3:

| stats avg($val$) AS pred, stdev(upper) AS ustdev, stdev(lower) AS lstdev by time

Snippet #4:

| eval lowerBound=pred-lstdev*(sqrt(1/(1-$confidence$/100)))

| eval upperBound=pred+ustdev*(sqrt(1/(1-$confidence$/100)))

Snippet #5:

| eval _time=time

Snippet #6:

| timechart span=10m min(pred) as pred , min(lowerBound) as lowerBound, min(upperBound) as upperBound

| search pred=*

Pro Tip: Remember your Python back-fill command. We can backfill searches using this macro to simulate what kind of values we would have seen, just like we have repeatedly done in this blog series.

Here’s Michael’s output again, from his presentation:

This is complicated stuff, so let's do another example using the CallCenter.csv sample data. Remember, we should have it in an index from the last blog post, "Cyclical Statistical Forecasts and Anomalies - Part 2," or you can use the |inputlookup bit of SPL to create the upperBound, lowerBound, and pred, again from the previous blog entry.

Permutations (Example)

What if I have more than one entity and I don’t like using |map?
What if I want to persist my nulls and make sure my lowerBound is never below 0?
What if I want a tighter fit between the upper and lower bounds, say 80%?
What if I want 15-minute bins instead of 10-minute bins?
What if I want to use this example with the CallCenter.csv data set?
What if there can’t be a negative number of calls, so I should floor lowerBound at 0?

Here is what that might look like:

index=callcenter | where source="si_call_volume"

| bin _time span=15m

| stats count by source, _time

| eval w=case(

   (_time>relative_time(now(), "+1d@d-5w-30m") AND _time<=relative_time(now(), "+1d@d-5w+3d+30m")), 5,

   (_time>relative_time(now(), "+1d@d-4w-30m") AND _time<=relative_time(now(), "+1d@d-4w+3d+30m")), 4,

   (_time>relative_time(now(), "+1d@d-3w-30m") AND _time<=relative_time(now(), "+1d@d-3w+3d+30m")), 3,

   (_time>relative_time(now(), "+1d@d-2w-30m") AND _time<=relative_time(now(), "+1d@d-2w+3d+30m")), 2,

   (_time>relative_time(now(), "+1d@d-1w-30m") AND _time<=relative_time(now(), "+1d@d-1w+3d+30m")), 1)

| eval shift=case(isnotnull(w),"+"+w+"w-30m,+"+w+"w-15m,+"+w+"w-0m,+"+w+"w+15m,+"+w+"w+30m,")

| where isnotnull(shift)

| makemv delim="," shift

| mvexpand shift

| eval time=relative_time(_time,shift)

| eventstats avg(count) AS pred by time, source

| eval upper=if(count>pred,count,pred)

| eval lower=if(count<pred,count,pred)

| stats avg(count) AS pred, stdev(upper) AS ustdev, stdev(lower) AS lstdev by time, source

| eval lowerBound=pred-lstdev*(sqrt(1/(1-80/100)))

| eval lowerBound=if(lowerBound<0, 0, lowerBound)

| eval upperBound=pred+ustdev*(sqrt(1/(1-80/100)))

| eval _time=time

| timechart span=15m useother=false limit=0 cont=false  min(pred) as pred , min(upperBound) as high, min(lowerBound) as low by source

| makecontinuous _time

Note the changes to the last two lines—the timechart flags and makecontinuous. In this case, I want the nulls to continue forward as gaps where my forecast couldn’t get data for whatever reason.

What if the customer doesn’t want smoothing using the future data in the Snippet #1 section above? Simply remove the positive time shifts on the line:

| eval shift=case(isnotnull(w),"+"+w+"w-30m,+"+w+"w-15m,+"+w+"w-0m,")

To view the back weeks over time, so you can see the values being smoothed, use:

|timechart ….

|timewrap series=relative 1w

Open up a line chart and set the multiseries option in Format so you can visually inspect the data being used.

Alerting

Alerting hasn’t changed much from the previous section. We have a summary index and we want to push into that summary index the real values as they occur.

index=….

| bin _time span=15m

| stats count as Actual by host,_time
| collect index=...

Then the alerting search looks at the summary index and fires off an alert when Actual was above or below our thresholds, or even when compared to our new bound, pred. Yes, we will be changing pred in future posts to multivariate predictions.
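The comparison the alerting search makes is simple enough to sketch in a few lines of Python. This is only an illustration of the rule, assuming each time bucket carries an actual value plus the forecast's lower and upper bounds; the function name classify and the labels it returns are hypothetical.

```python
def classify(actual, lower_bound, upper_bound):
    """Label one time bucket by comparing the actual value to the forecast band."""
    if actual is None or lower_bound is None or upper_bound is None:
        # Persisted nulls: no forecast or no actual value for this bucket.
        return "no_data"
    if actual > upper_bound:
        return "above"
    if actual < lower_bound:
        return "below"
    return "normal"

# Illustrative values only.
print(classify(50, 10, 40))
print(classify(5, 10, 40))
print(classify(20, 10, 40))
print(classify(None, 10, 40))
```

Keeping a separate label for missing data matters if you persisted your nulls earlier, since a gap in the forecast should not silently count as normal.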

Organization

Just as Michael suggests in his workflow, we can save these searches to macros and arrange the scheduled searches via those macros quickly.

index=… time range

… time chart or stats with a time command

| `ML_Forecast_Macro(count,90,+2d,2)`

| collect index=blah

Immediate Use On Your Installation of Splunk

Every copy of Splunk has an index=_internal. Run the above macro looking for the count of each host, or each sourcetype, or each source and look at the created curves. Do we have a cyclical forecast that makes sense? How would you run your Splunk deployment differently if you had a forecast of each source/host/sourcetype’s usage? What if one of those sources is under the forecasted threshold? What if it is above?

Holidays and Debugging

Exactly like in our last blog post in this series, you can customize the holidays from a learned or SME-set threshold value.

Conclusions

That ends the basics of statistical anomalies and forecasting in Splunk; you should have plenty of brilliant alerts for single values moving through time. There are a thousand uses for this technique, so it’s a good thing you created macros to reuse it! You’ll find this approach tackles a lot of use cases but there might be some that it doesn’t.

Don’t worry, this is only the beginning of the clever and sophisticated predictions and alerts you can craft using Splunk and the Machine Learning Toolkit. We have many more stories from our customers and Splunkers who are solving real-world problems every day using the power of this solution. I will share more of those stories and solutions in the next series of blog posts.

Until then, happy Splunking!

Special thanks to the Splunk ML Customer Advisory team including Andrew Stein, Brian Nash and Iman Makaremi.

----------------------------------------------------
Thanks!
Manish Sainani
