
A Blueprint for Splunk ITSI Alerting - Step 5

We’re officially in the home stretch of the "A Blueprint for Splunk ITSI Alerting" blog series; if you’ve made it this far, I’m going to assume you’re willing to finish out this last step. In this blog post, we’re going to focus on throttling our alerts. From the previous step, we’re now producing actionable alerts, but without throttling, we’re producing them far too often. Ideally, we want to alert once per episode, or maybe once an hour or once a day, but certainly not once per notable.

I’m going to come right out and say that I wish throttling were a little simpler and more straightforward to achieve in NEAPs, but it is what it is. So the steps in this blog might feel a little kludgy, and you may have to work a bit to understand what’s going on here.

An English Description First

Let me start by using words to describe what we're going to do, and then we’ll actually do it.

We’re going to create one final correlation search in the environment that regularly looks at each episode group and creates a special “Alert Trigger” notable event for any group that has at least one notable whose alertable field equals 1. This correlation search will be smart enough to produce just one of these special “Alert Trigger” notables per episode, thus providing the throttling. Lastly, we’ll update our NEAP action logic to trigger based on the presence of this special alert trigger notable and voilà, we’re done!

The Alert Trigger Correlation Search

This one is a bit ugly, so bear with me…

((index=itsi_grouped_alerts) OR (index=itsi_tracked_alerts alertable=1))
| eventstats values(itsi_group_id) as itsi_group_id by event_id
| search index=itsi_tracked_alerts
| mvexpand itsi_group_id
| lookup itsi_notable_event_group_lookup _key AS itsi_group_id OUTPUT severity AS lookup_severity, status AS lookup_status, owner AS lookup_owner
| eval severity=coalesce(lookup_severity, severity), status=coalesce(lookup_status, status), owner=coalesce(lookup_owner, owner)
| search status < 5
| eventstats count(eval(alertable=1)) as alertable_count count(eval(alert_trigger=1)) as alert_trigger_count by itsi_group_id
| where alertable_count>0 AND alert_trigger_count<1
| dedup itsi_group_id
| eval alert_trigger=1
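
Before wiring up the NEAP action, it’s worth confirming that the trigger notables are actually being written. This is just a sanity check I’d suggest running ad hoc (not part of the blueprint itself), using only the index and fields we’ve already established:

index=itsi_tracked_alerts alert_trigger=1
| stats count as trigger_count latest(_time) as latest_trigger_time by itsi_group_id
| convert ctime(latest_trigger_time)

If each alerting episode shows up here with a trigger_count of 1 within your throttle window, the correlation search is doing its job.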

Configuring the Suppression Timeframe

You’ll notice that this search is configured to run over the last 24 hours of notable events. This timeframe is critically important because it dictates the throttling duration. If the episode has been open for more than 24 hours, the alert_trigger notable event will no longer be returned by the search, so a new one will be created and a new alert action will fire. If you want to shorten or lengthen the throttle window, simply run this search over a shorter or longer duration to match your desired throttle time.
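
Under the hood, an ITSI correlation search is a scheduled saved search, so the time window can also be managed in configuration rather than the UI. Here’s a minimal sketch of what the scheduling portion of the stanza might look like; the stanza name and cron schedule are placeholders for your own correlation search, and the ITSI-specific action settings that the UI generates for you are omitted:

# savedsearches.conf (stanza name and schedule are placeholders)
[Alert Trigger Correlation Search]
cron_schedule = */5 * * * *
dispatch.earliest_time = -24h
dispatch.latest_time = now

Changing dispatch.earliest_time to, say, -4h would shrink the throttle window to four hours without touching the search itself.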

Even Smarter Suppression

The correlation search above is wholly responsible for when and how we suppress. As a result, we can continue to evolve the rule logic to make our suppression even smarter. For instance, if my episode has been "open" for more than 24 hours, but the most recent alertable notable is fairly old (let's say 10 hours), we may not want to trigger a new alert. We can also embed that type of logic into the correlation search so that we'll only alert if our most recent notable is fairly new (let's say less than 60 minutes old). Here's how we would update our alert trigger correlation rule to include these checks:

((index=itsi_grouped_alerts) OR (index=itsi_tracked_alerts alertable=1))
| eventstats values(itsi_group_id) as itsi_group_id by event_id
| search index=itsi_tracked_alerts
| mvexpand itsi_group_id
| lookup itsi_notable_event_group_lookup _key AS itsi_group_id OUTPUT severity AS lookup_severity, status AS lookup_status, owner AS lookup_owner
| eval severity=coalesce(lookup_severity, severity), status=coalesce(lookup_status, status), owner=coalesce(lookup_owner, owner)
| search status < 5
| eventstats count(eval(alertable=1)) as alertable_count count(eval(alert_trigger=1)) as alert_trigger_count max(_time) as latest_alertable_time by itsi_group_id
| eval seconds_since_last_alertable_notable = now() - latest_alertable_time
| where alertable_count>0 AND alert_trigger_count<1 AND seconds_since_last_alertable_notable < 3600
| dedup itsi_group_id
| eval alert_trigger=1


Conclusion

Well, that’s it! No more steps, so you should be rockin’ and rolling now. I hope you see the same benefits in this design as I do. It’s fairly easy to maintain, it’s quite performant, and the alerting design is universal across all services. All goodness to me. Understand that this is a starting point, a design blueprint, and you should alter and augment it as the situation dictates for your environment. Again, if you want to connect with me to discuss more, please contact me via the LinkedIn link in my author bio below.

Posted by

Jeff Wiedemann

Prior to Splunk, Jeff spent years as an architect at a healthcare software company where he got his first Splunk contact high. As it turns out, analyzing seemingly incoherent data, continuously identifying new insights, and making sound data-driven decisions can be quite fun. Nerd alert! When not Splunking, Jeff might be doing something relaxing and fun, but more likely than not, he's got his hands full with his two boys.
