Important Update: While the concepts covered in this blog post are still applicable, I'm pleased to announce that we've released an ITSI content pack for monitoring and alerting which provides all the functionality of this blog in a pre-packaged format. For more information on how to get started, review the content pack documentation or check out the blog covering the release of this content pack.
I’ve previously authored several blog posts covering thresholding basics and alerting best practices in Splunk IT Service Intelligence (ITSI). In those posts, I focused on foundational concepts and left a lot of implementation details to interpretation; moreover, as my experiences and methodologies evolve, so too does my guidance.
In this blog post, I intend to get a lot more prescriptive and lay out a blueprint for enterprise-wide alerting across all your services. We’ll zoom out from single-service or single-KPI based alerts and generate a design that is uniform and applicable to all services and KPIs in your ITSI environment. I believe that you’ll quickly see the benefits of this design, ranging from performance to maintainability to flexibility.
Interestingly enough, this design happens to mirror a popular risk-based security design strategy discussed at .conf18 called “Say Goodbye to Your Big Alert Pipeline, and Say Hello to Your New Risk-Based Approach.” If you buy into the design laid out in this blog, I encourage you to watch the replay of that talk. Ideally, you'll draw several parallels between their approach and mine, and you'll uncover even more alerting ideas.
To that end, I foresee the guidance in this blog evolving further toward that risk-based approach, and the technical details of my design may change slightly, or perhaps even dramatically, over time as the product and methodologies evolve. Nonetheless, if you’re actively increasing the number of services, KPIs, or alerts in your environment, this strategy will probably feel like a step in the right direction, and it’s time to consider changing your approach.
The alerting design involves two major concepts, so before we dive deep, an overview of the design and those concepts is warranted:
Concept 1: Create, in fact proliferate, notable events for any noteworthy changes to services, KPIs, and entities. We’ll depend heavily on custom correlation rules to achieve this. Additionally, we’ll build each correlation rule to evaluate across all services, KPIs, and entities, leading to a performant, maintainable, and uniform implementation across our environment.
Concept 2: Apply attributes to notable events to facilitate grouping and alerting logic. Attributes are nothing more than field/value pairs present in the itsi_tracked_alerts index. We’ll depend on typical core Splunk concepts to achieve this, such as lookups, calculated fields, and eval statements in our correlation searches. Once present, these attributes can be leveraged in notable event aggregation policies, alert action rules, and the episode review.
Putting it all together, it looks like this… We’ll build multiple correlation searches looking for bad stuff happening in our services, KPIs, and entities. When our rules detect bad stuff, notable events will be created. We’ll apply various attributes to these notable events, allowing us to group related notables using aggregation policy logic to cut down on the noise. And lastly, we’ll configure alert actions in our aggregation policies to produce alerts to the NOC based on our desired alerting rules.
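To make these two concepts a bit more concrete, here is a minimal sketch of what the body of one such correlation search might look like: it scans the ITSI summary index for degraded health scores and applies attributes via a lookup and an eval. This is an illustration under assumptions, not a finished search. The lookup file `service_alert_groups.csv` and the `alert_group` and `alertable` field names are hypothetical choices we'll develop in later steps, and while fields like `kpi`, `serviceid`, `alert_value`, and `alert_severity` are typical of the `itsi_summary` index, you should verify them in your own environment.

```
index=itsi_summary kpi=ServiceHealthScore alert_severity IN ("high", "critical")
| lookup service_alert_groups.csv serviceid OUTPUT alert_group
| eval alertable=if(alert_severity="critical", 1, 0)
| table _time, serviceid, kpi, alert_value, alert_severity, alert_group, alertable
```

The key idea is that the attribute logic lives inside the search itself, so every notable event it creates arrives in `itsi_tracked_alerts` already carrying the fields our aggregation policies and alert rules will key on.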
Indexes to Know
Like all things Splunk, ITSI stores much of its data in several key indexes, and our configurations and correlation rules will reference them. Here’s a quick overview of the key indexes used by ITSI and the data stored within each:
- itsi_summary – Each time ITSI runs a KPI or service health score search, the resulting information, including value and severity, is stored here
- itsi_tracked_alerts – Each notable event produced by ITSI and all its corresponding information is stored here
- itsi_grouped_alerts – When one or more notable events are grouped together by a notable event aggregation policy, the grouping information, specifically group_id and event_id, is stored as events here
A Step-by-Step Approach
Because our marketing team likes bite-sized blogs and because you don’t need to eat this elephant all at once, I’ve broken out the design into five steps. Each step will be its own blog, and once you’ve completed the fifth step, you’ve effectively implemented the approach and are free to alter and augment as you see fit. The five steps are:
- Create notable events when service health scores degrade
- Apply an alert_group attribute to notable events to group related notables
- Create additional correlation searches for other noteworthy situations
- Build in alert rules with the alertable attribute
- Apply throttling to alert once per episode
Be Ready to Test…and Here’s How
As you try this out and make changes to your environment, you’ll want to test early and often. The customer I was working with had a very simple and effective testing method that I’ll share: create a test service with one or more test KPIs. When you need to break a service for testing purposes, modify your test service’s threshold values to simulate failure. Similarly, as we start building up our notable event aggregation policies (NEAPs), you can build a test NEAP that includes only notables from your test service. This provides a simple, isolated environment for testing your changes.
Ready? Go on to Step 1...