Excited about your shiny new Splunk IT Service Intelligence (ITSI) license? Well, you should be! But navigating from your first service creation to meaningful and trusted alerts takes some care and planning. In this multi-part blog, we'll outline some practical guidance to get you going. Starting of course...with the basics.
The ITSI Hierarchy of KPIs and Services
To understand how service issues will ultimately result in meaningful alerts, we should briefly revisit the hierarchy of KPIs and services. We'll refer back to these concepts during alert configuration, so having a basic working understanding of this hierarchy is important. ITSI Services are designed as follows:
- ITSI Service
- Service health score
- KPIs (Optional)
- Dependent services (Optional; sometimes referred to as subservices)
Each service will always have a health score, which is computed based on the status of the KPIs and subservices defined for that service. KPIs are optional and, when defined, require threshold configurations. Dependent services are also optional and are simply references to other already configured ITSI services on which this service depends.
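If it helps to see the hierarchy as a data structure, here is a minimal Python sketch. The class and field names are my own illustration, not ITSI's object model, and the sketch deliberately does not attempt to reproduce how ITSI computes the health score; it only lists what feeds it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KPI:
    name: str
    severity: str = "normal"  # in ITSI this is driven by the KPI's thresholds


@dataclass
class Service:
    name: str
    kpis: List[KPI] = field(default_factory=list)              # optional
    depends_on: List["Service"] = field(default_factory=list)  # optional subservices

    def health_score_inputs(self) -> List[str]:
        """Names of the KPIs and subservices that feed this service's health score."""
        return [k.name for k in self.kpis] + [s.name for s in self.depends_on]


# Hypothetical services, purely for illustration.
db = Service("Database", kpis=[KPI("Query latency")])
web = Service("Web store", kpis=[KPI("Logins"), KPI("CPU")], depends_on=[db])

print(web.health_score_inputs())
```

Note that a dependent service is just a reference to another `Service` object, which mirrors the idea that subservices are already configured services in their own right.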
Thresholds vs. Alerts
Let’s first clarify the difference between thresholds and alerts. In ITSI, these are related but separate concepts. Thresholds apply only to KPIs; they dictate when a KPI severity (or status, as it is sometimes called) changes from normal to critical, high, low, etc. KPI severities are viewable in the service analyzer dashboard, deep dives, and other UI locations, but in and of themselves don’t generate alerts.
Alerts are generated from additional configurations, driven from KPI severity and service health score changes. We’ll dig into these configurations later, but for now, we just want to acknowledge the difference between the two concepts.
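To make the threshold half of that distinction concrete, here is a small Python sketch of what a threshold does: it maps a KPI value to a severity and nothing more. The function name and the threshold values are assumptions for illustration; ITSI does this for you through the threshold configuration UI.

```python
def severity_for(value, thresholds):
    """Return the severity of the highest threshold the value has reached.

    thresholds is a list of (lower_bound, severity) pairs in any order.
    Values below every bound are "normal".
    """
    result = "normal"
    for bound, severity in sorted(thresholds):
        if value >= bound:
            result = severity
    return result


# Hypothetical static thresholds for an average CPU KPI (percent).
cpu_thresholds = [(80, "high"), (95, "critical")]

for value in (50, 85, 97):
    print(value, severity_for(value, cpu_thresholds))
```

Nothing in this function sends an email or opens a ticket. It only assigns a severity, which is exactly why alerts need their own configuration.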
Alerts vs. Notable Events
At the risk of being pedantic, what is an alert anyway? Is it an email? A text message? A ticket to a ticketing system? A flashing red light in the NOC? Something else? Within ITSI, we take a two-tiered approach to generating alerts. Notable events are created first, which then lead to one or more traditional alerts. We configure ITSI to continuously monitor KPI statuses and service health scores; when we detect problems or concerns, we can then create notable events.
Notable events are not alerts, at least in the traditional sense; they are simply events of interest viewable from the Notable Events Review dashboard. A separate and final configuration, using notable event aggregation policies, turns one or more notable events into your traditional alerts like emails, tickets, etc. Put the process all together and it looks like this: ITSI monitors KPI statuses and service health scores, problems create notable events, and aggregation policies turn those notable events into alerts.
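Here is a minimal Python sketch of that two-tier flow. Everything in it is hypothetical (the function names, the "critical only" rule, the grouping by service); real notable events and aggregation policies are configured inside ITSI, not written by hand like this.

```python
from collections import defaultdict


def create_notables(kpi_results):
    """Tier 1: create one notable event for each KPI result that is critical."""
    return [
        {"service": r["service"], "kpi": r["kpi"]}
        for r in kpi_results
        if r["severity"] == "critical"
    ]


def aggregate_to_alerts(notables):
    """Tier 2: group notable events by service and emit one alert per group."""
    groups = defaultdict(list)
    for n in notables:
        groups[n["service"]].append(n["kpi"])
    return [
        f"{service}: {len(kpis)} critical KPI(s) ({', '.join(kpis)})"
        for service, kpis in groups.items()
    ]


# Hypothetical KPI results.
results = [
    {"service": "Web store", "kpi": "Logins", "severity": "critical"},
    {"service": "Web store", "kpi": "CPU", "severity": "critical"},
    {"service": "Database", "kpi": "Query latency", "severity": "normal"},
]

alerts = aggregate_to_alerts(create_notables(results))
print(alerts)
```

Two notable events become a single alert here, which is the point of the second tier: fewer, more meaningful notifications.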
Correct Entity Groupings Are Paramount
The entities selected for your service directly impact the aggregate and per-entity results for each KPI. Therefore, grouping the right entities together in your service is important to ensure success with thresholds and alerts. Typically, you’ll want to ensure that each entity in your service behaves about the same as every other entity. Predictably different entities should be broken out to their own subservice. Let’s use a handful of examples to clarify:
- Entities that span two different data centers should typically be broken out into DC-specific subservices
- Batch servers or dedicated purpose servers should typically be broken out from their general purpose counterparts
- Entities spanning different architectural tiers (DB, Web, Application, etc.) should be broken out
You may be asking, why is this the case? Let’s use the batch servers example above and assume I have a farm of 20 app servers associated with a critical business app that I’m monitoring in Splunk ITSI. Let's then assume that 3 of those 20 servers are solely responsible for processing nightly batch operations jobs. If all 20 servers are defined as entities in one service, KPIs like average CPU are nearly meaningless in aggregate (and therefore difficult to accurately threshold) because the batch servers are expected to exhibit different behavior. It also makes leveraging per-entity thresholds and anomaly detection nearly impossible down the road. So instead, break the 3 batch servers out into their own subservice and keep the other 17 grouped together.
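A quick bit of arithmetic shows the problem. The CPU numbers below are made up for illustration: 17 general-purpose servers idling along at 30% and 3 batch servers running hot at 95% during the nightly job.

```python
from statistics import mean

general = [30.0] * 17  # hypothetical CPU % on the general-purpose servers
batch = [95.0] * 3     # hypothetical CPU % on the batch servers mid-job

combined = mean(general + batch)

print(f"all 20 servers: {combined:.2f}%")       # all 20 servers: 39.75%
print(f"general only:   {mean(general):.2f}%")  # general only:   30.00%
print(f"batch only:     {mean(batch):.2f}%")    # batch only:     95.00%
```

The combined average of 39.75% describes neither group. Split into two services, each average is stable and easy to threshold on its own terms.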
KISS (Keep It Simple)
Before we get into the meat and potatoes of how to configure thresholds and alerts, please remember to keep it simple. Generating too many KPIs or overly complex alerting policies will quickly feel like eating an elephant. With that in mind, a large part of the guidance below revolves around keeping it simple—at least to start.
Best Practices for Thresholding KPIs
Define which severities your organization will use
ITSI allows for 6 different severities—normal, critical, high, medium, low, info—but that doesn’t mean you need to use them all. In fact, to keep it simple, I’d recommend you don’t. If you use nothing but normal and critical, that’s probably not a bad start.
Try to maintain a consistent definition for each severity
What’s the difference between high and critical? As an organization, you should have an answer to that question, and every KPI should be thresholded to the same definitions. This will make the rules for alert generation and remediation processes much more consistent and maintainable in the long run, particularly as your ITSI instance grows. It also ensures that those responsible for monitoring Glass Tables and the Service Analyzer don't have to perform as much mental context switching between different services and KPIs.
Be careful—this problem is subtle and will creep up on you quickly. Be particularly careful when thresholding KPIs across different teams (who will have different perspectives on the meaning of each severity) or when using Splunk ITSI Module KPIs and pre-configured time policies which have predefined threshold values that may or may not align with your organization's definitions.
You don't have to use my definitions, but just to get you thinking about how you could define each severity, consider the following examples:
- Critical: A KPI in this status is absolutely unexpected and will immediately be configured to generate an alert
- High: A KPI in this status has exceeded the bounds of what we would consider normal but is not yet cause for an alert
Don’t Threshold Every KPI
You can effectively not threshold a KPI by choosing the info severity for all results. This allows the KPI to be present in ITSI (particularly useful in deep dives), but doesn’t affect the service health score computation. This is pretty useful for KPIs being monitored from other monitoring tools, whose values are never directly indicative of a problem, whose results cannot be consistently relied upon, or that you are absolutely lost on how to threshold.
Per Entity Thresholds vs. Aggregate
Per-entity thresholds are interesting, but they are more complex than aggregate thresholds. I’d highly recommend sticking with aggregate thresholds to start. The downside of steering clear of per-entity thresholds is that it’s difficult to catch a single entity that has gone off the rails and is being masked by the strong performance of the other entities. Consider solving that problem a little later down the road when your ITSI deployment is stable and humming and you’re certain it’s needed; at that point, you could consider per-entity thresholds, anomaly detection, or a separate KPI which tracks per-entity behavior.
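To illustrate the masking problem and the "separate KPI" option, here is a small sketch with hypothetical host names, values, and a hypothetical critical threshold of 90. The second KPI simply tracks the worst entity instead of the average.

```python
from statistics import mean

# Hypothetical CPU % per entity: 19 healthy hosts and one that is pegged.
cpu_by_entity = {f"app{i:02d}": 20.0 for i in range(1, 20)}
cpu_by_entity["app20"] = 100.0

CRITICAL = 90.0  # assumed critical threshold for both KPIs

aggregate_kpi = mean(cpu_by_entity.values())    # what an aggregate threshold sees
worst_entity_kpi = max(cpu_by_entity.values())  # a separate KPI tracking the worst entity

print(aggregate_kpi, aggregate_kpi >= CRITICAL)
print(worst_entity_kpi, worst_entity_kpi >= CRITICAL)
```

The average stays far below the threshold while one host is completely saturated. A max-style KPI catches it without the overhead of maintaining thresholds per entity.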
Choosing Static vs. Adaptive Thresholds
Everyone loves adaptive thresholds and if it’s a hammer, every KPI tends to look like a nail. But adaptive thresholding needs to be approached methodically and with a slightly different mindset around thresholding and alerting. Blindly turning on adaptive thresholding and “clicking through” the pre-defined threshold policies is a recipe for failure.
If you’re interested in how to get started with adaptive thresholding, keep an eye out for the next post in this series. But in the spirit of KISS, consider first relying a little more heavily on static thresholds and branch out into adaptive once you're comfortable with ITSI and see a clear need for it.
Time Policies and KPI Threshold Templates Are Your Friend
One of the most powerful aspects of KPI thresholding is the ability to flex the threshold configurations per hour, per day. These policies allow you to apply the most effective threshold algorithms and values to your KPIs at a very granular and organization-specific level. For example, if your KPI should be thresholded during the workday from 8am to 5pm much differently than it should be thresholded over the weekend, create separate time policies to isolate those expected differences.
Once again, I must implore you, keep it simple here too; create as few time policies as absolutely possible to cover the majority of your organization and services. Also, try to keep the total number of time windows you create in the week as small as possible. After all, there are 168 hours in a week and that’s an awful lot of threshold configurations to maintain if you wanted to specify different configurations for every single hour of every day of the week.
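As a rough sketch of the idea, a time policy is just a lookup from "day and hour" to a threshold configuration. The policy names, hours, and threshold values below are assumptions for illustration, and ITSI manages all of this through KPI threshold templates rather than code.

```python
from datetime import datetime

# Hypothetical time policies: (name, weekdays, start_hour, end_hour, critical_threshold).
# weekday() numbering: Monday is 0, Sunday is 6.
POLICIES = [
    ("workday", range(0, 5), 8, 17, 90),
]
DEFAULT_POLICY = ("off-hours", 70)  # applies whenever no policy above matches


def policy_for(ts):
    """Return (policy_name, critical_threshold) for a timestamp."""
    for name, weekdays, start, end, critical in POLICIES:
        if ts.weekday() in weekdays and start <= ts.hour < end:
            return name, critical
    return DEFAULT_POLICY


print(policy_for(datetime(2024, 1, 3, 10)))  # a Wednesday morning
print(policy_for(datetime(2024, 1, 6, 10)))  # a Saturday morning
print(7 * 24)                                # hours in a week
```

One policy plus a default covers all 168 hours of the week here. Every extra time window is another configuration to maintain, so add them only when behavior is predictably different.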
What’s Wrong with Static Thresholds Anyway?
Okay, here’s the part where you boo and hiss me out of the room, right? But seriously, static thresholds aren’t evil. They are dead simple to configure and understand, and you’ve probably already got a good idea as to what the correct threshold values should be from experience. When you couple all this with the ability to select different static values at different times of the day and week by leveraging time policies, this really should be your go-to KISS thresholding strategy.
Use Adaptive Thresholding to Learn Static Values
If you hate me for what I just said, there's a happy medium that many people don't realize is available: have ITSI "learn and suggest" appropriate thresholds for a KPI using adaptive thresholding. Once computed, flip back to static values and tune as needed. This avoids some of the pitfalls of normal adaptive thresholding while still giving you a nice granular and machine learned starting point.
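Conceptually, "learn, then freeze" looks something like the sketch below. To be clear, this is not ITSI's adaptive thresholding algorithm; I'm using plain percentiles from Python's standard library as a stand-in, and the function name and percentile choices are my own assumptions.

```python
from statistics import quantiles


def suggest_static_thresholds(history, high_pct=95, critical_pct=99):
    """Suggest static threshold values from the percentiles of historical KPI values."""
    cuts = quantiles(history, n=100, method="inclusive")  # 99 cut points: 1st..99th percentile
    return {"high": cuts[high_pct - 1], "critical": cuts[critical_pct - 1]}


# Hypothetical KPI history (for example, response times in milliseconds).
history = [120, 130, 125, 140, 135, 128, 500, 132, 127, 138, 131, 129]

suggested = suggest_static_thresholds(history)
print(suggested)
```

The suggested values are a starting point only. Once they are static, you tune them by hand like any other static threshold.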
Best Practices for Verifying and Tuning Threshold Configurations
Excellent! You’ve gotten your services built and taken a stab at thresholding your KPIs. Now, how do you go about sanity checking your work before you release it into the wild? Let me give you some guidance on that too.
Remember, we’re still not yet talking about alerts, but it’s time to start thinking about them. For instance, if you’ve decided that you’ll be triggering alerts immediately when KPIs go critical and you also observe a particular KPI in a critical status 75% of the time, you probably thresholded it wrong.
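One simple sanity check is to measure how much of a KPI's history was spent at each severity. Here is a minimal sketch; the severity history is invented, and in practice you would pull it from the KPI's results in Splunk.

```python
def fraction_in(severity, history):
    """Fraction of KPI results in the history that had the given severity."""
    return sum(1 for s in history if s == severity) / len(history)


# Hypothetical severity history for one KPI, one entry per KPI run.
history = ["critical", "critical", "normal", "critical"]

share = fraction_in("critical", history)
print(f"{share:.0%} of results were critical")
if share > 0.5:
    print("Probably thresholded wrong for an alert-on-critical policy")
```

The 50% cutoff is an arbitrary assumption. The right number depends on how often you can tolerate an alert firing.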
Deep dive threshold validation
Deep dives display the computed KPI severity and also allow you to look back in time (at least for as long as the KPI has been defined and backfilled). What makes this awesome is that you can tune the threshold, return to the deep dive view and upon refreshing see the results of the new threshold configurations.
Work KPI by KPI using this method to validate the threshold results as thoroughly as you can. Here is where you want to validate that your organization's status definitions are being met. Again, if critical is going to trigger a page out in the middle of the night, you better be working to validate that you’re not seeing high volumes of unwarranted critical status KPIs.
A Deep Dive GOTCHA!
Sooner or later, you're going to get tripped up with the information you see in the deep dive. Remember that your KPI is running every 5 or 15 minutes; that's producing a lot of data points over time. So many data points in fact that, as you expand the deep dive time window, ITSI will visually aggregate those many KPI results into larger blocks of time.
To better illustrate this, the screenshot below graphs both the actual number of data points in a 24 hour period for a KPI running every 5 minutes, and also the summarized graph you'd see in a Deep Dive over the same 24 hours. As you can see, the deep dive summarization "hides" the peaks and valleys of the actual data unless you zoom in your time window enough. When summarization occurs, the KPI calculation metric selector at the top of the screen flexes the type of visual summarization you see over that block of time.
In short, when validating KPI thresholds, be sure to select small enough windows of time to ensure you're seeing the full granularity of the result history, or flex the KPI aggregation metric to the function that will best display the historical outliers for you to accurately validate KPI thresholds.
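The effect is easy to reproduce. The sketch below takes a hypothetical 24 hours of 5-minute KPI results containing a single spike and summarizes them into hourly blocks, once by average and once by max. The bucketing is my own simplification of what a deep dive does visually, not ITSI's actual rendering logic.

```python
def summarize(values, bucket_size, fn):
    """Collapse consecutive values into buckets, applying fn to each bucket."""
    return [
        fn(values[i:i + bucket_size])
        for i in range(0, len(values), bucket_size)
    ]


# 24 hours of results at one per 5 minutes = 288 points (hypothetical values).
points = [20.0] * 288
points[100] = 99.0  # one short spike

hourly_avg = summarize(points, 12, lambda bucket: sum(bucket) / len(bucket))
hourly_max = summarize(points, 12, max)

print(max(hourly_avg))  # the spike is flattened by averaging
print(max(hourly_max))  # the spike survives when summarizing by max
```

Averaged over the hour, the spike barely registers, while the max view preserves it. That is the same trade-off you are making with the aggregation selector in the deep dive.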
Compare with a recent past incident
While in the deep dive validating a KPI, it would behoove you to refer back to some recent incidents associated with the service. Use the time picker in the upper right to go back to when the incident first began. For instance, if you recently had a system problem affecting users' ability to log in, you should clearly be able to see the login KPI go high or critical at the start of the incident, or see some other related KPI go critical around that time.
If everything remains green or doesn’t appear to rise to an alarm level, it could indicate your thresholds are too forgiving or that you're running across the deep dive gotcha described above.