TIPS & TRICKS

Alerts and Dashboards and Searching, Oh My!

"So you're telling me you have an employee watching a dashboard at all times? How is that not expensive?"

"So you get these emails from your alerts, but there's no action to take when you get them? How is that not spam and causing you to ignore them all?"

"So everyone emails these searches to each other to run if you want to know if the system is stable? How is that not prone to human error?"

I've come across all of these quirks, and I get why. When you're a member of a technical team, you often do odd things to keep the system up: things that once worked, and that your silly human brain now compels you to repeat, even when it's arguably irrational. Sometimes you're in so deep that you don't notice this silliness until someone else comes along and points it out to you. If you're having trouble relating, recall the Band-Aid™ cron jobs (or Scheduled Tasks, for our Windows cousins) that you've set up.

Having a stronger practice around when and why you use Splunk's searches, alerts, and dashboards can make your Splunk usage dramatically more effective.

Incident Life-Cycle

"...there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know. ..." - Donald Rumsfeld, 2002

Through my collaborations with Splunk users, we've come to recognize the circumstances that make certain Splunk product features the best practice at a given time for a given goal. Furthermore, such goals, when chained together, represent the life-cycle of an incident: symptom → root cause investigation → permanent fix; or, as we know things really work at an enterprise: symptom → root cause investigation → temporary workaround & monitoring → permanent fix. That "temporary workaround & monitoring" phase may be restarting a server when a known confluence of symptoms occurs. It's a necessary evil given the reality that at an enterprise, there are change windows, approvals, red tape, and political polish involved in getting any permanent fix created and applied.

As a visual thinker, I realized we had a 2x2 matrix showing root cause in relation to an issue occurring. Kind of like a Johari Window for incidents!

                           Root Cause Unknown    Root Cause Known

    Issue Exists, Unaware          Q0                   Q2

    Issue Exists, Aware            Q1                   Q3

This matrix did a great job of capturing this life-cycle! All is quiet (Q0) until you learn of some odd behavior. When these symptoms occur, you don't know the root cause, but you're now aware that an issue exists (see Q1) so you start an investigation to uncover the root cause (still Q1). Once that is known, you enter a cycle of checking if the symptoms present themselves (Q2), and if so, applying a fix (Q3).

Splunk Features by Quadrant

So how might Splunk help here? Let's start by labeling each quadrant in accordance with what we've outlined thus far:

                           Root Cause Unknown    Root Cause Known

    Issue Exists, Unaware       listening           monitoring

    Issue Exists, Aware       investigating          attacking
By recognizing what we know and don't know, we can identify what action to take in each phase.

Listening: Think of this as the status quo. Business as usual. While you go about your everyday activities, you may learn about a compelling confluence of symptoms. This discovery could occur as formally as an incident ticket landing on your desk, or as subtly as you merely noticing patterns or behaviors that, while you could not have anticipated them, you know just aren't right. Think of the latter as noticing a Splunk dashboard or glass table that seems abnormal. The point is that dashboards and glass tables are great for exposing the simultaneous patterns and behaviors of your symptom that individually may be innocuous, but together (combined with your innate technical background and knowledge of your systems) tell you something is worth investigating.

Investigating: So you jump in. Clicking around, exploring the machine data. Pulling in additional evidence. Whatever it might be, you're spelunking now! There's no guidance for this issue since its root cause isn't known, so you're flexing your ninja skills and writing your best SPL. Eventually, you'll discover the root cause and the specific symptoms that correlate with the issue. You know that you can then save that SPL as an alert, or some other form of monitoring.
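An ad-hoc investigation of this kind might look something like the following sketch. The index, sourcetype, and field names here are purely illustrative, not from any real environment:

```spl
index=web sourcetype=access_combined status>=500
| timechart span=5m count by host
```

A search like this charts server errors over time, split by host, so a spike on one host (an individually innocuous symptom) can stand out against the others during your spelunking.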

Monitoring: In parallel with getting a fix going, you can craft your SPL into a clever search that notifies you when the symptoms occur. With your scheduled search now in place, you can rest assured that should the issue present itself again, you'll be alerted. And when that happens, you'll have instructions on what action to take so you can start attacking!
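As a hypothetical sketch (again with illustrative index, sourcetype, and threshold values), the investigative search above could be tightened into something you'd save as a scheduled alert:

```spl
index=web sourcetype=access_combined status>=500 earliest=-15m
| stats count
| where count > 100
```

Scheduled every 15 minutes and set to trigger when results are returned, a search like this turns your hard-won investigation into monitoring that fires only when the known symptom pattern recurs.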

Attacking: You've created an actionable alert or even a scripted or automated response. Either way, when it's triggered, you're attacking. Now that you know the issue is occurring AND what the cause is, you can work to resolve it.

Now let's say the same thing, but oriented by feature:

                           Root Cause Unknown          Root Cause Known

    Issue Exists, Unaware  dashboards & glass tables   scheduled searches & alerts

    Issue Exists, Aware    ad-hoc searching (SPL)      actionable alerts & automated responses

Applying Concepts

If someone is watching a dashboard for known symptoms, try a scheduled search. If there are alerts that are informational, try using a dashboard. If you're sharing SPL, save it as a report. And lastly, reserve your SPL for forensics.

These are not ultimatums, but rather practices against which you can sanity-check your approach and align features with your current goal.

I'll close with a song that comes to mind, The Splunker by Kinnie Rojyrz:

"You got to know when to search 'em,
Know when to alert 'em,
Know when to dashboard,
And know when to run."

Posted by Burch Simon

Burch is what happens when you mix a passion for technology with a love for performing comedy. If you find a Burch in the wild, engage lovingly with discussions of Splunk Best Practices and your hardest SPL challenges.
