TIPS & TRICKS

Alerts and Dashboards and Searching, Oh My!

"So you're telling me you have an employee watching a dashboard at all times? How is that not expensive?"

"So you get these emails from your alerts, but there's no action to take when you get them? How is that not spam and causing you to ignore them all?"

"So everyone emails these searches to each other to run if you want to know if the system is stable? How is that not prone to human error?"

I've come across all of these quirks, and I get why. When you're a member of a technical team, you often do odd things to keep the system up: things that once worked, and that your silly human brain now compels you to repeat, even when it's arguably irrational. Sometimes you're in so deep that you don't notice this silliness until someone else comes along and points it out to you. If you're having trouble relating, recall the Band-Aid™ cron jobs (or Scheduled Tasks, for our Windows cousins) that you've set up.

Having a stronger practice around when and why you use Splunk's searches, alerts, and dashboards can make your Splunk usage dramatically more effective.

Incident Life-Cycle

"...there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know. ..." - Donald Rumsfeld, 2002

Through my collaborations with Splunk users, we've come to recognize the circumstances that make certain Splunk product features the best practice at a given time for a given goal. Furthermore, such goals, when chained together, represent the life-cycle of an incident: symptom → root cause investigation → permanent fix; or, as we know things really work at an enterprise: symptom → root cause investigation → temporary workaround & monitoring → permanent fix. That "temporary workaround & monitoring" phase may be restarting a server when a known confluence of symptoms occurs. It's a necessary evil given the reality that at an enterprise, there are change windows, approvals, red tape, and political polish involved in getting any permanent fix created and applied.

As a visual thinker, I realized we had a 2x2 matrix showing root cause in relation to an issue occurring. Kind of like a Johari Window for incidents!

                           Root Cause Unknown    Root Cause Known

    Issue Exists, Unaware          Q0                   Q2

    Issue Exists, Aware            Q1                   Q3

This matrix did a great job of capturing this life-cycle! All is quiet (Q0) until you learn of some odd behavior. When these symptoms occur, you don't know the root cause, but you're now aware that an issue exists (see Q1) so you start an investigation to uncover the root cause (still Q1). Once that is known, you enter a cycle of checking if the symptoms present themselves (Q2), and if so, applying a fix (Q3).

Splunk Features by Quadrant

So how might Splunk help here? Let's start by labeling each quadrant in accordance with what we've outlined thus far:

                           Root Cause Unknown    Root Cause Known

    Issue Exists, Unaware       listening           monitoring

    Issue Exists, Aware       investigating          attacking
By recognizing what we know and don't know, we can identify what action to take in each phase.

Listening: Think of this as the status quo. Business as usual. While you go about your everyday activities, you may learn about a compelling confluence of symptoms. This discovery could occur as formally as an incident ticket landing on your desk, or as subtly as you merely noticing patterns or behaviors that, while you could not have anticipated them, you know just aren't right. Think of the latter as noticing a Splunk dashboard or glass table that seems abnormal. The point is that dashboards and glass tables are great for exposing the simultaneous patterns and behaviors of your symptom that individually may be innocuous, but together (combined with your innate technical background and knowledge of your systems) tell you something is worth investigating.

Investigating: So you jump in. Clicking around, exploring the machine data. Pulling in additional evidence. Whatever it might be, you're spelunking now! There's no guidance for this issue since its root cause isn't known, so you're flexing your ninja skills and writing your best SPL. Eventually, you'll discover the root cause and the specific symptoms that correlate with the issue. You know that you can then save that SPL as an alert, or some other form of monitoring.
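An ad-hoc investigation of this kind might look something like the following sketch. The index, sourcetype, and field names here are purely illustrative, not from any real environment:

```spl
index=web sourcetype=access_combined status>=500
| timechart span=5m count by host
```

A search like this charts server errors over time, split by host, so a spike on one host (an individually innocuous symptom) can stand out against the others during your spelunking.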

Monitoring: In parallel with getting a fix going, you can craft your SPL into a clever search that notifies you when the symptoms occur. With your scheduled search now in place, you can rest assured that should the issue present itself again, you'll be alerted. And when that happens, you'll have instructions on what action to take so you can start attacking!
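As a hypothetical sketch (again with illustrative index, sourcetype, and threshold values), the investigative search above could be tightened into something you'd save as a scheduled alert:

```spl
index=web sourcetype=access_combined status>=500 earliest=-15m
| stats count
| where count > 100
```

Scheduled every 15 minutes and set to trigger when results are returned, a search like this turns your hard-won investigation into monitoring that fires only when the known symptom pattern recurs.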

Attacking: You've created an actionable alert or even a scripted or automated response. Either way, when it's triggered, you're attacking. Now that you know the issue is occurring AND what the cause is, you can work to resolve it.

Now let's say the same thing, but oriented by feature:

                           Root Cause Unknown          Root Cause Known

    Issue Exists, Unaware  dashboards & glass tables   scheduled searches & alerts

    Issue Exists, Aware    ad-hoc searching (SPL)      actionable alerts & automated responses

Applying Concepts

If someone is watching a dashboard for known symptoms, try a scheduled search. If there are alerts that are informational, try using a dashboard. If you're sharing SPL, save it as a report. And lastly, reserve your SPL for forensics.

These are not ultimatums, but rather practices against which you can sanity-check your approach and align features with your current goal.

I'll close with a song that comes to mind, The Splunker by Kinnie Rojyrz:

"You got to know when to search 'em,
Know when to alert 'em,
Know when to dashboard,
And know when to run."

Posted by Burch Simon

Burch is what happens when you mix a passion for technology with a love for performing comedy. If you find a Burch in the wild, engage lovingly with discussions of Splunk Best Practices and your hardest SPL challenges.
