TIPS & TRICKS

Maintaining State of the Union

Fellow Splunkers,

Well, it’s almost that time of year again already – the State of the Union address is scheduled for January 25th, 2011.

My predictions for the speech are as follows:

  • Things are getting better  :]
  • There are still many challenges to overcome  :[
  • Inspirational story 1, with subject of said story in attendance to the left of Mrs. President  ;_;
  • Inspirational story 2, with subject of said story in attendance to the right of Mrs. President  ;_;
  • Wrap it up, B(arack) [comedycentral.com] :&

However, I would actually like to discuss a different kind of “state” – one that is more directly related to Splunk’s built-in capabilities (though I haven’t given up on my ‘Anti-unemployment’ or ‘Budget Balancing’ apps – keep checking Splunkbase!).

The reason for the season

The inspiration for this post was an email from David Carasso, Splunk’s self-appointed ‘Chief Mind’ (who should probably blog a little more).  Carasso has been poring over all the Q&A on Splunk Answers in an attempt not only to improve quality but also to determine what the most common search-related questions and answers are.

In his aforementioned email to me, he pointed out several flaws in my answer to this Splunk Answers question about determining the state of firewall connections using only the firewall logs and the Splunk search language.  Carasso could have eased the blow via a compliment sandwich, but apparently we don’t stock kosher sandwiches in the Splunk office’s kitchen.

While I conceded some of his points and continued to argue others, one thing was abundantly clear:

The best way to maintain state in Splunk is to use lookups

Why is that the case? Consider that some connections may have arbitrarily long durations that may exceed a normal search’s ‘earliest time’.

For example, a business-to-business link, or a remote employee holding a connection open until a key operation completes successfully, may stay open for days, weeks, or perhaps months.  Using just the search language, we would need to set ‘earliest time’ so far back to catch these connections that the search’s overall performance would almost certainly suffer.

Let’s set up a lookup using the example data from the Splunk Answers question above, repeated below, to enable us to find connections on the firewall that have been opened but have not yet been closed:

Nov 4 17:42:38 192.168.150.1 id=firewall sn=xxxxxxxxx time="2010-11-04 17:42:42" fw=192.168.254.5 pri=6 c=1024 m=537 msg="Connection Closed" n=0 src=192.168.150.93:1637:X0 dst=192.168.100.10:4440:X2 proto=tcp/4440 sent=2505 rcvd=677 host=192.168.150.1
Nov 4 17:41:53 192.168.150.1 id=firewall sn=xxxxxxxxx time="2010-11-04 17:41:56" fw=192.168.254.5 pri=6 c=262144 m=98 msg="Connection Opened" n=0 src=192.168.150.93:1637:X0 dst=192.168.100.10:4440:X2 proto=tcp/4440

We are assuming that:

  • we are logged in to Splunk as a user that is a member of the ‘Power’ role (or have equivalent or higher capabilities)
  • a connection is defined as a unique source IP connecting to a unique destination IP on a unique protocol/port
  • msg="Connection Opened" authoritatively signifies the first event of a connection and is logged for every connection
  • msg="Connection Closed" authoritatively signifies the last event of a connection and is logged for every connection
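
To make the second assumption concrete, here is a minimal Python sketch (not part of the Splunk setup) that pulls the key=value pairs out of one of the sample events and builds the source IP / destination IP / protocol key. The helper names `parse_event` and `connection_key` are made up for illustration, and dropping the port and interface from src and dst is an assumption based on how a connection is defined above.

```python
import re

# Hypothetical helpers for illustration only; Splunk does its own field extraction.
KV_PATTERN = re.compile(r'(\w+)=("[^"]*"|\S+)')

def parse_event(line):
    """Return a dict of the key=value pairs found in one firewall log line."""
    fields = {}
    for key, value in KV_PATTERN.findall(line):
        fields[key] = value.strip('"')
    return fields

def connection_key(fields):
    """Build the (source IP, destination IP, protocol/port) key for an event."""
    # src and dst look like ip:port:interface in the raw event; keep only the IP
    # (an assumption that matches the definition of a connection above).
    src_ip = fields["src"].split(":")[0]
    dst_ip = fields["dst"].split(":")[0]
    return (src_ip, dst_ip, fields["proto"])

sample = ('Nov 4 17:41:53 192.168.150.1 id=firewall sn=xxxxxxxxx '
          'time="2010-11-04 17:41:56" fw=192.168.254.5 pri=6 c=262144 m=98 '
          'msg="Connection Opened" n=0 src=192.168.150.93:1637:X0 '
          'dst=192.168.100.10:4440:X2 proto=tcp/4440')

event = parse_event(sample)
print(event["msg"], connection_key(event))
```

An "opened" event and a "closed" event belong to the same connection when they produce the same key.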

My intervals are lagging

First, let’s decide on the appropriate interval to search for firewall connections as well as the lag that we want to allow for in our infrastructure.  In Splunk 4.2, real-time alerting might keep us from having to decide these things, but in Splunk 4.1.6 we need to decide how frequently to run the saved search that is looking for new connection information, as well as how much time to lag behind real time to account for delays in the firewall getting information across the network to Splunk.

As in most cases, I like a one minute interval and one minute lag for the following reasons:

  • I want to be kept as up to date as possible on what is going on (one minute interval)
  • I don’t mind waiting an extra minute to better assure completeness and accuracy (one minute lag)

How does this translate into Splunk settings?

  • To achieve a one minute interval, I will schedule a search to run every minute.
  • To allow for one minute of lag, I will offset the ‘earliest time’ and ‘latest time’ of my scheduled search one minute into the past.
    • In the parlance of the Splunk search language: earliest=-2m latest=-1m
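
If the arithmetic behind those two offsets is not obvious, this small Python sketch spells it out. The function name `search_window` and the example run time are assumptions for illustration; Splunk resolves the relative time modifiers itself.

```python
from datetime import datetime, timedelta

def search_window(now, interval_minutes=1, lag_minutes=1):
    """Return (earliest, latest) for a scheduled run that starts at `now`."""
    # The window ends `lag` minutes before now and spans one interval.
    latest = now - timedelta(minutes=lag_minutes)
    earliest = latest - timedelta(minutes=interval_minutes)
    return earliest, latest

# Example run time chosen arbitrarily for the sketch.
now = datetime(2011, 1, 25, 21, 0, 0)
earliest, latest = search_window(now)
print(earliest, latest)
```

Because each run covers exactly one interval and the runs are one interval apart, consecutive windows line up end to end, so no minute is searched twice and none is skipped (as long as the scheduler fires on time).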

Make a pretty search

Now that we’ve decided upon an appropriate interval and lag, we need to come up with a search that will give us the appropriate tabular output that we can store in our lookup.

The search I proposed as the solution on Splunk Answers is a good start, but we need to modify it a bit just to identify when a connection is opened:

sourcetype=sonicwall msg="Connection Opened" earliest=-2m latest=-1m
| stats count by src dst proto _time
| fields src dst proto _time

This will yield results as follows:

_time,src,dst,"proto"
1294769088,192.168.1.123,216.52.242.86,"tcp/443"
1294769077,192.168.1.20,69.63.189.16,"tcp/80"

CSV bizness, bludclart

Sweet!  Now, we need to create a lookup in Splunk and then modify and schedule the above search to continuously append to the lookup.

First, let’s define our lookup in Splunk.  Assuming I am in the ‘search’ app, I would use my favorite text editor to edit $SPLUNK_HOME/etc/apps/search/local/transforms.conf as follows:

[firewall_open_connections]
filename=firewall_open_connections.csv

Go ahead and restart Splunk at this point.

Next, let’s modify the above search to output the results to a CSV lookup file in the current app that I am working in.  I’m just going to add one more search command – outputlookup:

sourcetype=sonicwall msg="Connection Opened" earliest=-2m latest=-1m
| stats count by src dst proto _time
| fields src dst proto _time
| outputlookup firewall_open_connections.csv

Assuming I am in the ‘search’ app, running the above search writes the results to $SPLUNK_HOME/etc/apps/search/lookups/firewall_open_connections.csv.

This is perfect for our initial lookup, but if you try to run the search again, it will blow away the existing lookup!  How do we keep the lookup up to date?

Upup my lookdate

There are two things that we want to accomplish when we update our lookup:

  • append new open connections to the lookup
  • purge open connections from the lookup that have been subsequently closed

To append the new open connections to the existing lookup, we need to invoke the inputlookup command:

sourcetype=sonicwall msg="Connection Opened" earliest=-2m latest=-1m
| stats first(_time) as _time by src dst proto
| fields src dst proto _time
| inputlookup append=t firewall_open_connections
| outputlookup firewall_open_connections

This tells Splunk:

  • find the “Connection Opened” events that occurred between two and one minutes ago
  • use stats to effectively de-duplicate each unique combination of src, dst, proto and the most recent _time into a tabular format
  • use fields to retain only the src, dst, proto, and _time fields
  • use inputlookup to append the existing firewall_open_connections.csv lookup information
  • use outputlookup to output all these results to firewall_open_connections.csv

Now, we need to figure out how to purge open connections from the lookup that have been closed.  This requires us to expand the scope of our search to include “Connection Closed” events so that we can match them up with the corresponding “Connection Opened” events in our lookup.

sourcetype=sonicwall msg="Connection Opened" OR msg="Connection Closed" earliest=-2m latest=-1m
| stats first(_time) as _time by src dst proto msg
| inputlookup append=t firewall_open_connections
| fillnull msg value="Connection Opened"
| eval closed=if(msg="Connection Closed",_time,"1")
| eval open=if(msg="Connection Opened",_time,"1")
| stats first(open) as open first(closed) as closed by src dst proto
| where open > closed
| rename open as _time
| fields src dst proto _time
| outputlookup firewall_open_connections

Whoa – what happened there?  Well, we told Splunk:

  • find both the “Connection Opened” and “Connection Closed” events that occurred between two and one minutes ago
  • use stats to effectively de-duplicate unique combinations of src, dst, proto, msg, and the most recent _time into a tabular format
  • use inputlookup to append the rows of the existing ‘firewall_open_connections’ lookup to the result set
  • use the fillnull command to insert the field “msg” with the value “Connection Opened” in events that have no msg field (i.e. the rows we just appended from the lookup)
  • use the eval command to create two new fields:
    • create a new field called “open”; if msg="Connection Opened", set the value as the epoch timestamp of the event; otherwise, set the value to “1”
    • create a new field called “closed”; if msg="Connection Closed", set the value as the epoch timestamp of the event; otherwise, set the value to “1”
  • use stats to find the most recent ‘open’ and ‘closed’ for each unique combination of src, dst, and proto
  • use where to filter out rows where ‘closed’ is greater than ‘open’ (i.e. rows that represent closed connections – pruning complete!)
  • use rename to set the value of ‘_time’ to the value of ‘open’ so that when we write results to our lookup the ‘_time’ field exists and has the proper values
  • use fields to retain only the src, dst, proto, and _time fields
  • use outputlookup to output all these results to ‘firewall_open_connections’
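
For anyone who finds the logic easier to follow outside the search language, here is a rough Python sketch of the same append-and-purge idea. It is not how Splunk executes the search: the function name `update_open_connections` is made up, the third connection in the example data is invented, and the sketch takes the maximum timestamp per connection (where the search uses first()) so that row order does not matter.

```python
def update_open_connections(lookup_rows, new_events):
    """Return the refreshed list of connections that are still open.

    lookup_rows: dicts with src, dst, proto, _time (connections believed open)
    new_events:  dicts with src, dst, proto, _time, msg from the latest window
    """
    # Rows read back from the lookup carry no msg, so default them to
    # "Connection Opened" (the role fillnull plays in the search).
    combined = list(new_events) + [
        {"msg": "Connection Opened", **row} for row in lookup_rows
    ]

    latest = {}  # connection key -> most recent "open" and "closed" epoch times
    for row in combined:
        key = (row["src"], row["dst"], row["proto"])
        state = latest.setdefault(key, {"open": 1, "closed": 1})
        slot = "closed" if row["msg"] == "Connection Closed" else "open"
        state[slot] = max(state[slot], row["_time"])

    # Keep only connections whose latest open is newer than their latest close.
    return [
        {"src": src, "dst": dst, "proto": proto, "_time": state["open"]}
        for (src, dst, proto), state in latest.items()
        if state["open"] > state["closed"]
    ]

# Two rows already in the lookup (taken from the sample results above).
lookup = [
    {"src": "192.168.1.123", "dst": "216.52.242.86", "proto": "tcp/443", "_time": 1294769088},
    {"src": "192.168.1.20", "dst": "69.63.189.16", "proto": "tcp/80", "_time": 1294769077},
]
# Events from the latest window; the tcp/22 connection is invented for the example.
events = [
    {"src": "192.168.1.20", "dst": "69.63.189.16", "proto": "tcp/80",
     "_time": 1294769130, "msg": "Connection Closed"},
    {"src": "192.168.1.7", "dst": "10.0.0.5", "proto": "tcp/22",
     "_time": 1294769140, "msg": "Connection Opened"},
]

still_open = update_open_connections(lookup, events)
for row in still_open:
    print(row)
```

In this example the tcp/80 connection is pruned because its close is newer than its open, while the existing tcp/443 connection and the newly opened tcp/22 connection are kept.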

The 37th Chamber of Splunk-Fu

Now we have defined the lookup and crafted the lookup-updating search – no mean feat, to be sure!

However, we still need to schedule the lookup-updating search and then figure out how to use the lookup to identify open connections.

To schedule the search:

  • go to Manager > Searches and Reports and click on ‘New’
  • give the search a ‘Name’ that satisfies your organization’s naming convention
    • for example, here at Splunk I would name the search ‘__update_lookup_firewall_open_connections’
  • input the above search, making sure you omit the ‘earliest=-2m latest=-1m’
  • input ‘-2m’ under ‘Start time’ and ‘-1m’ under ‘Finish time’
  • check the box next to ‘Schedule this search’
  • under ‘Run every’, choose ‘minute’
  • click ‘Save’

After you do this, keep an eye on the search to ensure that performance is acceptable (i.e. that it completes in less than 60 seconds) by navigating to Manager > Searches and Reports and clicking on the ‘View Recent’ link to the right of the search.

Finally, we want to make use of this lookup to find open connections, right?  That was the whole point, after all.  Try this search:

| inputlookup firewall_open_connections | sort -_time

Now you have a tabular listing of all open connections in a time-descending order.  Are you feeling presidential yet?  If not, save the search, use it in a dashboard or as an alert, and try to keep the following in mind:

  • Things are getting better (with Splunk)
  • There are still many challenges to overcome (with Splunk)
  • Cool story, bro!

With all that said, consider this post wrapped up.

Posted by Alex Raitz