TIPS & TRICKS

Enriching Data with DB Lookups (Part 2)

Today, I’m writing as a guest blogger for Bob Fox to create part 2 of enriching data with the Splunk lookup command. Bob created part 1, which walks through a detailed example of using the lookup command to enrich data from external CSV files. Today’s topic builds on that usage, showing how to enrich indexed data at search time using an external database.

To begin with, it is a fact of life that some event or log data may not reside in files, may not be broadcast on network ports, and may not even be imported via a scripted input. For legacy reasons, this data may reside in a database. The often-cited use case in this scenario is that the user would like to correlate data that resides in Splunk as indexed events with related fields that reside in a database. Even if a scripted input could capture the data within the database and have it indexed within Splunk, this leaves redundant data that has been indexed twice purely for the sake of heterogeneous correlation (a join between a field in data indexed within Splunk and a field in data located in the database). Many people would rather not index data once within a database and again via extraction methods that consume disk space in a secondary index.

Examples of use cases where data resides within Splunk and related data resides within an external database are easy to find. For instance, a security investigator looking at events in Splunk may find that a particular user has done something questionable. The investigator may then want to look up the user’s address and phone number, which reside in a relational database, and using search within Splunk to get to this database data quickly is useful. As another example, a proprietary system might log all of its access data, including source IP addresses, in a relational database, while the company’s firewall data is indexed within Splunk. A correlation between the two types of data within Splunk, using the IP address as the common key, should be possible. With these types of correlation in mind, I’ll go over the steps for setting up an example and provide a link to download it.

First, decide which field in your indexed Splunk data will be used to correlate with and enrich data from an outside source. In my example, I have weather data, and the field that I want to use is the city field. For purposes of illustration, my data looks like this:

Jul 27 08:35:09...city=Nice...

Splunk will automatically extract the city field with the value Nice at search time. What I’m interested in is finding the country a city is located in, using an external database for the correlation. Again, for simplicity, I’ll use a terse database table.

city       country
Nice       France
Cambridge  UK
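If you want to follow along without a full database server, the table above can be built in a few lines of Python. This is an illustrative sketch only: it uses an in-memory SQLite database rather than MySQL (which the downloadable example uses), and the table name city_country is my own invention.

```python
import sqlite3

# Illustrative setup: build the lookup table in an in-memory SQLite
# database. The downloadable Splunkbase example targets MySQL; the
# table and column names here are assumptions for the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city_country (city TEXT PRIMARY KEY, country TEXT)")
conn.executemany(
    "INSERT INTO city_country (city, country) VALUES (?, ?)",
    [("Nice", "France"), ("Cambridge", "UK")],
)
conn.commit()
```

In a real deployment the table would live in your existing database, and the lookup script would simply query it.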

Now, let’s move on to the Splunk setup. You’ll need to add an entry for the lookup to your props.conf configuration file, just as in part 1. Mine looks like this:

[weather_entry]
lookup_table = countrylookup city OUTPUT country

Next, you’ll need to define what countrylookup does in your transforms.conf file. In this case, it will call an external Python program.

[countrylookup]
external_cmd = countrylookup.py city country
external_type = python
fields_list = city, country

The external command that you write, countrylookup.py, should reside in the bin directory of your application. The city and country terms next to it in the configuration file are the field name headers used to produce a dynamic CSV table that is sent to the script, which replies through a Python CSV standard output writer. The program reads its city field values as CSV from standard input, calls SQL to find the corresponding country, and produces aggregate CSV output that pairs each city with its correlated country. The complete example, with setup instructions, can be downloaded from Splunkbase. My example uses the MySQL database, but you are free to change the code to use whatever database you require, as long as a Python module exists to access that database. The final touch is to show you what the Splunk search looks like to get the new country field for my example.
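While the full countrylookup.py ships with the download, its general shape can be sketched as follows. This is not the shipped script: it assumes the hypothetical city_country table from earlier, uses SQLite placeholder syntax for illustration instead of MySQL, and the function names are mine. The key idea is faithful to the protocol described above: read a CSV with the fields_list columns from standard input, fill in the empty country column, and write the CSV back to standard output.

```python
import csv
import sys
import sqlite3


def lookup_country(conn, city):
    """Return the country for a city, or an empty string if unknown."""
    row = conn.execute(
        "SELECT country FROM city_country WHERE city = ?", (city,)
    ).fetchone()
    return row[0] if row else ""


def enrich(conn, infile, outfile):
    # Splunk sends a CSV whose header row lists the fields from
    # fields_list; we fill in the missing country values and echo
    # the enriched CSV back.
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    for record in reader:
        if record.get("city") and not record.get("country"):
            record["country"] = lookup_country(conn, record["city"])
        writer.writerow(record)


# In the deployed script this would run against the real database,
# e.g. enrich(MySQLdb.connect(...), sys.stdin, sys.stdout)
```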

sourcetype="weather_entry" city=Nice | xmlkv | lookup countrylookup city OUTPUT country

This will return France in a new field called country. There are a few design considerations that need to be addressed before I conclude today’s entry.

  • Add a database index on the field that is used to correlate between Splunk and the database.
  • Have the Python program connect to a long-running program (an application server) that maintains a connection pool to the database, to avoid reconnecting to the database on each invocation of the lookup command.
  • If you know beforehand that the number of uniquely matched rows in the database will only be a few hundred, such as the number of unique cities in my case, consider building an internal cache to avoid accessing the database for each corresponding select call. Splunk’s iplocation command does this, and the source code for iplocation.py is included in your download of Splunk.
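The caching suggestion in the last bullet can be sketched like this. Again, this is an illustration under my earlier assumptions (a SQLite connection and a city_country table); the class name is invented.

```python
import sqlite3


class CachedCountryLookup:
    """Memoize country lookups so each distinct city hits the
    database at most once per run of the lookup script."""

    def __init__(self, conn):
        self.conn = conn
        self.cache = {}

    def __call__(self, city):
        if city not in self.cache:
            row = self.conn.execute(
                "SELECT country FROM city_country WHERE city = ?", (city,)
            ).fetchone()
            # Cache misses as "" so unknown cities are not re-queried.
            self.cache[city] = row[0] if row else ""
        return self.cache[city]
```

With only a few hundred distinct keys, the cache stays small and the script issues far fewer SELECTs when many events share the same city.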

To wrap up, although we didn’t discuss the user-written Python lookup program in detail, its source code is part of the Splunkbase.com download to provide you with one example of how it can be written. The Splunk distribution also ships with external_lookup.py, which has a similar structure for taking CSV input from Splunk via standard input and producing CSV on standard output. I hope today’s entry is useful for these types of use cases.

Posted by

Nimish Doshi

Nimish Doshi is a Principal Systems Engineer with Splunk and has been active in Splunkbase content creation and writing Splunk blog entries for a number of years.
