TIPS & TRICKS

Exporting Large Results Sets to CSV

You want to get data out of Splunk. So you do the search you want and create the table you want in the search app. The results are hundreds of thousands of rows, which is good. So you click on the Export button and download the results to CSV. When you open the file, you see 50,000 rows. Is this a common problem? Not really. It’s a large enough result set that most people want to keep it in Splunk for analysis. However, there are times when such a large export is required. You really don’t want to log on to the Splunk server to get it either. So how do you progress?

I recently bumped into this problem myself while working on a new app. When developing a new app, we don’t work on production data. That’s a bad idea. However, we have an event generator that allows us to replay log files into our test environment so that we have a large data set to work with. In my case, this was a ULS log set from a SharePoint farm. One days log can be several hundred megabytes of data. However, the production data is in California and I was in Australia.

Fortunately, we do have tools available to do this. One of these is the RESTful interface to the backend of the search head. This is great for developers (and if you are one of these developers, then head over to dev.splunk.com for information on our SDK interfaces for .NET, Java, Python, Ruby and JavaScript). It’s also great for these larger export jobs and automation.

My tool of choice for this is curl. This is standard issue on Linux systems, but there are downloads available for Windows as well, which is my platform of choice. (Note, if you are doing this in PowerShell, you will need to remove the alias for curl, which uses Invoke-WebMethod instead. The Invoke-WebMethod is not the same thing at all!) So how do we do this? First off, figure out your search. In my case, I want a particular sourcetype for one day – let’s say 2 days ago. So here is my search:

index=mssharepoint sourcetype=MSSharePoint:2013:ULSAudit host=SP-APP01 | table index,host,source,sourcetype,_raw

Note that I am explicitly setting the fields I want and putting the results into a table. I want to store the results of this search into a file called sp-app01.csv. The REST endpoint we are going to use is the /search/jobs/export endpoint, and you use it like this:

curl -k -u admin:mypassword https://myhost:8089/services/search/jobs/export --data-urlencode search='search index=mssharepoint sourcetype=MSSharePoint:2013:ULSAudit host=SP-APP01 | table index,host,source,sourcetype,_raw' -d output_mode=csv -d earliest_time='-2d@d' -d latest_time='-1d@d' -o sp-app01.csv

If you leave off the -o option you will get the output streamed to your console – given the amount of data you are grabbing, this is not optimal. If you include the -o, then you get a nicely formatted display of progress.

  % Total    % Received % Xferd    Time    Time     Time  Current
                                   Total   Spent    Left  Speed
100  369M    0  369M    0   187 --:--:--  0:11:42 --:--:--     0

The numbers update during the process telling you how much has been downloaded and the speed at which the data is coming across. But 12 minutes later, I have a 369Mb file with well over 500,000 lines of data. However, it’s all formatted in the CSV file. If you are doing this for event generation, then this isn’t the end of the line. You need to remove the header, reverse the lines and then add the header back in. I find the head, tail and tac utilities that are provided on Linux systems useful here:

head -1 sp-app01.csv > header.txt
tail -n +2 sp-app01.csv > sp-app01.xxx
tac sp-app01.xxx > sp-app01.yyy
cat header.txt sp-app01.yyy > sp-app01.csv

If you don’t have a Linux system handy, then you can do the same things using the GNU tools for Windows. Now that file is ready for event generation.

Use this facility when you want to export a large amount of data from Splunk across the network.

Splunk
Posted by

Splunk

Join the Discussion