Splunk is really good at knowing what has been read and what is new when dealing with machine data that is on disk (it tracks this in the fishbucket). However, a lot of machine data never exists on disk, such as data from third-party APIs or in-memory data. To get at this data, we often use scripted or modular inputs.
When dealing with scripted or modular inputs, it is important to pick up only the new events or information since the last time the input ran. You don’t want to keep indexing the same events over and over, because this causes index bloat and performance issues. So, you need a way of knowing what you have already read and where to pick up. There are a couple of ways to do this, and where your input runs often dictates which method you should use.
Running Inputs from a Splunk Indexer or Search Head
When querying a third-party API (especially a SaaS API), it may make sense to run the input on a search head or indexer. There are fewer moving parts involved, and there are built-in Splunk script methods you can use.
Step 1 – Create a default .conf file
The first thing we need to do is create a .conf file to store our parameters. This file can optionally define any defaults. I called my file my_app.conf, but you can call it whatever you want.
$SPLUNK_HOME/etc/apps/<your_app>/default/my_app.conf
Here are the contents of my_app.conf:
[API_data]
last_record_read =
Note: after the script runs, updates will be made in the following location (notice it is in local instead of default):
$SPLUNK_HOME/etc/apps/<your_app>/local/my_app.conf
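For illustration, if the script checkpoints a value of 400 (an example value, not anything Splunk produces on its own), the local copy of the file would end up looking like this:

```
[API_data]
last_record_read = 400
```

Splunk layers local over default, so the empty default value is used only until the script writes its first checkpoint.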
Step 2 – Create a script to query data
The script should be placed in $SPLUNK_HOME/etc/apps/<your_app>/bin/scripts.
Here is an example in Python using Splunk Entities:
import splunk.entity as en
import splunk, sys, re, time, logging, logging.handlers, os

# Constants
APP_NAME = "my_app"
CONF_FILE = "my_app"
STANZA_NAME = "API_data"

# Set up script logging
def getExceptionLogger():
    logger = logging.getLogger(APP_NAME)
    SPLUNK_HOME = os.environ['SPLUNK_HOME']
    LOGGING_DEFAULT_CONFIG_FILE = os.path.join(SPLUNK_HOME, 'etc', 'log.cfg')
    LOGGING_LOCAL_CONFIG_FILE = os.path.join(SPLUNK_HOME, 'etc', 'log-local.cfg')
    LOGGING_STANZA_NAME = 'python'
    LOGGING_FILE_NAME = APP_NAME + '.log'
    BASE_LOG_PATH = os.path.join('var', 'log', 'splunk')
    LOGGING_FORMAT = "%(asctime)s %(levelname)-s\t%(module)s:%(lineno)d - %(message)s"
    splunk_log_handler = logging.handlers.RotatingFileHandler(os.path.join(SPLUNK_HOME, BASE_LOG_PATH, LOGGING_FILE_NAME), mode='a')
    splunk_log_handler.setFormatter(logging.Formatter(LOGGING_FORMAT))
    logger.addHandler(splunk_log_handler)
    splunk.setupSplunkLogger(logger, LOGGING_DEFAULT_CONFIG_FILE, LOGGING_LOCAL_CONFIG_FILE, LOGGING_STANZA_NAME)
    return logger

# Write a key/value pair back to the stanza (persisted to local/my_app.conf)
def updateStanza(key, value):
    conf_stanza[key] = value
    en.setEntity(conf_stanza, sessionKey=sessionKey)

def run_script():
    try:
        # Get the last record read from the conf file
        last_record_read = conf_stanza["last_record_read"]
        # Perform your connection to the API here.
        # You can use the variable last_record_read to pass to the API call.
        # Write to stdout (i.e. print) the records returned from the API however you want them to show up in the index.
        # Once you have written your records to stdout, update the conf file using a time stamp or value from the API call.
        # In this case, we are just making up a value, but this would normally be returned from the API.
        last_record_read_from_API = 400
        updateStanza(key="last_record_read", value=last_record_read_from_API)
    except IOError as err:
        logger.error('ERROR - %s' % str(err))

conf_stanza = None

if __name__ == '__main__':
    logger = getExceptionLogger()
    logger.info("Script started")
    # Get the sessionKey from splunkd
    # Note: inputs.conf should specify passAuth = splunk-system-user
    sk = sys.stdin.readline().strip()
    sessionKey = re.sub(r'sessionKey=', "", sk)
    try:
        # Get the stanza key/value pairs
        conf_stanza = en.getEntity('configs/conf-' + CONF_FILE, STANZA_NAME, namespace=APP_NAME, owner='nobody', sessionKey=sessionKey)
        run_script()
    except IOError as err:
        logger.error('ERROR - %s' % str(err))
Step 3 – Add the script to inputs.conf
Once you have your script written, the script needs to be added to inputs.conf. Here is an excerpt:
[script://./bin/scripts/test.py]
disabled = 0
interval = 300
passAuth = splunk-system-user
The passAuth setting is very important: it is what causes splunkd to hand the script a session key on stdin.
Running Inputs on Universal Forwarders
When you have a lot of universal forwarders running inputs and forwarding data to indexers, it is impractical to keep up with each and every forwarder’s position from a search head or indexer. Therefore, it is better to have the forwarder itself keep a position file to mark where it is in the running lifecycle. Unfortunately, Python does not ship with Universal Forwarders, so the Entity method used above will not work. You are free to implement this type of position placeholder using any scripting language the OS understands. For an example using PowerShell, refer to the blog post about Measuring Windows Group Policy Logon Performance.
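The position file itself is a simple idea: read the checkpoint on startup, fetch only newer records, and write the checkpoint back when done. Here is a hedged sketch of that pattern in Python for illustration only (the file path and helper names are made up, and on an actual universal forwarder you would express the same logic in an OS scripting language, since Python is not available there):

```python
import os

POSITION_FILE = 'my_input.pos'  # hypothetical checkpoint location

def read_position(default='0'):
    # First run: no position file yet, so start from a default.
    if not os.path.exists(POSITION_FILE):
        return default
    with open(POSITION_FILE) as f:
        return f.read().strip()

def write_position(value):
    # Write to a temp file and rename so a crash mid-write
    # cannot leave a corrupt checkpoint behind.
    tmp = POSITION_FILE + '.tmp'
    with open(tmp, 'w') as f:
        f.write(str(value))
    os.replace(tmp, POSITION_FILE)

last = read_position()
# ...query only records newer than `last` and print them to stdout...
write_position(400)
```

The atomic rename is the important design choice: the forwarder may be restarted at any time, and a half-written position file would cause re-indexing or data loss on the next run.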
Specifically, here is the code that utilized a position file: