Documentation: 3.1
This page last updated: 11/12/07 03:11pm

How input configuration works

Splunk's data inputs are specified in inputs.conf. In most cases, you will not need to modify this file, because the required information is written when you configure data inputs through the Splunk CLI or SplunkWeb. For finer control, however, you may wish to configure settings directly in inputs.conf. Changes made via SplunkWeb or the Splunk CLI are stored in $SPLUNK_HOME/etc/bundles/local/inputs.conf.

As Splunk processes files, it assigns default values for sourcetype, hostname, and index for each file. You can override these settings when you define inputs, or you can modify them later.

Data input types

Data inputs have characteristics that are independent of how the inputs are defined. A tail input type behaves the same whether you add it in the Splunk CLI or SplunkWeb. This section describes the purpose, behavior, and rules/restrictions of the Splunk data input types.

The next section describes the mechanics of adding the data inputs via the various Splunk interfaces and how to modify indexing settings.

Files and directories

Data inputs can come from files and directories. Data in files can be processed in live or batch mode. Live input is for active log files and is handled through Splunk's tail processor. Batch input is for closed, archived data, and batch files are handled through the Upload or Watch/Batch processors.

Tailing a file

Splunk's tail behaves like the UNIX tail command. Specify a path to a file or directory whose contents should be indexed by the Splunk server, and Splunk will watch and consume any new input. If subdirectories exist, Splunk will recursively examine them for log files. If new files appear in a tailed directory, Splunk will add them to the index.

Please note: Starting with Splunk 3.0.2, the tail input method can process files like UNIX tail -f. Specifically, you can have tail start at the end of a file and wait for new input, rather than consume the entire file and then wait for new input. This option is specified in inputs.conf with the followTail attribute. A value of 1 tells Splunk to read from the end of the file. The default is 0, which reads the entire file. This option is ignored if the file has ever been indexed by Splunk.
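
The followTail rule can be sketched as a small decision function. This is a minimal illustration of the behavior described above, not Splunk's implementation; the function name starting_offset and its parameters are hypothetical, and resuming a previously indexed file at its saved offset is an assumption based on the restart behavior described for tailing.

```python
def starting_offset(file_size, follow_tail, already_indexed_offset=None):
    """Pick the byte offset at which reading should begin.

    Hypothetical helper for illustration only.
    already_indexed_offset is None for a file that has never been indexed.
    """
    if already_indexed_offset is not None:
        # The file was indexed before, so followTail is ignored
        # (assumption: reading resumes at the saved offset).
        return already_indexed_offset
    if follow_tail == 1:
        # Like `tail -f`: skip existing content and wait for new input.
        return file_size
    # Default (followTail = 0): consume the entire file first.
    return 0
```

For a 500-byte file that has never been indexed, followTail = 0 starts at offset 0 and followTail = 1 starts at offset 500.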

In addition, when tailing a file for input:

  • files can be opened or closed.
  • files or directories can be included or excluded through the use of file whitelists and blacklists.
  • if processing is discontinued for any reason, the Splunk server will continue processing from where it left off, once it restarts.
  • log file rotation can occur while Splunk is tailing a file. It will detect the rotation and will not process the renamed file again.
    • please note: log rotation does not currently work while tailing over SMB mounts.
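
The "continue from where it left off" behavior in the list above can be pictured as reading from a saved byte offset. The sketch below is an assumption about how such a checkpoint might work, not a description of Splunk's tail processor; read_new_lines and its checkpoint argument are hypothetical names.

```python
def read_new_lines(path, checkpoint=0):
    """Return (new_lines, new_checkpoint) for data appended since checkpoint.

    Hypothetical helper: the caller is expected to persist new_checkpoint
    so that a restart can resume at the same position.
    """
    with open(path, "rb") as f:
        f.seek(checkpoint)
        data = f.read()
    # Consume complete lines only; a partial last line waits for the next pass.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, checkpoint + end
```

Calling the function again with the returned checkpoint yields only the lines written since the previous call.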

When tailing a directory for input:

  • ensure the sourcetype is set to Automatic. If the directory contains multiple files of different formats, do not set a value for the sourcetype manually. Manually setting a sourcetype forces a single sourcetype for all files in that directory, and results in unpredictable indexing behavior.

Please note: If the specified file or directory does not exist, the Splunk server will not check to see if it is created later. Splunk only checks for files and directories each time the Splunk server starts (or is restarted). If you don't want to restart the server, be sure to explicitly add new files as inputs when they become available. When tailing a file, the entire path (directory plus filename) must not exceed 1024 characters.

Batch upload and watch

Splunk has a batch processing module. It watches any specified directory on the local Splunk server's file system and then processes the entirety of any new file that appears. You can also upload archived files directly into Splunk for analysis. If necessary, Splunk will unpack and uncompress a file before indexing. Keep in mind that Splunk will need adequate disk space to uncompress these files, and that this processing can take more time than processing a live or uncompressed file.
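
The uncompress-before-indexing step, and the extra disk space it needs, can be illustrated with a short sketch. It handles gzip only and is not Splunk's batch processor: the source does not say which archive formats are supported, and prepare_for_indexing and work_path are hypothetical names.

```python
import gzip
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def prepare_for_indexing(src_path, work_path):
    """Copy src_path to work_path, decompressing it first if it is gzip data.

    The uncompressed copy is written to disk, which is why adequate
    free space is needed.
    """
    with open(src_path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(src_path, "rb") as src, open(work_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return work_path
```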

By default, Splunk's batch processor watches the directory $SPLUNK_HOME/var/spool/splunk. You can set up your own watch directory as well.

Please note: This method will not keep watch on the files it has already seen, so it's not designed for live logfiles -- just rotated archive copies.

In addition, when batch uploading or watching, Splunk can:

  • delete files
    • if you are copying files to the Splunk host and have no need to keep them on the server.
  • make a copy of files
    • if you have mounted an existing log archive filesystem to your Splunk host via NFS, SMB or other network file sharing protocol.
  • use a symlink
    • if your Splunk host is also your primary central log archive so all the archive files are local, or if you are mounting your existing log archive file system to your Splunk host via SAN.

FIFO queues

A FIFO (AKA named pipe) is a queue of data maintained in a Unix host's memory. It can be accessed like a file and log messages can be written to it. When choosing the FIFO data input method consider the following:

  • FIFO queues can be a high performance method to get data into Splunk, since the system does not have the I/O burden of writing to both a file on disk and Splunk's index on disk (like the tailing method).
  • FIFO access is very fast, but FIFOs are vulnerable when there are processing disruptions because the in-memory data may be lost.
  • you do not have to worry about log file rotation and archiving because the data goes straight from the logging application into Splunk via the queue. There is nothing on disk to manage except for Splunk's index.
  • most syslog implementations can be configured to write to FIFO queues in addition to or instead of files.
  • you might be able to get other applications to write to FIFO queues instead of files by just changing a logfile name parameter from a filename to a defined FIFO queue.
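
To show what "accessed like a file" means in practice, here is a minimal sketch of an application writing to a named pipe and a consumer reading from it on a Unix host. The function names and the pipe path are hypothetical, and the reader stands in for whatever process consumes the queue; it is not Splunk's FIFO input.

```python
import os

def ensure_fifo(fifo_path):
    """Create the named pipe if it does not already exist (Unix only)."""
    if not os.path.exists(fifo_path):
        os.mkfifo(fifo_path)

def write_event_to_fifo(fifo_path, message):
    """Write one log line to the pipe.

    Opening a FIFO for writing blocks until a reader has it open.
    """
    with open(fifo_path, "w") as fifo:
        fifo.write(message.rstrip("\n") + "\n")

def read_events_from_fifo(fifo_path):
    """Read lines until every writer has closed the pipe."""
    with open(fifo_path, "r") as fifo:
        return [line.rstrip("\n") for line in fifo]
```

Because the data lives only in memory, anything written while no reader is consuming it is at risk if the host or the reader fails, which is the trade-off noted in the list above.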

Network ports

UDP and TCP ports can feed data into the Splunk server. UDP and TCP behave differently, and these behaviors affect how data arrives for processing. When configuring network ports, please keep in mind that you cannot use ports lower than 1024 if you have not installed Splunk as root.

UDP

UDP is a best-effort protocol. This means that you might not get messages if the network is congested or briefly interrupted. You also can't be absolutely sure the messages aren't spoofed or altered in transit. UDP should be reserved for logging implementations focused on day-to-day troubleshooting rather than compliance or security.

Splunk Enterprise can read directly from the network on any UDP port. This technique is most often used to make Splunk act directly as a syslog server by reading remote syslog events on UDP port 514. However, it also can be used for any other UDP source of logging data, including SNMP.
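
From the sender's side, feeding a UDP input amounts to sending each event as one datagram. The sketch below assumes a listener is already configured; send_udp_event is a hypothetical helper, and the host name in the usage note is a placeholder.

```python
import socket

def send_udp_event(message, host, port):
    """Send one event as a single UDP datagram.

    UDP gives no delivery guarantee: sendto() succeeding does not mean
    the receiver got the message.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode("utf-8"), (host, port))
```

For example, send_udp_event("<13>myapp: login failed", "splunk.example.com", 514) would send a syslog-style line (priority 13 is user.notice) to the syslog port mentioned above.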

Like all of the network streaming-based approaches, direct UDP input is higher performance than reading files from disk.

TCP

TCP is a reliable, high-performance choice for many situations, as this protocol includes checks to ensure that data has arrived safely and intact. Splunk with an Enterprise license can receive data on any TCP port, allowing Splunk to receive remote data from syslog-ng and alternative syslog implementations that use TCP for security or reliability. This feature is the foundation of Splunk's distributed data access.

Please note: If the sending process buffers data such that events are broken into multiple pieces, Splunk may interpret the parts as multiple events. This is more likely if events are being generated intermittently, as there may be long pauses (several seconds or longer) between blocks of buffered data. If you notice truncated events, try forcing the process to send events atomically.
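
One way to reduce the risk of split events is for the sender to hand each complete event to the socket in a single call, instead of writing it in pieces as it is generated. This is a sketch under that assumption; send_event_atomically is a hypothetical name, and TCP itself may still segment the bytes on the wire.

```python
import socket

def send_event_atomically(sock, event):
    """Send one complete, newline-terminated event with a single sendall() call.

    Building the whole payload first avoids long pauses in the middle
    of an event caused by application-level buffering.
    """
    payload = (event.rstrip("\n") + "\n").encode("utf-8")
    sock.sendall(payload)
```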

Scripted inputs

Splunk can be configured to run an arbitrary shell command on any schedule, and then pipe the output to Splunk for processing. Examples of shell scripts that process meaningful data for Splunk to digest include:

  • vmstat, iostat, netstat, and any other network or system status commands.
  • SQL DBI
  • HTTP and HTTPS requests
  • SNMP

See Configure scripted inputs for details on how to set this up.
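
A scripted input only needs to write events to standard output. The following is a minimal sketch of such a script, assuming a Unix-like host; the event layout (a timestamp followed by key=value pairs) and the name format_load_event are illustrative choices, not a format required by Splunk.

```python
import os
import time

def format_load_event(timestamp, load1, load5, load15):
    """Render one timestamped key=value event line (UTC timestamp)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp))
    return f"{stamp} load1={load1:.2f} load5={load5:.2f} load15={load15:.2f}"

if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    print(format_load_event(time.time(), *os.getloadavg()))
```

Each scheduled run prints one line, which would then be piped to Splunk for processing as described above.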

Indexing properties

A distinguishing characteristic of Splunk is that it can process any IT data, regardless of format. It automatically learns event boundaries, classifies events and sources, and finds timestamps. However, sometimes you may want to change or augment Splunk's default processing. You can do this by setting indexing properties in a props.conf file in the $SPLUNK_HOME/etc/bundles/local directory. (Read more about bundles.)

Some attributes within props.conf can be customized by defining new stanzas in other configuration files, most commonly transforms.conf, which defines regex-based rules for extracting fields, correlating events and performing other transformations. Segmenters.conf, outputs.conf and metaevents.conf can also define attribute values that can be referenced by props.conf.

Common use cases for custom indexing properties include:
