Documentation: 3.4.1
This page last updated: 12/31/08 02:12pm

Files and directories

Point Splunk at a file or a directory. If you specify a directory, Splunk consumes everything in the directory. Splunk has two different file input processors: monitor and batch. For the most part, use monitor to input all your data sources from files and directories. The only time you should use batch is to load a large archive of historical files. Read on for more specifics.

Monitor

Specify a path to a file or directory and Splunk's monitor processor consumes any new input. You can also specify a mounted or shared directory, including network filesystems, as long as the Splunk server can see the directory. If the specified directory contains subdirectories, Splunk recursively examines them for new files.

Splunk checks for new files and directories only when the Splunk server starts or restarts. If you don't want to restart the server, add new inputs yourself as they become available. If you want Splunk to find potential new inputs automatically, use crawl.

When using monitor:

  • Files can be opened or closed for writing. Splunk consumes files even if they're still being written to by the operating system.
  • Files or directories can be included or excluded via whitelists and blacklists.
  • Upon restart, Splunk continues processing files where it left off.
  • Splunk uncompresses archive files before it indexes them. It can handle the following common archive file types: .tar, .gz, .bz2, .tar.bz2, and .zip.
  • Splunk detects log file rotation and does not process renamed files it has already indexed (with the exception of .tar and .gz archives; for more information see "Log file rotation" in this manual).
  • The entire path dir/filename must not exceed 1024 characters.
  • Set the sourcetype for directories to Automatic. If the directory contains multiple files of different formats, do not set a value for the source type manually. Manually setting a source type forces a single source type for all files in that directory.
  • Removing an input does not stop the indexing of files that Splunk has already picked up. It stops Splunk from checking those files again, but all of their initial content is still indexed. To stop all in-process data, you must restart the Splunk server.

Note: You cannot currently use both monitor and file system change monitor to follow the same directory or file. If you want to see changes in a directory, use file system change monitor. If you want to index new events in a directory, use monitor.

Batch

Use the batch processor at the CLI or in inputs.conf to load files once and destructively. By default, Splunk's batch processor is located in $SPLUNK_HOME/var/spool/splunk. If you move a file into this directory, Splunk indexes it and deletes it.

Note: Batch is most useful for loading in historical data, such as large archives of files. For best practices on loading file archives, see "How to index different sized archives".

Splunk Web

Add inputs from files and directories via Splunk Web.

1. Click Admin in the upper right-hand corner of Splunk Web.

2. Then click Data Inputs.

3. Pick files and directories.

4. Click New Input to add an input.

5. Under Data access, pick Monitor a directory.

You can also:

  • Upload a local file
    • Upload a file from your local machine into Splunk.
  • Index a file on the Splunk server
    • Copy a file on the server into Splunk via the batch directory.

6. Specify the pathname to the file or directory. If you select Upload, use the Browse... button.
To monitor a shared network drive, enter the following: <myhost>/<mypath> (or \\<myhost>\<mypath> on Windows). Make sure your Splunk server can see the mounted drive.

7. Under the Host heading, select the host name. You have several choices if you are using Monitor or Batch methods. Learn more about setting host value.
Note: Host only sets the host field in Splunk. It does not direct Splunk to look on a specific host on your network.

8. Now set the Source Type. Source type is a default field added to events. Source type is used to determine processing characteristics such as timestamps and event boundaries. Learn more about source type.

9. After specifying the source, host, and source type, click Submit.

CLI

Monitor files and directories via Splunk's Command Line Interface (CLI). To use Splunk's CLI, navigate to the $SPLUNK_HOME/bin/ directory and use the ./splunk command from the UNIX or Windows command prompt. Or add Splunk to your path and use the splunk command.

If you get stuck, Splunk's CLI has built-in help. Access the main CLI help by typing splunk help. Individual commands have their own help pages as well -- type splunk help <command>.

The following commands are available for input configuration via the CLI:

Command   Command syntax                                Action
add       add monitor $SOURCE [-parameter value] ...    Add inputs from $SOURCE.
edit      edit monitor $SOURCE [-parameter value] ...   Edit a previously added input for $SOURCE.
remove    remove monitor $SOURCE                        Remove a previously added $SOURCE.
list      list monitor                                  List the currently configured monitor.
spool     spool source                                  Copy a file into Splunk via the sinkhole directory.

Change the configuration of each data input type by setting additional parameters. Parameters are set via the syntax: -parameter value.

Note: You can only set one -hostname, -hostregex or -hostsegmentnum per command.

Required parameters

Parameter        Description
source           Path to the file or directory to monitor for new input.

Optional parameters

Parameter        Description
sourcetype       Specify a sourcetype field value for events from the input source.
index            Specify the destination index for events from the input source.
hostname         Specify a host name to set as the host field value for events from the input source.
hostregex        Specify a regular expression on the source file path to set as the host field value for events from the input source.
hostsegmentnum   Set the number of segments of the source file path to set as the host field value for events from the input source.
follow-only      (T | F) True or False. Default False. When set to True, Splunk will read from the end of the source (like the "tail -f" Unix command).

Example

The following example shows how to monitor files in /var/log/:

Add /var/log/ as a data input:

./splunk add monitor /var/log/
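
The parameter rules above can be checked before you run the CLI. The following Python sketch is illustrative only: build_add_monitor_command is a hypothetical helper, not part of Splunk, and syslog is a placeholder source type name. It assembles the arguments for ./splunk add monitor from the parameters listed above and rejects more than one host option per command.

```python
def build_add_monitor_command(source, sourcetype=None, index=None,
                              hostname=None, hostregex=None,
                              hostsegmentnum=None, follow_only=None):
    """Return the argument list for a './splunk add monitor' call.

    Hypothetical helper: it only formats the documented parameters
    using the '-parameter value' syntax and does not run Splunk.
    """
    host_options = {
        "hostname": hostname,
        "hostregex": hostregex,
        "hostsegmentnum": hostsegmentnum,
    }
    # Only one of -hostname, -hostregex, -hostsegmentnum is allowed per command.
    chosen = [name for name, value in host_options.items() if value is not None]
    if len(chosen) > 1:
        raise ValueError("set only one of: " + ", ".join(chosen))

    args = ["./splunk", "add", "monitor", source]
    optional = {"sourcetype": sourcetype, "index": index, **host_options}
    for name, value in optional.items():
        if value is not None:
            args += ["-" + name, str(value)]
    if follow_only is not None:
        # The parameter table lists the accepted values as T | F.
        args += ["-follow-only", "T" if follow_only else "F"]
    return args


print(" ".join(build_add_monitor_command("/var/log/",
                                         sourcetype="syslog",
                                         index="main")))
# prints: ./splunk add monitor /var/log/ -sourcetype syslog -index main
```

Returning a list rather than a single string keeps each argument separate, so a path that contains spaces can be passed on to a process launcher without extra quoting.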

Inputs.conf

To add an input, add a stanza for it to inputs.conf in $SPLUNK_HOME/etc/system/local/, or your own custom application directory in $SPLUNK_HOME/etc/apps/. If you have not worked with Splunk's configuration files before, read how configuration files work before you begin.

You can set any number of attributes and values following an input type. If you do not specify a value for one or more attributes, Splunk uses the defaults that are preset in $SPLUNK_HOME/etc/system/default/ (noted below).

Monitor

[monitor://<path>]
<attribute1> = <val1>
<attribute2> = <val2>
...

This type of input stanza (monitor) directs Splunk to watch all files in the <path> (or just <path> itself if it represents a single file). You must specify the input type and then the path, so put three slashes in your path if you're starting at root. You can use wildcards for the path; see below.

Note: To ensure new events are indexed when you copy over an existing file with new contents, set CHECK_METHOD = modtime in props.conf for the source. This checks the modtime of the file and re-indexes when it changes. Note that the entire file is indexed, which can result in duplicate events.

host = <string>

  • Set the host value of your input to a static value.
  • host= is automatically prepended to the value when this shortcut is used.
  • Defaults to the IP address or fully qualified domain name of the host where the data originated.
  • For more information about the host field, see the host section.

index = <string>

  • Set the index where events from this input will be stored.
  • index= is automatically prepended to the value when this shortcut is used.
  • Defaults to main (or whatever you have set as your default index).
  • For more information about the index field, see the data management section.

sourcetype = <string>

  • Set the sourcetype name of events from this input.
  • sourcetype= is automatically prepended to the value when this shortcut is used.
  • Splunk automatically picks a source type based on various aspects of your data. There is no hard-coded default.
  • For more information about the sourcetype field, see the source type section.

source = <string>

  • Set the source name of events from this input.
  • Defaults to the file path.
  • source= is automatically prepended to the value when this shortcut is used.

queue = <string> (parsingQueue, indexQueue, etc)

  • Specify where the input processor should deposit the events that it reads.
  • Can be any valid, existing queue in the pipeline.
  • Defaults to parsingQueue.

host_regex = <regular expression>

  • If specified, the regex extracts host from the filename of each input.
  • Specifically, the first group of the regex is used as the host.
  • Defaults to the default host= attribute if the regex fails to match.

host_segment = <integer>

  • If specified, the '/'-separated segment of the path at this position is set as host.
  • Defaults to the default host:: attribute if the value is not an integer, or is less than 1.
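
The two host attributes above can be modeled in a few lines of Python. This is a sketch of the described behavior, not Splunk's implementation. The function names, the sample paths, and the host name web01 are hypothetical. It also assumes that segments are counted from 1 starting at the left of the path, and that the regex is searched against the full source path.

```python
import re


def host_from_regex(path, host_regex, default_host):
    """Use the first capture group as host; fall back if there is no match."""
    match = re.search(host_regex, path)
    host = match.group(1) if match and match.groups() else None
    return host if host else default_host


def host_from_segment(path, host_segment, default_host):
    """Use the Nth '/'-separated path segment as host (assumed 1-based)."""
    if not isinstance(host_segment, int) or host_segment < 1:
        return default_host
    segments = [part for part in path.split("/") if part]
    if host_segment > len(segments):
        return default_host
    return segments[host_segment - 1]


path = "/var/log/web01/access.log"
print(host_from_regex(path, r"/var/log/([^/]+)/", "splunk-server"))  # web01
print(host_from_segment(path, 3, "splunk-server"))                   # web01
print(host_from_segment(path, 0, "splunk-server"))                   # splunk-server
```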

crcSalt = <string>

  • If set, this string is added to the CRC.
  • Use this setting to force Splunk to consume files that have matching CRCs.
  • If set to crcSalt = <source>, then the full source path is added to the CRC.
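
To see why a salt forces Splunk to treat files with identical content as different, consider the following Python sketch. It is an assumption-laden illustration: this page does not say which CRC algorithm Splunk uses, how much of the file it covers, or how the salt is mixed in. The sketch uses zlib.crc32 over the whole content with the salt prepended, and file_crc is a hypothetical name.

```python
import zlib


def file_crc(content, crc_salt=None, source_path=None):
    """Return a CRC of the content, optionally salted (illustrative only)."""
    salt = ""
    if crc_salt == "<source>":
        # crcSalt = <source> adds the full source path to the CRC.
        salt = source_path or ""
    elif crc_salt:
        salt = crc_salt
    return zlib.crc32(salt.encode("utf-8") + content)


content = b"2008-12-31 14:12:00 service started\n"

# Without a salt, two files with the same content share one CRC.
print(file_crc(content) == file_crc(content))  # True

# With crcSalt = <source>, the differing paths give differing CRCs.
crc_a = file_crc(content, "<source>", "/var/log/a/app.log")
crc_b = file_crc(content, "<source>", "/var/log/b/app.log")
print(crc_a == crc_b)  # False
```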

followTail = 0|1

  • If set to 1, monitoring begins at the end of the file (like tail -f).
  • This only applies to files the first time they are picked up.
  • After that, Splunk's internal file position records keep track of the file.
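
The interaction between followTail and the stored file position can be sketched as follows. This Python example only models the behavior described above. The positions dictionary stands in for Splunk's internal file position records, and read_new_data and demo are hypothetical names.

```python
import os
import tempfile


def read_new_data(path, positions, follow_tail=False):
    """Return data added to the file since the last recorded position."""
    if path not in positions:
        # First pickup: start at the end if follow_tail is set, else at 0.
        positions[path] = os.path.getsize(path) if follow_tail else 0
    with open(path, "rb") as handle:
        handle.seek(positions[path])
        data = handle.read()
        positions[path] = handle.tell()
    return data


def demo():
    positions = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.log")
        with open(path, "wb") as handle:
            handle.write(b"old event\n")
        # First pickup with follow_tail: existing content is skipped.
        first = read_new_data(path, positions, follow_tail=True)
        with open(path, "ab") as handle:
            handle.write(b"new event\n")
        # Later reads resume from the recorded position.
        second = read_new_data(path, positions, follow_tail=True)
    return first, second


print(demo())  # (b'', b'new event\n')
```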

_whitelist = <regular expression>

  • If set, files from this path are monitored only if they match the specified regex.

_blacklist = <regular expression>

  • If set, files from this path are NOT monitored if they match the specified regex.
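
The effect of _whitelist and _blacklist on a set of files can be sketched in Python. This is a model of the two rules above, not Splunk's code. It assumes the regexes are searched anywhere in the path (unanchored) and that both attributes can be set on one stanza. The function name is_monitored and the sample paths are hypothetical.

```python
import re


def is_monitored(path, whitelist=None, blacklist=None):
    """Apply the whitelist and blacklist rules to one file path."""
    if whitelist is not None and not re.search(whitelist, path):
        return False  # a whitelist is set and the file does not match it
    if blacklist is not None and re.search(blacklist, path):
        return False  # the file matches the blacklist
    return True


paths = ["/var/log/app.log", "/var/log/app.log.gz", "/var/log/notes.txt"]
selected = [p for p in paths
            if is_monitored(p, whitelist=r"\.log", blacklist=r"\.gz$")]
print(selected)  # ['/var/log/app.log']
```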

Wildcards

You can use wildcards to specify your input path for monitored input. Use ... for paths and * for files.

  • ... recurses through directories until the match is met. This means that /foo/.../bar matches /foo/bar, /foo/1/bar, /foo/1/2/bar, and so on, but only if bar is a file.
    • To recurse through a subdirectory, append another ... to the path, for example: /foo/.../bar/...
  • * matches anything in that specific path segment. It cannot be used inside of a directory path; it must be used in the last segment of the path. For example /foo/*.log matches /foo/bar.log but not /foo/bar.txt or /foo/bar/test.log.
  • Combine * and ... for more specific matches:
    • foo/.../bar/* matches any file in the bar directory within the specified path.

Note: In Windows, you must use two backslashes \\ to escape wildcards. Regexes with backslashes in them are not currently supported for _whitelist and _blacklist in Windows.

Specifying wildcards results in an implicit _whitelist created for that stanza. The longest fully qualified path is used as the monitor stanza, and the wildcards are translated into regular expressions using the following map:

Wildcard   Regex    Meaning
*          [^/]*    anything but /
...        .*       anything (greedy)
.          \.       literal .

For example, if you specify

[monitor:///foo/bar*.log]

Splunk translates this into
[monitor:///foo/]
_whitelist = bar[^/]*\.log

As a consequence, you can't have multiple stanzas with wildcards for files in the same directory.

For example:

[monitor:///foo/bar_baz*]
[monitor:///foo/bar_qux*]

This results in overlapping stanzas indexing the directory /foo/. Splunk takes the first one, so only files starting with /foo/bar_baz will be indexed. To include both sources, manually specify a _whitelist using regular expression syntax for "or":
[monitor:///foo]
_whitelist = (bar_baz[^/]*|bar_qux[^/]*)
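
The translation described in this section can be reproduced with a short Python sketch. It is an approximation under stated assumptions: translate_monitor_path is a hypothetical function, it applies only the three mappings from the table above, and it takes "the longest fully qualified path" to mean everything up to the last / before the first wildcard.

```python
import re


def translate_monitor_path(path):
    """Split a wildcard path into (monitored path, implicit whitelist regex)."""
    hits = [i for i in (path.find("..."), path.find("*")) if i != -1]
    if not hits:
        return path, None  # no wildcard, so no implicit whitelist
    split_at = path.rfind("/", 0, min(hits)) + 1
    base, pattern = path[:split_at], path[split_at:]

    parts = []
    for token in re.split(r"(\.\.\.|\*)", pattern):
        if token == "...":
            parts.append(".*")        # anything (greedy)
        elif token == "*":
            parts.append("[^/]*")     # anything but /
        else:
            parts.append(token.replace(".", r"\."))  # literal .
    return base, "".join(parts)


print(translate_monitor_path("/foo/bar*.log"))

# Both wildcard stanzas reduce to the same monitored directory,
# which is why they overlap.
base_a, regex_a = translate_monitor_path("/foo/bar_baz*")
base_b, regex_b = translate_monitor_path("/foo/bar_qux*")
print(base_a == base_b)

# Joining the two implicit whitelists with "or" gives the manual form.
print("(" + regex_a + "|" + regex_b + ")")
```

For these inputs, the last line prints the same expression as the manual _whitelist above.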

Note: To set any additional attributes (such as sourcetype) for multiple whitelisted/blacklisted inputs that may have different attributes, use props.conf.

Examples

To load anything in /apache/foo/logs, /apache/bar/logs, and so on:

[monitor:///apache/.../logs]

To load anything in /apache/ that ends in .log:
[monitor:///apache/*.log]

Batch

[batch://<path>]
move_policy = sinkhole
<attribute1> = <val1>
<attribute2> = <val2>
...

Use batch to set up a one time, destructive input of data from a source. For continuous, non-destructive inputs, use monitor.
Note: You must set move_policy = sinkhole. This loads the file destructively. Do not use this input type for files you do not want to consume destructively.
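
What "destructive" means for a batch input can be illustrated with a small Python model. It is not Splunk's implementation. consume_spool and index_event_data are hypothetical names, and the temporary directory stands in for a batch directory. Each file is handed to the indexing callback once and then deleted.

```python
import os
import tempfile


def consume_spool(spool_dir, index_event_data):
    """Index every file in spool_dir once, then delete it."""
    consumed = []
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as handle:
            index_event_data(path, handle.read())
        os.remove(path)  # the original file is gone after loading
        consumed.append(name)
    return consumed


indexed = {}
with tempfile.TemporaryDirectory() as spool:
    with open(os.path.join(spool, "archive.log"), "wb") as handle:
        handle.write(b"historical event\n")
    names = consume_spool(spool, lambda p, data: indexed.update({p: data}))
    remaining = os.listdir(spool)

print(names)      # ['archive.log']
print(remaining)  # []
```

Because the source file does not survive the load, keep a copy elsewhere if you need the original after indexing.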

host = <string>

  • Set the host value of your input to a static value.
  • host= is automatically prepended to the value when this shortcut is used.
  • Defaults to the IP address or fully qualified domain name of the host where the data originated.
  • For more information about the host field, see the host section.

index = <string>

  • Set the index where events from this input will be stored.
  • index= is automatically prepended to the value when this shortcut is used.
  • Defaults to main (or whatever you have set as your default index).
  • For more information about the index field, see the data management section.

sourcetype = <string>

  • Set the sourcetype name of events from this input.
  • sourcetype= is automatically prepended to the value when this shortcut is used.
  • Splunk automatically picks a source type based on various aspects of your data. There is no hard-coded default.
  • For more information about the sourcetype field, see the source type section.

source = <string>

  • Set the source name of events from this input.
  • Defaults to the file path.
  • source= is automatically prepended to the value when this shortcut is used.

queue = <string> (parsingQueue, indexQueue, etc)

  • Specify where the input processor should deposit the events that it reads.
  • Can be any valid, existing queue in the pipeline.
  • Defaults to parsingQueue.

host_regex = <regular expression>

  • If specified, the regex extracts host from the filename of each input.
  • Specifically, the first group of the regex is used as the host.
  • Defaults to the default host= attribute if the regex fails to match.

host_segment = <integer>

  • If specified, the '/'-separated segment of the path at this position is set as host.
  • Defaults to the default host:: attribute if the value is not an integer, or is less than 1.

Note: source = <string> and <KEY> = <string> are not used by batch.

Example

This example batch loads all files from the directory /system/flight815/.

[batch:///system/flight815/*]
move_policy = sinkhole

