Documentation: 3.3
This page last updated: 08/07/08 05:08pm

Configure crawl

Use crawl to search your filesystem for new data sources to add to your index. Configure one or more types of crawlers in crawl.conf to define the type of data sources to include in or exclude from your results.

Configuration

Edit crawl.conf to configure one or more crawlers that browse your data sources when you run the crawl command. Define each crawler by specifying values for each of the crawl options. Enable the crawler by adding it to crawlers_list.

Crawl logging

The crawl command produces a log of crawl activity that's stored in $SPLUNK_HOME/var/log/splunk/crawl.log. Set the logging level with the logging key in the [default] stanza.

Example:
Set the logging level of crawl to warn.

[default]
logging = warn

Enable crawlers

Enable a crawler by listing the crawler specification stanza name in the crawlers_list key of the [crawlers] stanza.

Use a comma-separated list to specify multiple crawlers.

Example:
Enable crawlers that are defined in the stanzas: [file_crawler], [port_crawler], and [db_crawler].

[crawlers]
crawlers_list = file_crawler, port_crawler, db_crawler

Define crawlers

Define a crawler by adding a definition stanza to crawl.conf. To define more crawlers, add one stanza per crawler.

Example crawler stanzas in crawl.conf:

[Example_crawler_name]
....

[Another_crawler_name]
....

Add key/value pairs to crawler definition stanzas to set a crawler's behavior. The following keys are available for defining a file_crawler:

bad_directories_list: Specify directories to exclude.
bad_extensions_list: Specify file extensions to exclude.
bad_file_matches_list: Specify a string, or a comma-separated list of strings, to match against filenames. crawl excludes files whose names match. You can use wildcards (examples: foo*.*, foo*bar, *baz*).
packed_extensions_list: Specify extensions of compressed files to include. Leave this empty if you don't want to add any compressed files.
collapse_threshold: Specify the minimum number of files a source must have to be considered a directory.
days_sizek_pairs_list: Specify a comma-separated list of age (days) and size (KB) pairs to constrain which files are crawled. For example, days_sizek_pairs_list = 7-0, 30-1000 tells Splunk to crawl only files that were modified within the last 7 days and are at least 0 KB in size, or that were modified within the last 30 days and are at least 1000 KB in size.
big_dir_filecount: Set the maximum number of files a directory can have in order to be crawled. crawl excludes directories that contain more files than the maximum you specify.
index: Specify the name of the index to add crawled file and directory contents to.
max_badfiles_per_dir: Specify how many files crawl examines in a directory while looking for valid files. If crawl does not find a valid file within that many files, it excludes the directory.
root: Specify the directories for a crawler to crawl through.
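
As an illustration of the root and index keys, a stanza that limits a crawler to a single directory might look like the following sketch. The stanza name log_dir_crawler and the path /var/log are hypothetical, chosen only for this example. index = main is the value used elsewhere on this page.

[log_dir_crawler]
root = /var/log
index = main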

Example

A simple file_crawler may look like:

[simple_file_crawler]
bad_directories_list = bin, sbin, boot, mnt, proc, tmp, temp, home, mail, .thumbnails, cache, old
bad_extensions_list = mp3, mpg, jpeg, jpg, m4, mcp, mid
bad_file_matches_list = *example*, *makefile, core.*
packed_extensions_list = gz, tgz, tar, zip
collapse_threshold = 10
days_sizek_pairs_list = 3-0, 7-1000, 30-10000
big_dir_filecount = 100
index = main
max_badfiles_per_dir = 100
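
To run this crawler, its stanza name must also appear in crawlers_list, as described under Enable crawlers. A minimal sketch, assuming simple_file_crawler is the only crawler you want enabled:

[crawlers]
crawlers_list = simple_file_crawler

Going by the description of days_sizek_pairs_list above, the pairs 3-0, 7-1000, and 30-10000 in this example would select three groups of files. The first is files modified within the last 3 days at any size. The second is files modified within the last 7 days that are at least 1000 KB. The third is files modified within the last 30 days that are at least 10000 KB.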
