Preview [ Preview documentation: caution, tech writers working. ]
This page last updated: 06/24/08 10:06am

crawl

Splunk Preview introduces a new search feature, crawl, that searches your filesystem for new data sources to add to your index. Configure one or more types of crawlers in crawl.conf to define the type of data sources to include in or exclude from your results. Save this crawl search and schedule it to run regularly to update your indexes.

This topic explains how to use the crawl command, save and schedule a crawl search, and configure different crawlers.

Note: Splunk Preview currently supports one type of crawler, labeled file_crawler. As yet, you cannot define a custom crawler.

Use crawl

In Splunk Web, you can access and run the crawl command from the Splunk search bar and the Admin > Data Inputs: Crawls page.

The Splunk search bar
You can run the crawl command directly from the search bar:

| crawl

If you run a crawl without arguments, Splunk searches your filesystem with the settings defined in crawl.conf. To override these default settings, specify crawl options at search time.

The Admin page
You can manage all your saved crawls from the Admin > Data Inputs: Crawls page. From this page, you can also run the default crawl search by clicking New Crawl:

| crawl | search NOT *personal*

After the crawl completes, you can add or remove options to narrow your search.

Results of a crawl

For each item listed in your crawl results, Splunk displays whether or not it is a file, a timestamp indicating when it was last modified, its size, and its status (whether it is added or not added to your inputs). You can perform two actions on each data source: Add input and Preview file/directory.

Preview file or directory

To review the contents of the data source before adding it as an input, click Preview file or Preview directory.

A new window opens:

  • If you click Preview file on a file, Splunk returns events from the file.
  • If you click Preview directory on a directory, Splunk displays a list of the files in the directory and lets you drill down further to preview each file.

Add input

To add the selected data source as an input, click Add input.

Now, when you go to the Admin page and select the Data Inputs tab, your selected data source is listed.

Note: Adding data inputs with crawl modifies your inputs.conf file to include a stanza describing the new source. For example, if crawl discovers /var/log, clicking Add input adds the following stanza to inputs.conf:

[tail:///var/log]
disabled = false
index = main
_class = crawl
_generator = ui
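
The shape of that stanza can be sketched in a few lines of Python. This is only an illustration based on the /var/log example above, not Splunk's own code: the function name render_crawl_input_stanza is hypothetical, and the assumption is that every crawled path gets the same keys as the example.

```python
def render_crawl_input_stanza(path, index="main"):
    """Render a stanza shaped like the /var/log example (illustrative only)."""
    lines = [
        "[tail://" + path + "]",  # tail:// followed by the absolute path
        "disabled = false",
        "index = " + index,
        "_class = crawl",          # keys copied from the documented example
        "_generator = ui",
    ]
    return "\n".join(lines)


print(render_crawl_input_stanza("/var/log"))
```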

Save a crawl

After you run a crawl search, save the search by clicking the Save this Crawl... link located above your search results. This action opens the Admin > Data Inputs: Crawls: Create Crawl page which prompts you to:

  • Name your crawl search.
  • If necessary, edit your search.
  • If desired, elect to run your crawl on a schedule.
  • Click Cancel to return to the Admin > Data Inputs page.
  • Click Save to save your crawl search.

Note: Your crawl is not saved if you don't provide a name.

Manage saved crawls

Manage your saved crawl searches from the Admin > Data Inputs: Crawls page. You can run a new crawl or select one or more saved crawls to:

  • Run Now and update your indexes.
  • Enable or Disable so that you can start or stop updating particular indexes.
  • Delete to remove the search from your list.

Edit the search and schedule properties of an individual crawl by clicking on its Name.

Note: You cannot change the name of your saved crawl.

Schedule saved crawls

When scheduling your saved crawls, you can define the type of schedule and how frequently to run it. You can also set alert options and define fields to include in summary indexes. These options are exactly the same as options provided for saving regular (non-crawl) searches.

Configure crawl

Configure crawl in two ways:

  • Configure default crawl settings in crawl.conf.
  • Override default settings at search time by specifying arguments (crawl options) for the crawl command. If you use crawl with no arguments, then Splunk uses all of the default settings in crawl.conf.

Edit crawl.conf to define and enable one or more crawlers that browse your data sources when you run the crawl command. You define each crawler by specifying values for each of the crawl options. You enable the crawler by adding it to crawlers_list.

Crawl logging

The crawl command produces a log of crawl activity that's stored in /splunkpreview/var/log/splunk/crawl.log. Set the logging level with the logging key in the [default] stanza.

Example:
Set the logging level of crawl to warn.

[default]
logging=warn

Enable crawlers

Enable a crawler by listing the crawler specification stanza name in the crawlers_list key of the [crawlers] stanza.

Use a comma-separated list to specify multiple crawlers.

Example:
Enable crawlers that are defined in the stanzas: [file_crawler], [port_crawler], and [db_crawler].

[crawlers]
crawlers_list= file_crawler, port_crawler, db_crawler
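
To check which crawlers a crawl.conf enables, and whether each one has a matching definition stanza, a small script along these lines may help. It is a sketch that uses Python's standard configparser module on an inline sample. It is not how Splunk itself reads the file, and the helper names (load_conf, enabled_crawlers, undefined_crawlers) are hypothetical.

```python
import configparser
from typing import List

# Inline sample modeled on the example above; not a real shipped file.
SAMPLE_CRAWL_CONF = """\
[default]
logging = warn

[crawlers]
crawlers_list = file_crawler, port_crawler, db_crawler

[file_crawler]
index = main
"""


def load_conf(conf_text: str) -> configparser.ConfigParser:
    # interpolation=None keeps any "%" characters in values literal
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    return parser


def enabled_crawlers(parser: configparser.ConfigParser) -> List[str]:
    """Split crawlers_list on commas and strip surrounding whitespace."""
    raw = parser.get("crawlers", "crawlers_list", fallback="")
    return [name.strip() for name in raw.split(",") if name.strip()]


def undefined_crawlers(parser: configparser.ConfigParser) -> List[str]:
    """Return enabled crawler names that have no definition stanza."""
    return [n for n in enabled_crawlers(parser) if not parser.has_section(n)]


conf = load_conf(SAMPLE_CRAWL_CONF)
print(enabled_crawlers(conf))
print(undefined_crawlers(conf))
```

In this sample, port_crawler and db_crawler are enabled but not defined, which is the kind of mismatch the check is meant to surface.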

Define crawlers

Define a crawler by adding a definition stanza to crawl.conf. To define more crawlers, add more stanzas.

Example:

[Example_crawler_name]
....

[Another_crawler_name]
....

Add key/value pairs to crawler definition stanzas to set a crawler's behavior. The following keys are available for defining a file_crawler:

bad_directories_list= Specify directories to exclude.
bad_extensions_list= Specify file extensions to exclude.
bad_file_matches_list= Specify a string, or a comma-separated list of strings that filenames must contain to be excluded. You can use wildcards (examples: foo*.*,foo*bar, *baz*).
packed_extensions_list= Specify extensions of compressed files to exclude.
collapse_threshold= Specify the minimum number of files a source must have to be considered a directory.
days_sizek_pairs_list= Specify a comma-separated list of age (days) and size (kb) pairs to constrain what files are crawled. For example: days_sizek_pairs_list = 7-0, 30-1000 tells Splunk to crawl only files last modified within 7 days and at least 0kb in size, or modified within the last 30 days and at least 1000kb in size.
big_dir_filecount= Set the maximum number of files a directory can have in order to be crawled. crawl excludes directories that contain more than the maximum number you specify.
index= Specify the name of the index that receives crawled file and directory contents, for example main.
max_badfiles_per_dir= Specify how far to crawl into a directory for files. If Splunk crawls a directory and doesn't find valid files within the specified max_badfiles_per_dir, then Splunk excludes the directory.
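
The age and size pairs in days_sizek_pairs_list are the least obvious of these keys. The following Python sketch shows one reading of the rule described above: a file qualifies if it satisfies at least one pair, that is, it was modified within the given number of days and is at least the given size. The function names are hypothetical and the logic is an interpretation of the description, not Splunk's implementation.

```python
def parse_days_sizek_pairs(value):
    """Turn '7-0, 30-1000' into [(7, 0), (30, 1000)]."""
    pairs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        days, size_kb = item.split("-", 1)
        pairs.append((int(days), int(size_kb)))
    return pairs


def passes_age_size_filter(age_days, size_kb, pairs):
    """True if the file is recent enough and large enough for any one pair."""
    return any(
        age_days <= max_days and size_kb >= min_kb
        for max_days, min_kb in pairs
    )


pairs = parse_days_sizek_pairs("7-0, 30-1000")
print(passes_age_size_filter(3, 1, pairs))      # recent, any size
print(passes_age_size_filter(20, 500, pairs))   # older and under 1000kb
print(passes_age_size_filter(20, 2000, pairs))  # older but large enough
```

Under this reading, a 20-day-old file is crawled only if it is at least 1000kb, and a file older than 30 days is never crawled.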

Example:
A simple file_crawler.

[simple_file_crawler]
bad_directories_list= bin, sbin, boot, mnt, proc, tmp, temp, home, mail, .thumbnails, cache, old
bad_extensions_list= mp3, mpg, jpeg, jpg, m4, mcp, mid
bad_file_matches_list= *example*, *makefile, core.*
packed_extensions_list= gz, tgz, tar, zip
collapse_threshold= 10
days_sizek_pairs_list= 3-0, 7-1000, 30-10000
big_dir_filecount= 100
index=main
max_badfiles_per_dir=100
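
To get a feel for which filenames the exclusion keys in this example would skip, the following Python sketch applies the extension list and wildcard patterns with the standard fnmatch module. It rests on assumptions that this topic does not confirm: that matching is against the filename only, that it is case-insensitive, and that shell-style wildcards behave the same way as crawl's. The function name is_excluded is hypothetical.

```python
import os
from fnmatch import fnmatchcase

# Values copied from the [simple_file_crawler] example above.
BAD_EXTENSIONS = ["mp3", "mpg", "jpeg", "jpg", "m4", "mcp", "mid"]
BAD_FILE_MATCHES = ["*example*", "*makefile", "core.*"]


def is_excluded(path):
    """Return True if the file's extension or name matches an exclusion."""
    # Assumption: compare the lowercased filename, not the full path.
    name = os.path.basename(path).lower()
    extension = os.path.splitext(name)[1].lstrip(".")
    if extension in BAD_EXTENSIONS:
        return True
    # fnmatchcase matches the whole name, so each pattern is anchored
    # at both ends.
    return any(fnmatchcase(name, pattern) for pattern in BAD_FILE_MATCHES)


for candidate in ["/var/log/messages", "/opt/app/Makefile", "/tmp/core.1234"]:
    print(candidate, is_excluded(candidate))
```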

Command syntax

| crawl [crawl option]...

Note: If any other command precedes crawl in a search pipeline, Splunk automatically discards the data generated ahead of crawl and outputs only the data that crawl generates. For example, if a search command precedes crawl in your search, Splunk discards the search results and outputs the crawl results.

Arguments

Note: The default values for crawl options are found in crawl.conf.spec.

file_crawler crawl options
crawl option
  • Syntax: bad_directories_list | bad_extensions_list | bad_file_matches_list | packed_extensions_list | collapse_threshold | days_sizek_pairs | big_dir_filecount | index | max_badfiles_per_dir
  • Description: Specify values to override key values in crawl.conf.

bad_directories_list
  • Syntax: bad_directories_list=string, string, ...
  • Description: Specify directories to exclude.

bad_extensions_list
  • Syntax: bad_extensions_list=string,string,...
  • Description: Specify file extensions to exclude.

bad_file_matches_list
  • Syntax: bad_file_matches=(string | string* | *string | *string* | *string*string | string*string*), ...
  • Description: Specify a string, or a comma-separated list of strings that filenames must contain to be excluded. You can use wildcards (examples: foo*.*,foo*bar, *baz*).

packed_extensions_list
  • Syntax: packed_extensions_list=string, string, ...
  • Description: Specify extensions of compressed files to exclude.

collapse_threshold
  • Syntax: collapse_threshold=integer (default=3)
  • Description: Specify the minimum number of files a source must have to be considered a directory.

days_sizek_pairs
  • Syntax: days_sizek_pairs=integer(days)-integer(kb), ... (default= 7-0, 30-1000)
  • Description: Specify a comma-separated list of age (days) and size (kb) pairs to constrain what files are crawled. For example: days_sizek_pairs_list = 7-0, 30-1000 tells Splunk to crawl only files last modified within 7 days and at least 0kb in size, or modified within the last 30 days and at least 1000kb in size.

big_dir_filecount
  • Syntax: big_dir_filecount=integer (default=10000)
  • Description: Set the maximum number of files a directory can have in order to be crawled. crawl excludes directories that contain more than the maximum number you specify.

index
  • Syntax: index=string (default=main)
  • Description: Specify the name of the index to add crawled file and directory contents to.

max_badfiles_per_dir
  • Syntax: max_badfiles_per_dir=integer (default=100)
  • Description: Specify how far to crawl into a directory for files. If Splunk crawls a directory and doesn't find valid files within the specified max_badfiles_per_dir, then Splunk excludes the directory.

Examples

The following command tells Splunk to browse for:

  • Directories that have no more than 100 files
  • Files last modified within 3 days and at least 0kb in size, or modified within the last 7 days and at least 1000kb in size

| crawl big_dir_filecount=100 days_sizek_pairs_list= 3-0,7-1000
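
If you generate crawl searches from a script, a small helper can catch misspelled option names before the search runs. The Python sketch below is hypothetical: build_crawl_search is not part of Splunk, and the set of accepted names simply collects the option names that appear in this topic (which uses both days_sizek_pairs and days_sizek_pairs_list, so both are accepted here). It makes no attempt to quote values, because this topic does not describe quoting rules.

```python
# Option names as they appear in this topic; both spellings of the
# age/size option are included because the topic uses both.
FILE_CRAWLER_OPTIONS = {
    "bad_directories_list",
    "bad_extensions_list",
    "bad_file_matches_list",
    "packed_extensions_list",
    "collapse_threshold",
    "days_sizek_pairs",
    "days_sizek_pairs_list",
    "big_dir_filecount",
    "index",
    "max_badfiles_per_dir",
}


def build_crawl_search(**options):
    """Assemble a '| crawl key=value ...' string from keyword arguments."""
    unknown = sorted(set(options) - FILE_CRAWLER_OPTIONS)
    if unknown:
        raise ValueError("unknown crawl option(s): " + ", ".join(unknown))
    parts = ["| crawl"]
    for key, value in options.items():  # keyword order is preserved
        parts.append("{}={}".format(key, value))
    return " ".join(parts)


print(build_crawl_search(big_dir_filecount=100,
                         days_sizek_pairs_list="3-0,7-1000"))
```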

crawl.conf

This is the default crawl.conf file that ships with Splunk.

# Copyright (C) 2005-2008 Splunk Inc.  All Rights Reserved.  Version 3.0
# 
# Crawl Configuration
#
# Set of attribute-values used by crawl.  
# 
# If attribute, ends in _list, the form is:
#
#      attr = val, val, val, etc.
#
# The space after the comma is necessary, so that "," can be used, as in BAD_FILE_PATTERNS's use of "*,v"
# Whitespace is stripped away and comments, such as this, are on lines that start with "#" 
#

[default]
logging = warn
  
[crawlers]
crawlers_list = file_crawler

[file_crawler]
# SEMICOLON SEPARATED LIST OF DIRECTORY LOCATIONS TO START FROM
root = /;/Library/Logs

# DIRECTORIES TO SKIP ALL TOGETHER. Consider "root" and "home"
bad_directories_list = bin, sbin, boot, mnt, proc, tmp, temp, dev, initrd, help, driver, drivers, 
share, bak, old, lib, include, doc, docs, man, html, images, tests, js, dtd, org, com, net, class, 
java, resource, locale, static, testing, src, sys, icons, css, dist, cache, users, system, resources, 
examples, gdm, manual, spool, lock, kerberos, .thumbnails, libs, old, manuals, splunk, mail, 
resources, documentation, applications, library, network, automount, mount, cores, lost\+found, fonts, 
extensions, components, printers, caches, findlogs, music, volumes, libexec,

# EXTENSIONS TO SKIP
bad_extensions_list = 0t, a, adb, ads, ali, am, asa, asm, asp, au, bak, bas, bat, bmp, c, cache, cc, 
cg, cgi, class, clp, com, conf, config, cpp, cs, css, csv, cxx, dat, doc, dot, dvi, dylib, ec, elc, 
eps, exe, f, f77, f90, for, ftn, gif, h, hh, hlp, hpp, hqx, hs, htm, html, hxx, icns, ico, ics, in, 
inc, jar, java, jin, jpeg, jpg, js, jsp, kml, la, lai, lhs, lib, license, lo, m, m4, mcp, mid, mp3, 
mpg, msf, nib, nsmap, o, obj, odt, ogg, old, ook, opt, os, os2, pal, pbm, pdf, pdf, pem, pgm, php, 
php3, php4, pl, plex, plist, plo, plx, pm, png, po, pod, ppd, ppm, ppt, prc, presets, ps, psd, psym, 
py, pyc, pyd, pyw, rast, rb, rc, rde, rdf, rdr, res, rgb, ro, rsrc, s, sgml, sh, shtml, so, soap, sql, 
ss, stg, strings, tcl, tdt, template, tif, tiff, tk, uue, v, vhd, wsdl, xbm, xlb, xls, xlw, xml, xsd, 
xsl, xslt, jame, d, ac, properties, pid, del, lock, md5, rpm, pp, deb, iso, vim, lng, list

# IMPLIED "$" (END OF FILENAME) AFTER EACH PATTERN HERE
bad_file_matches_list = *~, *#, *,v, *readme*, *install, (/|^).*, *passwd*, *example*, *makefile, core.*
packed_extensions_list = bz, bz2, tbz, tbz2, Z, gz, tgz, tar, zip

# ADD A DIRECTORY, RATHER THAN INDIVIDUAL FILES, IF IT HAS 1000 OR MORE FILES
collapse_threshold = 1000

# PAIRS OF MAXIMUM AGE AND MINIMUM SIZE.
# default is to accept text/archived files modified in the last 7 days
# with 0k, or modified in the last 30 days if it has at least 1000k
days_sizek_pairs_list = 7-0, 30-1000

# SKIP DIRECTORIES WITH TOO MANY FILES
big_dir_filecount = 10000

# DEFAULT INDEX TO ADD FILES
index = main

# SKIP DIRECTORIES AFTER INVESTIGATING N FILES WITHOUT FINDING SOMETHING WORTHWHILE
max_badfiles_per_dir = 100
