Splunk Preview introduces a new search feature, crawl, that searches your filesystem for new data sources to add to your index. Configure one or more types of crawlers in crawl.conf to define the type of data sources to include in or exclude from your results. Save this crawl search and schedule it to run regularly to update your indexes.
This topic explains how to use the crawl command, save and schedule a crawl search, and configure different crawlers.
Note: Splunk Preview currently supports one type of crawler, labeled file_crawler. As yet, you cannot define a custom crawler.
Use crawl
In Splunk Web, you can access and run the crawl command from the Splunk search bar and the Admin > Data Inputs: Crawls page.
The Splunk search bar
You can run the crawl command directly from the search bar:
The Admin page
You can manage all your saved crawls from the Admin > Data Inputs: Crawls page. From this page, you can also run the default crawl search by clicking New Crawl:
For each item listed in your crawl results, Splunk displays whether or not it is a file, a timestamp indicating when it was last modified, its size, and its status (whether it is added or not added to your inputs). You can perform two actions on each data source: Add input and Preview file/directory.
Preview file or directory
To review the contents of the data source before adding it as an input, click Preview file or Preview directory.
A new window opens:
To add the selected data source as an input, click Add input.
Now, when you go to the Admin page and select the Data Inputs tab, your selected data source is listed.
Note: Adding data inputs with crawl modifies your inputs.conf file to include a stanza describing the new source. For example, if crawl discovers /var/log, clicking Add input adds the following stanza to inputs.conf:
[tail:///var/log]
disabled = false
index = main
_class = crawl
_generator = ui
After you run a crawl search, save the search by clicking the Save this Crawl... link located above your search results. This action opens the Admin > Data Inputs: Crawls: Create Crawl page which prompts you to:
Note: Your crawl is not saved if you don't provide a name.
Manage saved crawls
Manage your saved crawl searches from the Admin > Data Inputs: Crawls page. You can run a new crawl or select one or more saved crawls to:
Edit the search and schedule properties of an individual crawl by clicking on its Name.
Note: You cannot change the name of your saved crawl.
Schedule saved crawls
When you schedule a saved crawl, you can define the type of schedule and how frequently the crawl runs. You can also set alert options and define fields to include in summary indexes. These options are the same as the options for saving regular (non-crawl) searches.
Configure crawl
Configure crawl in two ways:
Edit crawl.conf to define and enable one or more crawlers that browse your data sources when you run the crawl command. You define each crawler by specifying values for each of the crawl options. You enable the crawler by adding it to crawlers_list.
Crawl logging
The crawl command logs crawl activity to /splunkpreview/var/log/splunk/crawl.log. Set the logging level with the logging key in the [default] stanza.
Example:
Set the logging level of crawl to warn.
[default]
logging = warn
Enable a crawler by listing the crawler specification stanza name in the crawlers_list key of the [crawlers] stanza.
Use a comma-separated list to specify multiple crawlers.
Example:
Enable crawlers that are defined in the stanzas: [file_crawler], [port_crawler], and [db_crawler].
[crawlers]
crawlers_list = file_crawler, port_crawler, db_crawler
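To check which crawlers a crawlers_list setting enables, and whether each one has a definition stanza, a short script can help. The following is a minimal sketch, not Splunk's own parser: it assumes Python's standard configparser module reads the file closely enough for this purpose, and the helper names enabled_crawlers and undefined_crawlers and the sample text are hypothetical.

```python
import configparser

# Illustrative crawl.conf fragment based on the example above (hypothetical sample).
SAMPLE_CONF = """
[crawlers]
crawlers_list = file_crawler, port_crawler, db_crawler

[file_crawler]
index = main
"""


def enabled_crawlers(conf_text):
    """Return the crawler names listed in crawlers_list, in order."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    raw = parser.get("crawlers", "crawlers_list", fallback="")
    return [name.strip() for name in raw.split(",") if name.strip()]


def undefined_crawlers(conf_text):
    """Return enabled crawler names that have no definition stanza."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return [n for n in enabled_crawlers(conf_text) if not parser.has_section(n)]


if __name__ == "__main__":
    print(enabled_crawlers(SAMPLE_CONF))
    print(undefined_crawlers(SAMPLE_CONF))
```

In this sample, only file_crawler has a definition stanza, so the second helper reports the other two names as enabled but undefined.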
Define a crawler by adding a definition stanza to crawl.conf. To define more crawlers, add one stanza for each.
Example:
[Example_crawler_name]
....
[Another_crawler_name]
....
Add key/value pairs to crawler definition stanzas to set a crawler's behavior. The following keys are available for defining a file_crawler:
| bad_directories_list= | Specify directories to exclude. |
| bad_extensions_list= | Specify file extensions to exclude. |
| bad_file_matches_list= | Specify a string, or a comma-separated list of strings that filenames must contain to be excluded. You can use wildcards (examples: foo*.*,foo*bar, *baz*). |
| packed_extensions_list= | Specify extensions of compressed files to exclude. |
| collapse_threshold= | Specify the minimum number of files a source must have to be considered a directory. |
| days_sizek_pairs_list= | Specify a comma-separated list of age (days) and size (kb) pairs to constrain what files are crawled. For example: days_sizek_pairs_list = 7-0, 30-1000 tells Splunk to crawl only files last modified within 7 days and at least 0kb in size, or modified within the last 30 days and at least 1000kb in size. |
| big_dir_filecount= | Set the maximum number of files a directory can have in order to be crawled. crawl excludes directories that contain more than the maximum number you specify. |
| index= main | Specify the name of the index to add crawled file and directory contents to. |
| max_badfiles_per_dir= | Specify how far to crawl into a directory for files. If Splunk crawls a directory and doesn't find valid files within the specified max_badfiles_per_dir, then Splunk excludes the directory. |
Example:
A simple file_crawler.
[simple_file_crawler]
bad_directories_list = bin, sbin, boot, mnt, proc, tmp, temp, home, mail, .thumbnails, cache, old
bad_extensions_list = mp3, mpg, jpeg, jpg, m4, mcp, mid
bad_file_matches_list = *example*, *makefile, core.*
packed_extensions_list = gz, tgz, tar, zip
collapse_threshold = 10
days_sizek_pairs_list = 3-0, 7-1000, 30-10000
big_dir_filecount = 100
index = main
max_badfiles_per_dir = 100
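The days_sizek_pairs_list rule is easier to reason about when written as a predicate. The sketch below is an illustration under stated assumptions, not Splunk's implementation: it assumes a file qualifies when it satisfies at least one pair, where each pair means "modified within this many days and at least this many kilobytes". The function names parse_pairs and is_crawlable are hypothetical.

```python
def parse_pairs(value):
    """Parse a string such as '7-0, 30-1000' into [(7, 0), (30, 1000)]."""
    pairs = []
    for item in value.split(","):
        days, size_kb = item.strip().split("-")
        pairs.append((int(days), int(size_kb)))
    return pairs


def is_crawlable(age_days, size_kb, pairs):
    """Return True if the file meets at least one (max age, min size) pair."""
    return any(
        age_days <= max_days and size_kb >= min_kb
        for max_days, min_kb in pairs
    )


if __name__ == "__main__":
    pairs = parse_pairs("7-0, 30-1000")
    # A 20-day-old file needs at least 1000 KB to qualify under the second pair.
    print(is_crawlable(20, 500, pairs))
    print(is_crawlable(20, 2000, pairs))
```

With the pairs 7-0 and 30-1000, a small file is accepted only if it changed in the last 7 days, while a file of 1000 KB or more is accepted for up to 30 days.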
| crawl [crawl option]...
Note: If any other command precedes crawl in a search pipeline, Splunk discards the data generated ahead of crawl and outputs only the data generated by crawl. For example, if a search command precedes the crawl command in your search, Splunk discards the search results and returns the crawl results.
Arguments
Note: The default values for crawl options are found in crawl.conf.spec.
file_crawler crawl options
| crawl option | One of: bad_directories_list, bad_extensions_list, bad_file_matches_list, packed_extensions_list, collapse_threshold, days_sizek_pairs, big_dir_filecount, index, max_badfiles_per_dir | Specify values to override key values in crawl.conf. |
| bad_directories_list | bad_directories_list=string, string, ... | Specify directories to exclude. |
| bad_extensions_list | bad_extensions_list=string,string,... | Specify file extensions to exclude. |
| bad_file_matches_list | bad_file_matches_list=pattern, pattern, ... (each pattern takes one of the forms string, string*, *string, *string*, *string*string, string*string*) | Specify a string, or a comma-separated list of strings that filenames must contain to be excluded. You can use wildcards (examples: foo*.*, foo*bar, *baz*). |
| packed_extensions_list | packed_extensions_list=string, string, ... | Specify extensions of compressed files to exclude. |
| collapse_threshold | collapse_threshold=integer (default=3) | Specify the minimum number of files a source must have to be considered a directory. |
| days_sizek_pairs | days_sizek_pairs=integer(days)-integer(kb), ... (default= 7-0, 30-1000) | Specify a comma-separated list of age (days) and size (kb) pairs to constrain what files are crawled. For example: days_sizek_pairs_list = 7-0, 30-1000 tells Splunk to crawl only files last modified within 7 days and at least 0kb in size, or modified within the last 30 days and at least 1000kb in size. |
| big_dir_filecount | big_dir_filecount=integer (default=10000) | Set the maximum number of files a directory can have in order to be crawled. crawl excludes directories that contain more than the maximum number you specify. |
| index | index=string (default=main) | Specify the name of the index to add crawled file and directory contents to. |
| max_badfiles_per_dir | max_badfiles_per_dir=integer (default=100) | Specify how far to crawl into a directory for files. If Splunk crawls a directory and doesn't find valid files within the specified max_badfiles_per_dir, then Splunk excludes the directory. |
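To get a feel for how the wildcard examples for bad_file_matches_list behave, you can try them against sample file names. This sketch assumes shell-style matching against the whole file name, using Python's standard fnmatch module; that is an assumption for illustration, and Splunk's actual matching rules may differ. The function name is_excluded is hypothetical.

```python
from fnmatch import fnmatchcase


def is_excluded(filename, patterns):
    """Return True if the file name matches any exclusion pattern.

    Assumption: patterns are shell-style wildcards matched against the
    whole file name, case-sensitively.
    """
    return any(fnmatchcase(filename, pattern) for pattern in patterns)


if __name__ == "__main__":
    patterns = ["foo*.*", "foo*bar", "*baz*"]
    for name in ["foo_app.log", "foobar", "my_baz_file", "access.log"]:
        print(name, is_excluded(name, patterns))
```

Under this assumption the first three names are excluded and access.log is kept.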
The following command tells Splunk to browse for:
This is the default crawl.conf file that ships with Splunk.
# Copyright (C) 2005-2008 Splunk Inc. All Rights Reserved. Version 3.0
#
# Crawl Configuration
#
# Set of attribute-values used by crawl.
#
# If attribute, ends in _list, the form is:
#
# attr = val, val, val, etc.
#
# The space after the comma is necessary, so that "," can be used, as in BAD_FILE_PATTERNS's use of "*,v"
# Whitespace is stripped away and comments, such as this, are on lines that start with "#"
#
[default]
logging = warn
[crawlers]
crawlers_list = file_crawler
[file_crawler]
# SEMICOLON SEPARATED LIST OF DIRECTORY LOCATIONS TO START FROM
root = /;/Library/Logs
# DIRECTORIES TO SKIP ALL TOGETHER. Consider "root" and "home"
bad_directories_list = bin, sbin, boot, mnt, proc, tmp, temp, dev, initrd, help, driver, drivers, share, bak, old, lib, include, doc, docs, man, html, images, tests, js, dtd, org, com, net, class, java, resource, locale, static, testing, src, sys, icons, css, dist, cache, users, system, resources, examples, gdm, manual, spool, lock, kerberos, .thumbnails, libs, old, manuals, splunk, mail, resources, documentation, applications, library, network, automount, mount, cores, lost\+found, fonts, extensions, components, printers, caches, findlogs, music, volumes, libexec,
# EXTENSIONS TO SKIP
bad_extensions_list = 0t, a, adb, ads, ali, am, asa, asm, asp, au, bak, bas, bat, bmp, c, cache, cc, cg, cgi, class, clp, com, conf, config, cpp, cs, css, csv, cxx, dat, doc, dot, dvi, dylib, ec, elc, eps, exe, f, f77, f90, for, ftn, gif, h, hh, hlp, hpp, hqx, hs, htm, html, hxx, icns, ico, ics, in, inc, jar, java, jin, jpeg, jpg, js, jsp, kml, la, lai, lhs, lib, license, lo, m, m4, mcp, mid, mp3, mpg, msf, nib, nsmap, o, obj, odt, ogg, old, ook, opt, os, os2, pal, pbm, pdf, pdf, pem, pgm, php, php3, php4, pl, plex, plist, plo, plx, pm, png, po, pod, ppd, ppm, ppt, prc, presets, ps, psd, psym, py, pyc, pyd, pyw, rast, rb, rc, rde, rdf, rdr, res, rgb, ro, rsrc, s, sgml, sh, shtml, so, soap, sql, ss, stg, strings, tcl, tdt, template, tif, tiff, tk, uue, v, vhd, wsdl, xbm, xlb, xls, xlw, xml, xsd, xsl, xslt, jame, d, ac, properties, pid, del, lock, md5, rpm, pp, deb, iso, vim, lng, list
# IMPLIED "$" (END OF FILENAME) AFTER EACH PATTERN HERE
bad_file_matches_list = *~, *#, *,v, *readme*, *install, (/|^).*, *passwd*, *example*, *makefile, core.*
packed_extensions_list = bz, bz2, tbz, tbz2, Z, gz, tgz, tar, zip
# ADD A DIRECTORY, RATHER THAN INDIVIDUAL FILES, IF IT HAS 1000 OR MORE FILES
collapse_threshold = 1000
# PAIRS OF MAXIMUM AGE AND MINIMUM SIZE.
# default is to accept text/archived files modified in he last 7 days
# with 0k, or modified in the last 30 days if it has at least 1000k
days_sizek_pairs_list = 7-0, 30-1000
# SKIP DIRECTORIES WITH TOO MANY FILES
big_dir_filecount = 10000
# DEFAULT INDEX TO ADD FILES
index = main
# SKIP DIRECTORIES AFTER INVESTIGATING N FILES WITHOUT FINDING SOMETHING WORTHWHILE
max_badfiles_per_dir = 100
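The header comments in this file say that _list values are separated by a comma followed by a space, so that a bare comma can appear inside a value such as *,v. The following sketch shows one way to split such a value along those lines. It is an illustration only, not the parser Splunk uses; the function name split_list is hypothetical, and dropping a trailing comma (as seen at the end of bad_directories_list) is an assumption.

```python
def split_list(value):
    """Split a _list value on comma-plus-space, keeping bare commas intact.

    Assumption: a trailing comma at the very end of the value is ignored.
    """
    cleaned = value.strip()
    if cleaned.endswith(","):
        cleaned = cleaned[:-1]
    return [item.strip() for item in cleaned.split(", ") if item.strip()]


if __name__ == "__main__":
    # The "*,v" pattern survives because its comma is not followed by a space.
    print(split_list("*~, *#, *,v, *readme*"))
    print(split_list("music, volumes, libexec,"))
```

Splitting on a plain comma instead would break *,v into two separate patterns, which is why the space matters.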