This document last updated: 09/12/08 03:09pm

Admin Manual

How Splunk Works

This section provides an overview of how Splunk works as an orientation for new administrators of Splunk deployments. Later sections of this manual provide step-by-step instructions for various configuration and management tasks.

About Splunk

Splunk is an IT Search engine. It is software that indexes any format of IT data from any source in real time, including logs, configurations, scripts, code, messages, traps, alerts, activity reports, stack traces and metrics from all of your applications, servers and devices. Splunk lets you search, navigate, alert and report on all your IT data in real time using an AJAX web interface.

Indexing

Splunk indexes data in real time. It accesses data using a variety of input methods, applies universal processing techniques to handle any format of IT data, and persists the original raw data along with indexes and additional fields added during processing. Data is persisted in one or more Splunk datastores using the file system.

Data access

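Splunk can access data in a variety of ways, including tailing or watching files and directories, listening on network ports (such as syslog over TCP/UDP), and running scripted inputs.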

Read more about configuring Splunk for a variety of inputs.

SplunkBase offers several scripted inputs already developed by Splunk including access to remote databases over DBI, access to Checkpoint and other firewalls via the OPSEC LEA protocol, and operating system diagnostic commands such as vmstat and iostat. Read more about configuring scripted inputs.

Source and host identification

The input processors capture both the raw data and the name of the source and host where the data originated. The data is then placed in a queue for processing. The default host is the host where the data was accessed, but Splunk can be configured to extract a host from a filename or path when it is indexing files that have been copied or mounted from other hosts.

Data processing

All data, regardless of input method, then passes through the same series of processors to enrich and organize the data prior to persisting it in the index. These processors are designed to learn to handle new source types automatically and on-the-fly, so that there is no need to create any type of adapter for particular log formats.

Source classification

First, Splunk classifies each source by comparing its pattern-based signature to sources Splunk has seen before. If the signature is close enough, Splunk will assign the source the same source type as very similar previously seen sources. If the signature is very different, Splunk will assign a new source type name. Source types are used both to enable users to search for all similar sources, and to allow Splunk to remember how to process specific sources in order to be more efficient when the same source is seen again.

Read more about sourcetypes, including how to set a sourcetype for a specific input.

Timestamp recognition

Next, Splunk tries to find timestamps in each line of data. It can automatically recognize timestamps in virtually any format. Lines without timestamps are assigned the timestamps of preceding lines. Sources without timestamps at all use the time of processing. Splunk records a normalized timestamp with each event.

Read more about timestamps, including how to configure Splunk to recognize custom timestamps.
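
The carry-forward rule described above can be sketched in a few lines of Python. This is only an illustration, not Splunk's implementation: it assumes a single ISO-like timestamp format, and the function name assign_timestamps is hypothetical. As a simplification, lines that appear before the first timestamp in a source also receive the processing time.

```python
import re
from datetime import datetime

# Assumption: only one timestamp layout is recognized in this sketch.
TS = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def assign_timestamps(lines, processing_time):
    """Pair each line with a timestamp, carrying the last one seen forward."""
    current = processing_time  # used until the first timestamp is found
    stamped = []
    for line in lines:
        match = TS.search(line)
        if match:
            current = datetime.strptime(match.group(), "%Y-%m-%d %H:%M:%S")
        stamped.append((current, line))
    return stamped
```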

Multiline handling

Splunk then looks for repeating patterns of lines within each source to determine whether the source is single-line (each event is one line) or multiline. If the source is multiline, Splunk will find the likely event boundaries and treat all lines between two boundaries as a single event.
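
As a rough illustration of event boundaries, the Python sketch below groups lines into events. Splunk learns the boundaries from the data; this sketch instead hard-codes one assumed rule (a line that begins with a date starts a new event), and the name group_events is hypothetical.

```python
import re

# Assumption: a new event always begins with a YYYY-MM-DD date.
STARTS_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2} ")

def group_events(lines):
    """Join continuation lines (such as stack trace frames) to the event above."""
    events = []
    for line in lines:
        if STARTS_EVENT.match(line) or not events:
            events.append([line])       # boundary: start a new event
        else:
            events[-1].append(line)     # continuation of the current event
    return ["\n".join(event) for event in events]
```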

Punctuation and other attributes

Splunk extracts the event's punctuation pattern, to enable users to later find all events of similar structure. It also captures other attributes of the event including minute, hour, day of week, day of month, linecount and many other characteristics that will then be indexed and searchable.
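
To give a feel for what a punctuation pattern is, the Python sketch below removes letters and digits from an event and shows each remaining space as an underscore. It is an approximation for illustration only; the exact rules Splunk applies are not described here, and the function name punct_pattern is hypothetical.

```python
import re

def punct_pattern(event):
    """Drop letters and digits, then show each remaining space as '_'."""
    return re.sub(r"[A-Za-z0-9]", "", event).replace(" ", "_")

# Events with the same structure share a pattern even when their values differ.
print(punct_pattern("user=bob status=200"))
print(punct_pattern("user=alice status=404"))
```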

Event type discovery

Splunk watches all events as they are being indexed. For each event, it mines combinations of terms and punctuation patterns that are likely to be effective classifications for helping users find like events and break down large search result sets. For example, Splunk will likely discover that the punctuation pattern for an event from a specific firewall, plus the word "deny," makes one good event type; and that the same punctuation pattern plus the word "accept" makes another. You can also set your own rules for event type recognition.

Segmentation

The final stage prior to indexing is to segment the event into searchable terms based on breaking characters. Segmentation can be configured depending on your data.
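
Segmentation on breaking characters can be pictured with the Python sketch below. The set of breaking characters is an assumption chosen for the example, not Splunk's default set, and the function name segment is hypothetical.

```python
import re

# Assumption: an illustrative set of breaking characters.
BREAKERS = re.compile(r'[\s,;:=\[\]()"]+')

def segment(event):
    """Split an event into searchable terms, dropping empty pieces."""
    return [term for term in BREAKERS.split(event) if term]

print(segment('user=bob action="login" src=10.0.0.1'))
```

Because the period is not in the assumed breaker set, an IP address stays together as one term. Choosing more or fewer breakers trades index size against search precision.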

Custom processing

All of the preceding processing steps are applied to events automatically, based on default settings and learning that work for virtually any dataset. Additionally, Splunk can be configured for custom processing, as well as to disable or change the behavior of most default processing.

Any custom processing properties can be set to apply to particular hosts, sources or source types.

Host extraction

For data sources that centralize events from many different hosts or devices prior to being picked up by Splunk, the host where Splunk accessed the event may not be the true originating host. In that case, you can configure Splunk to extract a host value based on a pattern match in each event and use that to override the value of the host field prior to indexing. Splunk is preconfigured to do this for the syslog sourcetype.
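
The idea of overriding the host from a pattern match can be sketched in Python as follows. The regular expression assumes the classic syslog layout "Mon DD HH:MM:SS host message"; it is not the pattern Splunk ships with, and the name event_host is hypothetical.

```python
import re

# Assumption: classic syslog layout, e.g. "Sep 12 15:09:01 fw1 sshd[42]: ..."
SYSLOG_HOST = re.compile(r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} (?P<host>\S+) ")

def event_host(event, accessed_on):
    """Use the host named in the event, else the host where it was accessed."""
    match = SYSLOG_HOST.match(event)
    return match.group("host") if match else accessed_on
```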

Meta-events

Splunk lets you navigate and summarize related events by time and keywords on the fly. However, sometimes it's convenient to index all events of a transaction as a single searchable event. In that case, you can configure Splunk to create meta-events at index time based on timing and shared transaction ids.

Timezone offsets

Splunk will automatically normalize timestamps when there is a timezone in the timestamp of the event itself. If you are indexing a directory that has a mix of events forwarded outside of Splunk from a mix of hosts with different timezones, you can configure timezone offsets to be applied on a per host basis.
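
A per-host offset amounts to simple arithmetic, sketched below in Python. The host names and offsets are hypothetical, and the sketch ignores daylight saving time.

```python
from datetime import datetime, timedelta

# Hypothetical hosts and their offsets from UTC, in hours.
HOST_UTC_OFFSET_HOURS = {"nyc-web01": -5, "lon-web01": 0}

def to_utc(local_time, host):
    """Normalize a host-local timestamp to UTC using the host's offset."""
    offset = timedelta(hours=HOST_UTC_OFFSET_HOURS.get(host, 0))
    return local_time - offset  # local = UTC + offset, so UTC = local - offset
```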

Custom indexed fields

Splunk indexes all of the terms in the entire event so you don't need to define any fields in order to be able to do freeform searches. However, if you want to be able to search for the value of a particular field by name and avoid false matches of the same string appearing elsewhere in events, you can match and index custom fields. For example, you may define a pattern to differentiate sourceip and destip values.

Custom indexed fields are also useful to implement granular access controls.
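
The sourceip and destip example can be illustrated with a short Python sketch. The event layout ("src=<ip> dst=<ip>") is an assumption, and the sketch shows only the pattern-matching idea, not how Splunk is configured to do it.

```python
import re

# Assumption: firewall-style events of the form "... src=<ip> dst=<ip> ..."
IP = r"\d+(?:\.\d+){3}"
PATTERN = re.compile(r"src=(?P<sourceip>" + IP + r")\s+dst=(?P<destip>" + IP + r")")

def extract_fields(event):
    """Return named fields so that the same IP can be told apart by role."""
    match = PATTERN.search(event)
    return match.groupdict() if match else {}
```

With named fields, a search for destip 10.0.0.1 would no longer match events in which 10.0.0.1 appears only as the source.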

Filtering & masking

You may have some events that are not useful. You can configure Splunk to match and discard events based on patterns. You may also have elements within your events that are confidential or private and not necessary to Splunk users, such as social security numbers. You can configure Splunk to recognize and replace patterns within events.
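
Discarding and masking by pattern can be sketched in Python as below. Both patterns are assumptions for the example (treating DEBUG lines as not useful, and a US social security number shape), and the function name filter_and_mask is hypothetical.

```python
import re

DISCARD = re.compile(r"\bDEBUG\b")           # hypothetical "not useful" pattern
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # shape of a US social security number

def filter_and_mask(events):
    """Drop events matching DISCARD and mask SSN-shaped values in the rest."""
    kept = []
    for event in events:
        if DISCARD.search(event):
            continue
        kept.append(SSN.sub("XXX-XX-XXXX", event))
    return kept
```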

Overriding automated processing

While Splunk's default behavior provides appropriate processing for most kinds of data, you may want to override this behavior to correct mistakes or in order to get faster indexing or more effective search for a particular dataset. For example, you may override the boundaries that Splunk determines for multiline events, or you may configure Splunk to index less densely to save space and increase indexing performance.

Data persistence

Once events are processed, Splunk builds a dense index on all of the terms in the raw events plus additional fields created during processing, including timestamp, host, source, sourcetype and punct (short for punctuation). These indexes are stored in the filesystem along with compressed files containing the original, raw data. Policies control data retention, archiving and retirement based on age or disk utilization. The filesystem-based datastore is compatible with write-once, read-many storage solutions as well as local disk and SAN/NAS storage.

User interaction

Search, navigation & reporting

Users can perform freeform searches with booleans and wildcards using both the SplunkWeb and command line (CLI) interfaces. They can also summarize and analyze search results by piping them through advanced search language operators and by using SplunkWeb's interactive filtering, reporting and charting features.

Scheduling & alerting

Any search or report can be scheduled, and rules can be defined to alert based on the number and content of scheduled search results. Results and alerts can be sent via email, RSS, SMS, SNMP and syslog. Administrators can extend alerting options with custom alerting scripts.

Knowledge capture & sharing

Splunk is designed to capture the knowledge of its users on-the-fly and facilitate pooling of knowledge among different users. All of Splunk's knowledge capture features are designed to work at search time, and enrich already indexed data.

Saving searches, alerts & reports

Users can save any search or report and define alerts based on any saved search. They can choose to keep saved searches and reports to themselves, or share them with all other users.

Dashboards

Users can pull any saved search or report as well as other data onto an unlimited number of custom dashboards. Administrators can define starting dashboards for each user role.

Event types

Users can rename, tag or delete auto-discovered event types in order to make them more descriptive of their own data. They can also save any search as an event type definition. These changes reclassify already indexed data on-the-fly to make events more informative, and let users easily find all events sharing common characteristics.

Event types are shared amongst all users of a given Splunk server, so novices naturally benefit from the contributions of experts in particular datasources.

Different event typing and tagging schemes can co-exist harmoniously, so one team-member can classify events from a security perspective while another classifies the same events differently from an operational perspective.

Source type aliasing

Automatically assigned source type names can be aliased in the search interface to more friendly or accurate names. Aliases can also mask excessively granular source typing for users.

Host tagging

Hosts can be tagged with one or more words describing their function or type, enabling users to easily search for all activity on a group of similar servers.

Custom extracted fields

Splunk automatically extracts and enables users to search and filter on fields that are obviously discoverable in events. These extracted fields are found entirely at search time, so they don't require accurate parsing into a brittle schema. Administrators can extend this capability with custom extracted fields based on patterns they define. These extracted fields are applied at search time, so data doesn't need to be re-indexed to benefit from new field definitions.
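
Search-time extraction can be pictured as running a pattern over events that are already stored. The Python sketch below assumes that "obviously discoverable" means key=value pairs, which is only one plausible reading; the name discover_fields is hypothetical.

```python
import re

# Assumption: fields appear in events as key=value pairs.
KEY_VALUE = re.compile(r"(\w+)=(\S+)")

def discover_fields(event):
    """Find key=value pairs in a stored event at search time."""
    return dict(KEY_VALUE.findall(event))

print(discover_fields("user=bob status=200 bytes=512"))
```

Because the pattern runs when the search runs, changing it changes the results for old and new events alike, with no re-indexing.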

Users and access controls

Splunk provides its own native authentication as well as integration with LDAP servers including Active Directory, OpenLDAP and Novell eDirectory. By default it has three standard roles - user, power and admin - and all users can search all data in a given Splunk server. Splunk 3.0 includes granular access controls, where different LDAP roles can be mapped to subsets of data based on values in indexed fields.

Distributed deployments

Splunk has been designed to support a variety of distributed deployment models to enable real-time centralization of data across hundreds or thousands of source hosts, applications and devices as well as to scale linearly and provide a centralized view across different data silos. There are two primary ways in which different Splunk servers communicate.

Data distribution

Splunk servers can route data to one another (as well as to other systems) in real time during indexing, so that data can be accessed on one host and sent to another host for indexing and search. Splunk servers can route data to load-balanced groups of other Splunk servers, to enable horizontal scaling via clustered indexing. Splunk servers can also clone data to multiple load-balanced groups of other Splunk servers to provide for data redundancy in high availability environments.
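
The combination of load balancing and cloning can be sketched in Python as follows. The sketch assumes simple round-robin rotation inside each group, which the text above does not specify, and all server names are hypothetical.

```python
from itertools import cycle

def make_router(groups):
    """groups: list of load-balanced groups, each a list of server names.

    Each event is cloned to every group; within a group, servers take turns.
    """
    rotations = [cycle(group) for group in groups]

    def route(event):
        # One destination per group: the next server in that group's rotation.
        return [next(rotation) for rotation in rotations]

    return route

# Two groups: the first balances across two servers, the second has one.
route = make_router([["idx-a1", "idx-a2"], ["idx-b1"]])
```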

Distributed search

Splunk servers can be configured to distribute search requests to other Splunk servers and merge the results back to the user. This combines with load-balanced indexing to provide horizontal scaling to search and index hundreds of gigabytes or terabytes per day.
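
Merging results from several servers can be illustrated with a small Python sketch. It assumes each server returns its results as (timestamp, event) pairs already sorted newest first; the function name merge_results is hypothetical.

```python
import heapq

def merge_results(*per_server_results):
    """Merge lists of (timestamp, event) pairs, each sorted newest first."""
    return list(heapq.merge(*per_server_results, key=lambda r: r[0], reverse=True))
```

Because each input is already sorted, the merge never has to re-sort the combined result set, which matters when the per-server result lists are large.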

Distributed search also provides a good way to let select users correlate data across different data silos, even loosely coupling different deployments of Splunk that originated in different teams in the same organization.

Configuration

Splunk uses a modular, file-based configuration model that provides for easy upgrades, migration of configuration amongst different Splunk servers, and sharing of solutions between different deployments. It is designed to work well with 3rd party configuration management tools, while also offering its own robust central configuration management capabilities.

Configuration files

Splunk stores all of its configurations in files consisting of simple attribute-value pairs. This includes saved searches, alerts, reports and event types created by users, input and indexing configurations, distributed search configurations, data cloning and routing configurations, SplunkWeb interface settings and user preferences.

Bundles

All of these configuration files exist within directories that are known as bundles. The configurations in every bundle are merged to build the overall configuration for a given Splunk server. Bundles can be migrated between installations, and can update default configurations non-destructively.
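
The merging of bundles can be pictured with the Python sketch below. The precedence rule (later bundles override earlier ones) is an assumption, since the order is not stated here, and the attribute names are borrowed from the CLI settings in this manual purely for illustration.

```python
def merge_bundles(bundles):
    """bundles: list of {attribute: value} dicts, lowest precedence first."""
    merged = {}
    for bundle in bundles:
        merged.update(bundle)  # assumption: later bundles win on conflicts
    return merged

default = {"web-port": "8000", "minfreemb": "2000"}
local = {"web-port": "9000"}
print(merge_bundles([default, local]))
```

The default bundle itself is never modified, which is what makes a non-destructive update possible.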

Deployment server

In a distributed deployment, any Splunk server can be configured as a deployment server and can control the configuration of all other Splunk servers. Splunk servers that act as deployment clients can poll their deployment server periodically for new configuration, or IP multicast can be used. When new configuration is detected, they can pull the configuration from the deployment server. Servers with common configurations can be controlled as a single class, such as Splunk servers that are accessing the same files across hundreds of identical application server hosts.

Configuration interfaces

The SplunkWeb interface includes an admin section for basic configuration of data inputs, distributed search, data routing, users and saved searches. The Splunk CLI provides command-based configuration with additional capabilities to control deployment servers and clients as well as index management.

Splunk 3.0 introduces a new, more powerful way of viewing and defining configuration details by enabling you to search your configuration files from the same search interface as you search indexed data. SplunkWeb allows administrators to edit configurations inline with search results. This approach is currently limited to select configuration elements but will be expanded in future releases to become the primary means of configuring Splunk.

SplunkBase

SplunkBase is a hosted service that is designed to be used by all members of the Splunk community in conjunction with their Splunk software to enable them to share and find solutions and pool their knowledge of IT data.

SplunkBase integrates with the Splunk software in the following ways:

Developer APIs and integration features

Splunk's software is designed to be extensible by users and easy to integrate with other tools in the environment. A browser toolbar and permalink URLs let you launch Splunk from within other interfaces with contextually relevant searches. Search and management APIs let developers build their own interface or integrate Splunk searches into existing interfaces. A modules API lets developers extend Splunk's processing with custom C/C++ processors. The SplunkWeb interface is designed to be easily skinnable, customizable and embeddable. The Developer Guide provides details on using developer APIs and integration features.

Licensing

Splunk is a single software download that can run with different license levels. You can use Splunk for up to 7 days without a license. After that you can add either a free or enterprise license. Enterprise licenses are sold on a perpetual basis. We also offer 30-day trial Enterprise licenses.

Some features are limited to the enterprise license, including access controls, distributed search, receiving data routed from other Splunk servers and deployment server.

Licenses also include limits on the volume that can be indexed daily. When licenses expire, or if there are an excessive number of license volume violations, searches will be blocked and an error message will appear. However, we won't stop indexing your data, as we don't want you to lose critical data due to an unanticipated change in your environment. You can resume searching once the indexing volumes have returned within your volume limits or you apply a new license with a higher volume level or new expiration date.

Getting Started

About the getting started section

The getting started section is designed for those who have just downloaded Splunk for the first time, and want to install and use it quickly.

After perusing this section, you will be able to install and configure a single server. Once you know the basics, read on through the rest of the Admin Manual to learn more about how Splunk works, how to configure Splunk for any type of input and how to deploy Splunk across dozens, hundreds or even thousands of servers.

Installation

Installation of Splunk is easy, but do read the instructions in the Installation Manual. Make sure you are installing the correct version of Splunk and that you are installing on a supported filesystem.

Administration Basics

The $SPLUNK_HOME variable refers to the top level directory of your installation. By default, this is /opt/splunk/.

Add Splunk to your shell path

To save a lot of typing, set a SPLUNK_HOME environment variable and add $SPLUNK_HOME/bin to your shell's path. The example below works for bash users who accepted the default installation location. Use the correct syntax and path for your own installation.

# export SPLUNK_HOME=/opt/splunk
# export PATH=$SPLUNK_HOME/bin:$PATH

Splunk's CLI

Splunk's command line interface is located in $SPLUNK_HOME/bin/. If you have exported the path and environment variables (as explained above), you can use the splunk command as follows:

# splunk [action] [object] [-parameter value] ....

If you haven't set an environment variable, navigate to $SPLUNK_HOME/bin/ and run commands as follows:

# ./splunk [action] [object] [-parameter value] ....

For general help, type:

# splunk help

For a list of commands and options, type:

# splunk help commands

When using Splunk with an Enterprise license, administration commands must be authenticated with a username and password.

To authenticate for an entire session, type:

# splunk login

You will be prompted for a Splunk username and password. This is the same username and password you use to log into the SplunkWeb interface. By default, the login is set to admin and the password is changeme.

You can log out at any time by typing:

# splunk logout

To authenticate a single command, use the -auth parameter:

# splunk search foo -auth username:password

Please note: the -auth string must be the last term in the CLI command.

Start/stop Splunk, check status

Ensure that you have added Splunk to your server host's path (as explained above, in "Add Splunk to your shell path"). Otherwise you must navigate to $SPLUNK_HOME/bin/ and use the ./splunk command.

Start the Server

From a shell prompt on the server host, run this command:

# splunk start

You can also restart the server by running:

# splunk restart

Stop the Server

To shut down the Splunk Server, run this command:

# splunk stop

Check if Splunk is running

To check if Splunk is running, type this command at the shell prompt on the server host:

# splunk status

You should see this output:

splunkd is running (PID: 3162).
splunk helpers are running (PIDs: 3164).
splunkweb is running (PID: 3216).

Or you can use ps to check for running Splunk processes:

# ps aux | grep splunk | grep -v grep

Solaris users, type -ef instead of aux:

# ps -ef | grep splunk | grep -v grep

Help

Help is available in several forms.

Help Options

Change defaults

Changing the admin default password

Splunk with an Enterprise license has a default administration account and password. It is highly recommended that you change the default. You can do this via the Splunk CLI or SplunkWeb interface.

Please note: CLI commands assume you have set a Splunk environment variable. If you have not, you must navigate to $SPLUNK_HOME/bin and run the ./splunk command.

via SplunkWeb

http://www.splunk.com/assets/doc-images/30_admin1_changedefaults/adminbutton.jpg

http://www.splunk.com/assets/doc-images/30_admin1_changedefaults/users.jpg

via Splunk CLI

The Splunk CLI command is:

# splunk edit user

Please note: you must authenticate with the existing password before it can be changed. You can do this by logging into Splunk via the CLI or using the -auth parameter.

For example:

# splunk edit user admin -password foo -auth admin:changeme

This command changes the admin password from changeme to foo.

Changing network ports

Splunk uses two ports: 8000 for SplunkWeb and 8089 for splunkd.

These are the default settings; your installation may be configured differently.

via SplunkWeb

http://www.splunk.com/assets/doc-images/30_admin1_changedefaults/adminbutton.jpg

http://www.splunk.com/assets/doc-images/30_admin1_changedefaults/ports.jpg

via Splunk CLI

To change the port settings via the Splunk CLI, use the CLI command set.

# splunk set web-port 9000

This command sets the SplunkWeb port to 9000.

# splunk set splunkd-port 9089

This command sets the splunkd port to 9089.

Changing the default Splunk server name

The Splunk server name setting controls both the name displayed within the SplunkWeb interface and the name sent to other Splunk Servers in a distributed setting.

The default name is taken from either the DNS name or the IP address of the Splunk Server host.

via SplunkWeb

http://www.splunk.com/assets/doc-images/30_admin1_changedefaults/adminbutton.jpg

http://www.splunk.com/assets/doc-images/30_admin1_changedefaults/ports.jpg

via Splunk CLI

To change the server name via the CLI, type the following:

# splunk set servername foo

This command will set the servername to foo.

Changing the datastore location

The datastore is the top-level directory where the Splunk Server stores all indexed data, user accounts, and working files.

Please note: If you change this directory, the server won't migrate old datastore files. It will start over again at the new location.

To migrate your data to another directory follow the instructions in Move an index.

via SplunkWeb

http://www.splunk.com/assets/doc-images/30_admin1_changedefaults/adminbutton.jpg

http://www.splunk.com/assets/doc-images/30_admin1_changedefaults/datastore.jpg

via Splunk CLI

To change the datastore location via the CLI, type the following:

# splunk set datastore-dir /var/splunk/

This command will set the datastore directory to /var/splunk/.

Set minimum free disk space

The minimum free disk space setting controls how low disk space in the datastore location can fall before Splunk stops indexing.

Splunk will resume indexing when more space becomes available. For detailed information on how to manage Splunk server disk usage, see Disk usage.
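
The check behind this setting can be sketched in Python as below. This is an illustration of the comparison only, not Splunk's code; the function name should_pause_indexing is hypothetical, and the threshold is in megabytes to match the CLI setting.

```python
import shutil

def should_pause_indexing(datastore_dir, min_free_mb):
    """True when free space on the datastore's filesystem is below the minimum."""
    free_mb = shutil.disk_usage(datastore_dir).free / (1024 * 1024)
    return free_mb < min_free_mb

# Check the current directory against a 2000 MB minimum.
print(should_pause_indexing(".", 2000))
```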

via SplunkWeb

http://www.splunk.com/assets/doc-images/30_admin1_changedefaults/adminbutton.jpg

http://www.splunk.com/assets/doc-images/30_admin1_changedefaults/datastore.jpg

via Splunk CLI

To change the minimum free disk space via the CLI, type the following:

# splunk set minfreemb 2000

This command will set the minimum free space to 2000 MB.

Start Splunk for the first time

Start the server

Splunk's command line interface is located in $SPLUNK_HOME/bin/. Navigate to this location and run the following command:

# ./splunk start

Use whatever path you installed under.

The first time you start Splunk after a new installation, you will be presented with the license agreement and asked to accept it. To bypass this prompt, you can start Splunk and accept the license in one step:

# ./splunk start --accept-license

Please note: there are two dashes before the accept-license option.

Load SplunkWeb in your browser

Navigate to:

http://mysplunkhost:8000

Use whatever host and port you chose during installation.

The first time you log in to Splunk with an Enterprise license, use username admin and password changeme. Splunk with a free license does not have access controls.

Find and index data

Add Data

Adding data inputs to Splunk is easy and there are many methods for getting your data into Splunk. You can add data via SplunkWeb, Splunk's CLI, scripts and 3rd party software. To get started quickly and simply, you can immediately begin tailing a file in a log file directory, such as /var/log.

When you first access SplunkWeb, there is a helpful action link to begin tailing /var/log locally:

http://www.splunk.com/assets/doc-images/30_admin1_findindex/largergettingstarted.jpg

There are many other ways to set up data inputs into Splunk. This section is a high-level description of these techniques. For more detailed methods, see the data inputs section.

Tail a file

When you specify a file to tail, Splunk will process the entire file and then watch and process additions to the file. When you give a directory name to process, Splunk recursively searches all subdirectories looking for files resembling log files. You can explicitly include or exclude files with whitelisting and blacklisting.

Tailing files via the SplunkWeb Admin Interface

Tailing files via the CLI

Use the splunk add command. These commands assume you have set a Splunk environment variable. If you have not, you must navigate to $SPLUNK_HOME/bin and run the ./splunk command.

For example:

# splunk add tail /var/log/

This command tails all files in /var/log/.

Find logfiles

Splunk has a built-in command to search for potential log files to index. From the Splunk CLI, type:

# splunk find logs "searchpath1;searchpath2;..."

This command will search for logs in any number of searchpaths.

The directories and file types that are ignored, as well as the file size and modification date restrictions, are defined in $SPLUNK_HOME/etc/findlogs.ini.

After logs are found, you will have the option of indexing some, all, or none of the files. If you answer Some, you will be prompted file-by-file.

Get help configuring your sources

New data sources can crop up daily. If you do not see the means to configure a data source to send the data you want to Splunk, be sure to look on SplunkBase or contact Splunk Support.

Add more users

There are three user roles and two different authentication models to choose from when you set up Splunk with an Enterprise license. Users are authenticated using the Splunk server or LDAP.

You must be logged in as a Splunk administrator to add or edit user accounts. The default Admin account password is changeme.

Please note: Splunk with a free license does not enable access control features.

Lost admin password

Should you lose the password to the sole admin account for your installation, contact Splunk Support for assistance in restoring it. For security reasons there is no simple hack to get around a lost password.

User roles

Splunk local users

As a Splunk Admin, you can create new users either via SplunkWeb or Splunk's CLI.

via SplunkWeb

http://www.splunk.com/assets/doc-images/30_admin1_addusers/adminbutton.jpg

http://www.splunk.com/assets/doc-images/30_admin1_addusers/users.jpg

via Splunk CLI

From the CLI, you can use the following commands to add, edit, remove or list users.

add user username [-parameter value] ...
edit user username [-parameter value]  ...
remove user username [-parameter value]  ...
list user username [-parameter value]  ... 

Required (Default) Parameter:

username -- the name of the Splunk user account to manage.
full-name -- real name of user in quotes, for example "Nikola Tesla" - required when adding a new user.

Optional Parameters:

full-name -- real name of user in quotes, for example "Nikola Tesla"
password -- the password to set for the account
role -- either user, power or admin

Example:

This example assumes you have set a Splunk environment variable. If you have not, you must navigate to $SPLUNK_HOME/bin and run the ./splunk command.

# splunk edit user newbie -password 'f8h2.$R' -auth admin:d3cidr

This example authenticates as user "admin" to change the password for user "newbie."

Please note: You must be logged in as an Admin to make any changes regarding users. You can either log in via the splunk login command, or you can use -auth, as shown above.

LDAP

User authentication can be managed through LDAP. For the details of the Splunk LDAP integration, see LDAP Authentication.

Configuring a Splunk Deployment

Choose a Deployment Model

Splunk can be deployed on a single server or multiple servers, depending on your needs. You can install Splunk on a dedicated server or on a server running other applications -- Splunk does not require special hardware. In a multi-server deployment, Splunk can be configured to support redundancy, data balancing, distributed searching and data access control. This section defines the Splunk deployment terms and presents general deployment guidelines. Continue reading the deployment section for more specifics on these various configurations.

Single server

A single server installation refers to any network that contains only one running instance of Splunk. That instance can receive data over network ports (including syslog over TCP/UDP), and tail or watch locally available files and directories. Mounted fileshares can provide for access to files from other servers.

Distributed input

Splunk can be installed on each production server that is generating logs and other data that you want to index. These light-weight Splunk servers can be configured to forward their data to a central, dedicated Splunk server for indexing. The forwarding servers can also index a copy locally although that adds overhead and is rarely necessary.

Distributed indexing

Your production environment may generate more data than one Splunk server can index. In this case, you may want to distribute indexing across multiple Splunk servers. For example, if you have a terabyte of data to index per day, you can distribute this data across 10 servers, each indexing 100 gigabytes per day.

You can distribute indexing amongst multiple servers in one of two ways. Either different subsets of Splunk servers can forward their data to different Splunk servers for indexing, or each Splunk server can balance its forwarding among the entire cluster of central indexing servers.

Data redundancy

You may want to index multiple copies of your events to ensure high availability of your data. In this case, each Splunk server that is capturing data clones its data to multiple Splunk servers for indexing. For more complete data redundancy, other network-based logging sources can also be configured to direct their traffic to multiple Splunk servers for indexing.

Partitioning

You may want to partition your data into separate data stores because you have varying policies or uses for different subsets of your data. Splunk allows you to create conditional rules for routing data from one Splunk server to multiple Splunk servers, or to different indexes on a single server.

Segregated data access

You can distribute events over multiple servers in order to limit data access to specific servers or classes of users. With conditional routing, events can be written to indexers based on the host, source or source type, or by applying regular expressions to events. For example, application log data can be sent to a server (or series of servers) and network events sent to another.
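
Conditional routing can be sketched in Python as a list of rules checked in order. The rule format, field names and server names below are all hypothetical; the sketch shows only the decision logic, not Splunk's configuration syntax.

```python
import re

# Hypothetical rules: (field to test, pattern, destination), checked in order.
RULES = [
    ("sourcetype", re.compile(r"^app_"), "app-indexer"),
    ("raw", re.compile(r"\b(deny|accept)\b"), "network-indexer"),
]

def destination(event, default="main-indexer"):
    """Return the first destination whose rule matches the event."""
    for field, pattern, target in RULES:
        if pattern.search(event.get(field, "")):
            return target
    return default
```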

With granular access controls, a Splunk administrator can restrict access to indexers. Those users supporting specific applications can be granted access only to the indexers containing relevant data, while the Splunk senior administrator can maintain control over all servers.

Variable retention policy

You may wish to set different data retention policies depending on your sources, sourcetypes or hosts. Routing allows you to send your data to specific Splunk servers which may then apply different rules for data retirement.

Forwarding to non-Splunk systems

You may have a requirement to send data from Splunk to a non-Splunk system, for example, a Managed Security Service Provider (MSSP). In such an environment, Splunk may be used to log all relevant data for troubleshooting, reporting, compliance, or security investigations. Splunk then sends data to the third-party system via network ports.

LDAP integration

You can use LDAP to authenticate users and authorize their access to Splunk. This greatly simplifies the management of users in a multi-server Splunk environment.

Decide how Splunk should index your data

Now that you have chosen a deployment model, you should consider how you want Splunk to index your data. You have many options for increasing throughput and changing data handling. Pick the implementations that work best for you from the following list.

Create additional indexed fields

Splunk automatically adds fields such as host::, source::, sourcetype:: and timestamps. Splunk also indexes all of the terms in the entire event so you don't need to define any fields in order to do free form searches. However, if you want to be able to search for the value of a particular field by name, you can create and index custom fields. For example, you may define a pattern to differentiate sourceip and destip values.

Note: Consider whether you want to use search fields or extracted fields for your purpose. Extracted fields reduce indexing time and storage and also provide more flexibility to adapt without re-indexing your data. Search fields are useful for implementing granular access controls. If you decide to go the extracted field route, you don't need to worry about fields in your initial setup as you can add the rules for extraction at any time.

Tweak default processing

When Splunk indexes a data source, it automatically breaks the input into distinct events and extracts a host and timestamp for the event. The event boundaries, host, and timestamps are important for analysis. If Splunk is not setting the event boundaries or extracting timestamps and hosts correctly, you can easily modify these settings. See timestamp recognition, how host is assigned, and how events are recognized for more information.

Mask sensitive data

Your logs may contain sensitive personal data. For example, there may be social security numbers or passwords in your data that you may wish to cover up. You can create event configuration that masks sensitive data as it is being processed on input.
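
The idea behind masking can be illustrated with a short Python sketch that rewrites anything shaped like a U.S. social security number before the event is stored. The pattern and the choice to keep the last four digits are assumptions made for this example. Splunk applies masking through its own configuration, not through Python.

```python
import re

# Matches NNN-NN-NNNN and captures the last four digits.
SSN = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")


def mask_ssn(event):
    """Replace the first five digits of every SSN-shaped token with X."""
    return SSN.sub(r"XXX-XX-\1", event)


masked = mask_ssn("user=bob ssn=123-45-6789 action=login")
# masked == "user=bob ssn=XXX-XX-6789 action=login"
```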

Change indexing density

When Splunk indexes data, it segments events via major and minor breakers. To save storage space on the indexer, you can change Splunk's default segmentation settings. For example, web proxy logs may contain lengthy URLs that Splunk breaks into many different minor segments. You may wish to change this setting to eliminate unnecessary overhead.
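
To see why long URLs inflate the index, consider this simplified Python model of segmentation. The breaker character sets below are assumptions chosen for the example. They are not Splunk's defaults.

```python
import re

MAJOR = re.compile(r"[\s,;]+")      # assumed major breakers
MINOR = re.compile(r"[/:=.?&\-]+")  # assumed minor breakers


def segments(event, use_minor=True):
    """Return the set of distinct terms that would be indexed for one event."""
    terms = set()
    for major in MAJOR.split(event):
        if not major:
            continue
        terms.add(major)
        if use_minor:
            terms.update(piece for piece in MINOR.split(major) if piece)
    return terms


event = "GET http://example.com/a/b/index.html?id=42 200"
with_minor = segments(event)                      # 12 distinct terms
without_minor = segments(event, use_minor=False)  # 3 distinct terms
```

In this model a single URL accounts for nine of the twelve terms, which is the overhead that a coarser segmentation setting removes. The trade-off is that you can no longer search for pieces of the URL on their own.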

Eliminate processing steps

Certain processing steps can be eliminated to provide faster indexing and better throughput. For example, if you don't need Splunk to search for timestamps within events, you can turn off timestamp extraction. You can also tune down or eliminate event type auto-discovery.

Filter and route data

You may find that you wish to handle various data sources differently prior to indexing. For example, you may want to eliminate any data that is not useful, or you may wish to route certain data to specific indexes. You can also route a subset of your data to third-party systems. You will want to decide exactly how to treat your data before setting up your deployment.

Please note: Any data that is removed prior to indexing does not count against your daily licensing limit.

Install Splunk on every host

Now that you have designed your deployment environment, you will need to install Splunk on every machine in your network that will be a Splunk host.

Configure the forwarding servers

Before you deploy Splunk on all your servers, create the configuration files on one example server from each forwarding server class. The deployment server will then distribute these files to all of that server's peers. This allows you to validate your environment before you push configurations to every machine.

Define server classes

Managing multiple servers is easier if you break your servers up into logical groups. These groups are called server classes. Categorize your machines into server classes by which types of data they are logging. Here are some sample categories:

Each machine can be in as many server classes as you wish. More granular server classes mean more configuration files to maintain for future updates. It may be helpful to keep a spreadsheet of the configuration files you modify.

Inputs

Configure your data inputs locally on one server in each server class using the step-by-step instructions for input configuration. If you've decided that you need to set a custom host for a specific input, you will configure that at this point as well.

Processing properties

You should have already picked which processing properties to configure while deciding how Splunk should index your data. Here is an extensive list of all the settings you can change for your server classes:

Continue tweaking these settings until your data appears the way you want both locally and on the central indexer.

Please note: You will only need to set up configurations for event processing. Any custom configuration that happens during indexing or search time will be set up on the receiving servers.

Data distribution

This section refers to the design models outlined in Choose a Deployment Model. You will want to figure out which model works best for your topology, and then follow the links below to configure your server classes.

Data policy

You may have decided to set up variable data retention policies for different data. You will want to configure your server classes to forward to servers with matching data retention policies. Use routing to send your data to the correct server.

Authentication

Set up user accounts on each server class. You can set up LDAP, or use Splunk's built-in method. User settings are controlled in auth.conf.

Please note: you must use a consistent authentication method throughout your environment.

Set up the deployment server

You are now ready to configure the deployment server. To learn more about the deployment server, read how the deployment server works. Then, follow these steps for configuring your deployment.

Configure the deployment server

Follow the instructions to configure the deployment server. Use the configuration file deployment.conf to specify the variables for your deployment.

Set up deployment classes

Then, set up each client. Use the configuration file deployment.conf with specific variables for each client, including their server class.

Start Splunking

We are still working on this page. Please check back here later. If you need help on this topic now, contact Splunk Support.

Data Inputs

How input configuration works

Splunk's data inputs are specified via inputs.conf. In most cases, you will not need to modify this file, as the required information is written when you configure data inputs through the Splunk CLI or SplunkWeb. For more granularity, however, you may wish to configure settings via inputs.conf. Changes made via SplunkWeb or the Splunk CLI are stored in $SPLUNK_HOME/etc/bundles/local/inputs.conf.

As Splunk processes files, it assigns default values for sourcetype, hostname, and index for each file. You can override these settings when you define inputs, or you can modify them later.

Data input types

Data inputs have characteristics that are independent of how the inputs are defined. A tail input type behaves the same whether you add it in the Splunk CLI or SplunkWeb. This section describes the purpose, behavior, and rules/restrictions of the Splunk data input types.

The next section describes the mechanics of adding the data inputs via the various Splunk interfaces and how to modify indexing settings.

Files and directories

Data inputs can come from files and directories. Data in files can be processed in live or batch mode. Live input is for active log files and is handled through Splunk's tail processor. Batch input is for closed, archived data, and batch files are handled through the Upload or Watch/Batch processors.

Tailing a file

Splunk's tail behaves like the UNIX tail command. Specify a path to a file or directory whose contents should be indexed by the Splunk server, and Splunk will watch and consume any new input. If subdirectories exist, Splunk will recursively examine them for log files. If new files appear in a tailed directory, Splunk will add them to the index.

Please note: Starting with Splunk 3.0.2, the tail input method can process files like UNIX tail -f. With this option, tail reads from the end of a file and waits for new input, rather than consuming the entire file and then waiting for new input. Set this option in inputs.conf with the followTail attribute. A value of 1 reads from the end of the file. The default is 0, which reads the entire file. This option is ignored if the file has ever been indexed by Splunk.

In addition, when tailing a file for input:

When tailing a directory for input:

Please note: If the specified file or directory does not exist, the Splunk server will not check to see if it is created later. Splunk only checks for files and directories each time the Splunk server starts (or is restarted). So be sure to explicitly add new files as inputs when they become available if you don't want to restart the server. When tailing a file the entire path dir/filename must not exceed 1024 characters.

Batch upload and watch

Splunk has a batch processing module. It watches any specified directory on the local Splunk server's file system and then processes the entirety of any new file that appears. You can also upload archived files directly into Splunk for analysis. If necessary, Splunk will unpack and uncompress a file before indexing. Keep in mind that Splunk will need adequate disk space to uncompress these files, and that this processing can take more time than processing a live or uncompressed file.

By default, Splunk's batch processor is located in $SPLUNK_HOME/var/spool/splunk. You can set up your own watch directory as well.

Please note: This method will not keep watch on the files it has already seen, so it's not designed for live logfiles -- just rotated archive copies.

In addition, when batch uploading or watching, Splunk can:

FIFO queues

A FIFO (AKA named pipe) is a queue of data maintained in a Unix host's memory. It can be accessed like a file and log messages can be written to it. When choosing the FIFO data input method consider the following:

Network ports

UDP and TCP ports can feed data into the Splunk Server. UDP and TCP behave differently, and these behaviors affect how data arrives for processing. When configuring network ports, please keep in mind that you cannot use ports lower than 1024 if you have not installed Splunk as root.

UDP

UDP is a best effort protocol. This means that you might not get messages if the network is clogged, or has a hiccup. You also can't be absolutely sure the messages aren't spoofed or altered in transit. UDP should be reserved for logging implementations focused on day-to-day troubleshooting rather than compliance or security.

Splunk Enterprise can read directly from the network on any UDP port. This technique is most often used to make Splunk act directly as a syslog server by reading remote syslog events on UDP port 514. However, it also can be used for any other UDP source of logging data, including SNMP.

Like all of the network streaming-based approaches, direct UDP input is higher performance than reading files from disk.

TCP

TCP is a reliable, high-performance choice for many situations, as this protocol includes checks to ensure that data has arrived safely and intact. Splunk with an Enterprise license can receive data on any TCP port, allowing Splunk to receive remote data from syslog-ng and alternative syslog implementations that use TCP for security or reliability. This feature is the foundation of Splunk's distributed data access.

Please note: If the sending process buffers data such that events are broken into multiple pieces, Splunk may interpret the parts as multiple events. This is more likely if events are being generated intermittently, as there may be long pauses (several seconds or longer) between blocks of buffered data. If you notice truncated events, try forcing the process to send events atomically.

Scripted inputs

Splunk can be configured to run an arbitrary shell command on any schedule, and then pipe the output to Splunk for processing. Examples of shell scripts that process meaningful data for Splunk to digest include:

See Configure scripted inputs for details on how to set this up.

Indexing properties

A distinguishing characteristic of Splunk is that it can universally process any IT data, regardless of format. It automatically learns event boundaries, classifies events and sources, and finds timestamps. However, sometimes you may want to change or augment Splunk's default processing. This can be done via setting indexing properties in a props.conf file in the $SPLUNK_HOME/etc/bundles/local directory. (Read more about bundles.)

Some attributes within props.conf can be customized by defining new stanzas in other configuration files, most commonly transforms.conf, which defines regex-based rules for extracting fields, correlating events and performing other transformations. Segmenters.conf, outputs.conf and metaevents.conf can also define attribute values that can be referenced by props.conf.

Common use cases for custom indexing properties include:

Configure inputs via SplunkWeb

Follow these instructions to configure data inputs via SplunkWeb. You can also configure data inputs via Splunk's CLI or inputs.conf.

Configuration

http://www.splunk.com/assets/doc-images/30_admin4_configinputsweb/admin.jpg

http://www.splunk.com/assets/doc-images/30_admin4_configinputsweb/inputs.jpg

Files and directories

http://www.splunk.com/assets/doc-images/30_admin4_configinputsweb/tail.jpg

FIFO queues

http://www.splunk.com/assets/doc-images/30_admin4_configinputsweb/fifo.jpg

Network ports

With a Splunk Enterprise license, you can define input from any TCP or UDP port.

To add Network Ports:

http://www.splunk.com/assets/doc-images/30_admin4_configinputsweb/ports.jpg

Configure inputs via the CLI

In addition to using Splunk Web or editing inputs.conf, you can also use Splunk Command Line Interface (CLI) commands to configure data inputs.

To access Splunk's CLI, navigate to the $SPLUNK_HOME/bin/ directory. Use a CLI command by typing $SPLUNK_HOME/bin/splunk [command name].

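Note: Add $SPLUNK_HOME/bin to your shell path to use commands from any directory by typing splunk [command name]. The examples in this section use ./splunk and assume that you are in the $SPLUNK_HOME/bin/ directory.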

Data input commands

Use Splunk CLI data commands to perform actions on data sources. Commands and data sources take various parameters depending on the combination you use. You can use five different commands to configure data inputs in the CLI:

Command   Command syntax                                                Action
add       add [tail|watch|fifo|tcp|udp] source [-parameter value] ...   Add a specified data input to Splunk.
edit      edit [tail|watch|fifo|tcp|udp] source [-parameter value] ...  Edit a data input that was previously added.
remove    remove [tail|watch|fifo|tcp|udp] source                       Remove a previously added data input.
list      list [tail|watch|fifo|tcp|udp]                                List the currently configured data inputs of a specified type.
spool     spool source                                                  Add a file, archive, or directory to your index by reading it once.

Note: Splunk's CLI help pages contain detailed syntax and usage information on all commands, objects, and parameters. Access the main CLI help page by typing: $SPLUNK_HOME/bin/splunk help. Individual commands, objects, and parameters have their own help pages as well. To view one, type: $SPLUNK_HOME/bin/splunk help [command/object/parameter name]

Data input types

You must specify a data input type to use with a data input command.

Data input type   Definition
tail              A file or directory to be continuously monitored for new input to index.
watch             An archive directory to be monitored for new files to index.
fifo              A FIFO or named pipe to index from.
tcp               A TCP socket (network input) to monitor.
udp               A UDP socket (network input) to monitor.

Change the configuration of each data input type by defining the parameters below.

Note: Optional parameters have the syntax: -parameter value.

Note: Use only one -hostname, -hostregex or -hostsegmentnum per command.

tail parameters

Required parameters

source          Path to the file or directory to monitor for new input.

Optional parameters

sourcetype      Specify a sourcetype field value for events from the input source.
index           Specify the destination index for events from the input source.
hostname        Specify a host name to set as the host field value for events from the input source.
hostregex       Specify a regular expression on the source file path to set as the host field value for events from the input source.
hostsegmentnum  Set the number of segments of the source file path to set as the host field value for events from the input source.
active-only     True or False (T | F). Set to True to tell Splunk to keep indexing only files that have write permissions enabled.
follow-only     True or False (T | F). Default is False. When set to True, Splunk reads from the end of the source (like the "tail -f" Unix command).

watch parameters

Required parameters

source          Path to a directory to watch for new input.

Optional parameters

method          Set the method to bring files into Splunk (symlink or copy). Default is symlink.
sourcetype      Specify a sourcetype field value for events from the input source.
index           Specify the destination index for events from the input source.
hostname        Specify a host name to set as the host field value for events from the input source.
hostregex       Specify a regular expression on the source file path to set as the host field value for events from the input source.
hostsegmentnum  Set the number of segments of the source file path to set as the host field value for events from the input source.

fifo parameters

Required parameters

source          Path to a FIFO or named pipe to index.

Optional parameters

sourcetype      Specify a sourcetype field value for events from the input source.
index           Specify the destination index for events from the input source.
hostname        Specify a host name to set as the host field value for events from the input source.
hostregex       Specify a regular expression on the source file path to set as the host field value for events from the input source.
hostsegmentnum  Set the number of segments of the source file path to set as the host field value for events from the input source.

tcp & udp parameters

Required parameters

source          Port number to listen on for data to index.

Optional parameters

sourcetype      Specify a sourcetype field value for events from the input source.
index           Specify the destination index for events from the input source.
hostname        Specify a host name to set as the host field value for events from the input source.
remotehost      Specify an IP address to exclusively accept data from.
resolvehost     True or False (T | F). Default is False. Set to True to use DNS to set the host field value for events from the input source.

Examples

tail

Tail only writable files in /var/log/.

1. Add /var/log/ as a data input.

./splunk add tail /var/log/

2. Edit the input you added to tail only files that are still open for writing.

./splunk edit tail /var/log -active-only true

watch

Watch a directory and set host and sourcetype field values for each event that's indexed.

1. Add a watch to the directory /mnt/archive and set the host field value for events from the source to be the third segment of the file path.

./splunk add watch /mnt/archive -hostsegmentnum 3

2. Edit the input configuration to set the sourcetype field value for each event from the source to equal "myApp".

./splunk edit watch /mnt/archive -sourcetype myApp

fifo

Configure a FIFO input and set the host and sourcetype field values for each event that's indexed.

1. Add the FIFO input /var/run/syslogfifo and set the sourcetype field for each event from the source to equal "linux_messages_syslog".

./splunk add fifo /var/run/syslogfifo -sourcetype linux_messages_syslog

2. Edit the input configuration to set the host field value for all events from the source to equal "web01".

./splunk edit fifo /var/run/syslogfifo -hostname web01

tcp & udp

Configure a network input and set the sourcetype field value for each event that's indexed.

1. Configure a UDP input to watch port 514 and set the sourcetype field value for each event to equal "syslog".

./splunk add udp 514 -sourcetype syslog

2. Set the UDP input to use DNS to resolve the host name and set each event's host value to the resolved host name. You must have root access for ports under 1024. Use the auth parameter to authenticate in line.

./splunk edit udp 514 -resolvehost true -auth gwb:d3c1dr

Scripted inputs

By configuring inputs.conf, Splunk can also accept events from scripts.

Scripted input is useful for command-line tools such as vmstat, iostat, netstat, and top.

Please note: currently, scripted inputs do not get bundled in the deployment server. In the future, Splunk will support this behavior. For now, please use your preferred configuration automation tool to push your script directory to your server classes.

Configuration

Please note: your script must be in the bin/ directory underneath your scripts/ directory.

[script://$SCRIPT] 
interval = X 
index = {main, $YOUR_INDEX}
sourcetype = {iostat, vmstat, etc}  OPTIONAL
source = {iostat, vmstat, etc} OPTIONAL
disabled = false

Variables:

Example

This example shows the use of the UNIX top command as a data input source.

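Create the script directory:
$ mkdir -p $SPLUNK_HOME/etc/bundles/scripts/bin

Create the script $SPLUNK_HOME/etc/bundles/scripts/bin/top.sh with the following contents:
#!/bin/sh
top -bn 1  # Linux only - different OSes have different parameters

Make the script executable, then run it once from the shell to confirm that it works:
$ chmod +x $SPLUNK_HOME/etc/bundles/scripts/bin/top.sh
$ $SPLUNK_HOME/etc/bundles/scripts/bin/top.sh

Add the script entry to inputs.conf (this example assumes Splunk is installed in /opt/splunk):
[script:///opt/splunk/etc/bundles/scripts/bin/top.sh]
interval = 5                # run every 5 seconds
sourcetype = top        # set sourcetype to top
source = script://./bin/top.sh   # set source to name of script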

Please note:

[top]
BREAK_ONLY_BEFORE = GobblyGook
DATETIME_CONFIG = CURRENT

File whitelisting / blacklisting

You can use inputs.conf to specify files to ignore (blacklist) or only consume (whitelist) for any specific source that you are tailing. The match for blacklist and whitelist uses regular expression syntax on the file name.

Please note: For whitelist and blacklist entries, use exact regex syntax. The "..." wildcard is not supported. Whitelist and blacklist configurations must be inside a configuration stanza. Entries outside a stanza are ignored (there are no global entries).

Configuration

Blacklist (ignore) files

Add the following argument=value to your tail input stanza in $SPLUNK_HOME/etc/bundles/local/inputs.conf:

_blacklist = $YOUR_CUSTOM_REGEX

Whitelist (allow) files

Add the following argument=value to your tail input stanza in $SPLUNK_HOME/etc/bundles/local/inputs.conf

_whitelist = $YOUR_CUSTOM_REGEX

Example

[tail:///mnt/logs]
    _whitelist = .*\.log

This example tells Splunk to tail only files with the .log extension.

[tail:///mnt/logs]
    _blacklist = .*\.txt

This example tells Splunk to ignore all files with the .txt extension.

[tail:///mnt/logs]
    _blacklist = \.(txt|gz)$

This example tells Splunk to ignore all files with either .txt or .gz extension.
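
Before committing a pattern to inputs.conf, it can help to try it against a few sample file names. The Python sketch below assumes that the regex may match anywhere in the file name (search semantics) and that a file must pass the whitelist and then not hit the blacklist. Both points are assumptions for illustration, so confirm the real behavior with the verification tool described in the next section.

```python
import re


def tailed(files, whitelist=None, blacklist=None):
    """Return the files that survive the given whitelist/blacklist regexes."""
    kept = []
    for name in files:
        if whitelist and not re.search(whitelist, name):
            continue  # not on the whitelist
        if blacklist and re.search(blacklist, name):
            continue  # explicitly blacklisted
        kept.append(name)
    return kept


files = ["app.log", "app.log.gz", "notes.txt", "data.csv"]

loose = tailed(files, whitelist=r".*\.log")          # ['app.log', 'app.log.gz']
strict = tailed(files, whitelist=r"\.log$")          # ['app.log']
no_txt_gz = tailed(files, blacklist=r"\.(txt|gz)$")  # ['app.log', 'data.csv']
```

Under these assumptions, an unanchored pattern such as .*\.log also accepts app.log.gz. Ending the pattern with $ restricts the match to the final extension.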

Verification tool

To verify that your whitelist and blacklist rules are configured properly you should run the listtails utility found in your $SPLUNK_HOME/bin directory. Without interacting with the server in any way, the utility reads in the configuration of inputs.conf in all bundles, scans your directories and shows you the exact list of files that Splunk will tail when you restart.

Note: The listtails utility requires you to first run the command source setSplunkEnv

Hosts

How host works

Host is the physical device on the network where an event originated. The value of host may be an IP address, hostname, or fully qualified domain name. Host:: is a core field that is indexed and stored with every event.

Host provides an easy way to find all data originating from a given device. Tagging hosts lets you find data from a group of hosts with a common function or configuration.

How host is assigned

Default assignment

If no other host rules are specified for a source, host will be set to a default host value that applies to all data coming via inputs on a given Splunk server. The default host value is the hostname or IP address of the network host. When Splunk is running on the server where the event occurred (which is the most common case) this is correct and no manual intervention is required.

Learn how to set default host for a Splunk server.

Overriding host for remote archive files

If you are running Splunk on a central log archive, or you are working with files copied from other hosts in the environment, you may need to override the default assignment. You can define host assignment for an input based on either a custom host value for all data for that input or matching a portion of the path or filename of a source, such as when you have a directory structure that segregates the log archive for each host in a different subdirectory.

Centralized log server environment

In the case where there is a centralized log host sending events to Splunk, there may be many servers involved. The central log server is called the reporting host. The system where the event occurred is called the originating host (or just the host). In this case you will need to define rules to extract host per event.

Host tagging

The host:: field can be tagged to provide extra information. Host tagging allows multiple hosts to be clustered under ad-hoc categories for more robust searches.

Configuration files for host

Host can be set in inputs.conf. More advanced host extraction configurations require changes to transforms.conf and props.conf. Before manually modifying any configuration file, please read about bundle files.

Set default host for a Splunk server

The host value of an event is the hostname or IP address of the network host that originated the event. When Splunk is running on the server where the event occurred, the assignment of host is straightforward. The default name is the host of the Splunk server.

via SplunkWeb

You can change the default host value via SplunkWeb. Click on the Admin button in the upper right hand corner. You can change the Default host value under the Datastore section:

http://www.splunk.com/assets/doc-images/30_admin11_definehost/host.jpg

via configuration files

This host assignment is written in inputs.conf during installation. You can modify the host entry by editing $SPLUNK_HOME/etc/bundles/local/inputs.conf.

This is the format of the host assignment in inputs.conf:

host = <string>
  * This is a shortcut for MetaData:Host = <string>. It sets the host of
    events from this input to be the specified string. "host::" is 
    automatically prepended to the value when this shortcut is used.

Set your own host value by changing the entry for <string>.

Define host assignment for an input

Use these instructions if you want to explicitly set a host value for all data coming in via a specific configured input. You can set host statically for every event in the same input, or dynamically using regex or segment on the full path of the source. If you need to assign a different host for different sources or sourcetypes in the same input, use the instructions to extract host per event instead.

Statically

This method will assign the same host for every event for the input.

Also, this will only impact new data coming in via the input. If you need to correct the host displayed in SplunkWeb for data that has already been indexed, you will need to tag hosts instead.

via SplunkWeb

Whenever you add a data input in SplunkWeb, you can set the host through the following interface:

http://www.splunk.com/assets/doc-images/30_admin11_definehost/sethost.jpg

Choose Constant value to assign a static value as host for each event that comes from your data source. Enter the value for host in the DNS name or IP address box.

via configuration files

You can edit inputs.conf to specify a host value. Include a host = attribute within the appropriate stanza in $SPLUNK_HOME/etc/bundles/local/inputs.conf:

[<inputtype>://<path>]
host = $YOUR_HOST
sourcetype = $YOUR_SOURCETYPE
source = $YOUR_SOURCE

Set up your inputs.conf stanza or edit existing ones that you have added through the CLI or SplunkWeb.

Example:

[tcp://10.1.1.10:9995]
host = webhead-1
sourcetype = access_common
source = //10.1.1.10/var/log/apache/access.log

This will set the host as "webhead-1" for any events coming from 10.1.1.10, on TCP port 9995.

Dynamically

Use this method if you want to extract the host name from a segment of the source input. For example, if you have an archived directory you want to index, and the name of each file in the directory contains relevant host information, you can use Splunk to extract this information and assign it to the host field.

via SplunkWeb

Follow the steps outlined above. However, instead of choosing Constant value, you can choose either:

Regex on path: Choose this option if you want to extract the host name via a regular expression. Enter the regular expression for host extraction in the regular expression box.

Segment in path: Choose this option if you want to extract the host name from a segment in your data source's path. Enter the segment number in the segment # box.

via configuration files

You can set up dynamic host extraction rules when you are configuring inputs.conf. You can add the following attribute/value pairs to override the host field.

host_regex = <regular expression>
    If specified, the batch monitor will use the specified regular expression
    to extract the host from the filename of each input. Specifically the first
    group of the regex is used as the host. If the regex fails to match, the
    "host =" attribute is used as the host.

host_segment = <integer>
    If specified, the batch monitor will use the specified '/' separated
    segment of the path as the host of each input. If the value is not an
    integer, or is less than 1, the "host =" attribute is used as the host.

Example:

[tail://apache/logs/]
host_segment = 3
sourcetype = access_common

This will extract the host name as the third segment in the path apache/logs.
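
The fallback behavior described for host_segment and host_regex can be modeled in Python as follows. This is a sketch under stated assumptions: segments are counted from 1 with the leading "/" ignored, a segment number beyond the end of the path falls back to the default host, and the paths and host names are invented for the example.

```python
import re


def host_from_segment(path, segment, default_host):
    """Use the Nth '/'-separated path segment as host, else the default."""
    parts = [p for p in path.split("/") if p]
    if isinstance(segment, int) and 1 <= segment <= len(parts):
        return parts[segment - 1]
    return default_host


def host_from_regex(path, pattern, default_host):
    """Use the first capture group of the regex as host, else the default."""
    match = re.search(pattern, path)
    if match and match.groups():
        return match.group(1)
    return default_host


path = "/mnt/archive/web01/access.log"
by_segment = host_from_segment(path, 3, "splunk01")                # 'web01'
by_regex = host_from_regex(path, r"/archive/([^/]+)/", "splunk01")  # 'web01'
```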

Tag hosts

Tagging hosts is useful for knowledge capture and sharing, and for crafting more precise searches. Hosts can be tagged with one or more words describing their function or type, enabling users to easily search for all activity on a group of similar servers.

via SplunkWeb

You can use the drop down arrow next to the host field in SplunkWeb to tag your hosts. Choose Edit tags for this host.

http://www.splunk.com/assets/doc-images/30_admin11_taghostweb/host.jpg

Then, enter your tags, separated by commas:

http://www.splunk.com/assets/doc-images/30_admin11_taghostweb/hosttag.jpg

Host tags vs host names

The host name is extracted at index time. Host tags can be added to any host to provide additional information during searches. Each event can have only one host name but multiple host tags.

For example, if your Splunk server is receiving compliance data from a specific host, tagging that host with compliance will help your compliance searches. With host tags, you can create a loose grouping of data without masking the underlying host name.

Extract host per event

Use these instructions if you want to override the default host name that is assigned to your events.

Configuration

You can explicitly configure the host name of any source or sourcetype via transforms.conf and props.conf.

transforms.conf

Add your custom stanza to $SPLUNK_HOME/etc/bundles/local/transforms.conf. You should configure your stanza as follows:

[$UNIQUE_STANZA_NAME]
DEST_KEY = MetaData:Host
REGEX = $YOUR_REGEX
FORMAT = host::$1

Fill in the stanza name and the regex fields with the correct values for your data.

Leave DEST_KEY = MetaData:Host unchanged so that the value is written to the host:: field. FORMAT = host::$1 writes the first capture group of REGEX into the host:: field.

Please note: You will need to name your stanza with a unique identifier (so it is not confused with a stanza in $SPLUNK_HOME/etc/bundles/default/transforms.conf).

props.conf

You will also need to create a stanza in $SPLUNK_HOME/etc/bundles/local/props.conf to map the transforms.conf regex to the source type in props.conf.

[<spec>]
TRANSFORMS-$name=$UNIQUE_STANZA_NAME

<spec> can be:
1. <sourcetype>, the sourcetype of an event
2. host::<host>, where <host> is the host for an event
3. source::<source>, where <source> is the source for an event

$name is whatever unique identifier you want to give to your transform.

$UNIQUE_STANZA_NAME must match the stanza name of the transform you just created in transforms.conf.

Please note: you can add any valid attribute/value pairs from props.conf when defining your stanza. This will assign the attributes for the <spec> you have set.

Example

In the following log lines, the host is the third whitespace-separated field.

41602046:53 accepted pearl
41602050:29 accepted swan
41602052:17 accepted pearl

Create a regex to extract the host value and add it to a new stanza in $SPLUNK_HOME/etc/bundles/local/transforms.conf:

[station]
DEST_KEY = MetaData:Host
REGEX = \s(\w*)$
FORMAT = host::$1

Now, link your transforms.conf stanza to $SPLUNK_HOME/etc/bundles/local/props.conf so your transforms are called. For example, the above transform works with the following stanza in props.conf:

[source::.../hatch.log]
TRANSFORMS-dharma=station
SHOULD_LINEMERGE = false

The above stanza has the additional attribute/value pair SHOULD_LINEMERGE = false. This specifies that Splunk should create new events at a newline.

Please note: you can add any additional attribute/value pairs from props.conf as needed.

The events now appear in SplunkWeb as the following:

http://www.splunk.com/assets/doc-images/30_admin11_overridehost/hostextract.jpg
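If you want to check a REGEX before adding it to transforms.conf, you can try it against a few sample events first. The sketch below runs the regex from the [station] stanza over the three sample lines. It uses Python's re module as a stand-in for Splunk's regex engine, which is an assumption, and extract_host is a hypothetical helper name.

```python
import re

# Same pattern as REGEX in the [station] stanza above.
HOST_REGEX = re.compile(r"\s(\w*)$")

events = [
    "41602046:53 accepted pearl",
    "41602050:29 accepted swan",
    "41602052:17 accepted pearl",
]

def extract_host(event):
    """Return the first capture group (the $1 in FORMAT), or None if no match."""
    match = HOST_REGEX.search(event)
    return match.group(1) if match else None

hosts = [extract_host(e) for e in events]
print(hosts)  # ['pearl', 'swan', 'pearl']
```

The first capture group is the last word on each line, which is the value that FORMAT = host::$1 writes into the host field.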

Host name configuration examples

There are three .conf files which you can configure for setting the host:: value.

Setting host through inputs.conf

You can configure the host name for any source by setting the host field in your inputs.conf stanza. For example:

[tcp://10.1.1.10:9995]
host = webhead-1
sourcetype = access_common
source = //10.1.1.10/var/log/apache/access.log

You can set host to be extracted via host_regex or host_segment. For example:

[tail:///mnt/logs]
host_segment = 3

Setting host through transforms.conf and props.conf

If you want to configure host on a per event basis, you can use transforms.conf in concert with props.conf to create a custom rule for host extraction. For example, if all your logs are coming to Splunk from a centralized server, and there is no way to tell by the file name, source or sourcetype what the host is, you may still be able to extract the host from the event itself.

For example, Splunk extracts the host from syslog data by using the following stanza in transforms.conf:

[syslog-host]
DEST_KEY = MetaData:Host
REGEX = :\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]+)\]?\s
FORMAT = host::$1

The above works with the following stanza in props.conf:

[syslog]
pulldown_type = true
maxDist = 3
TIME_FORMAT = %b %d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 32
TRANSFORMS = syslog-host
REPORT-syslog = syslog-extractions
SHOULD_LINEMERGE = False
category = Logs:OSs:Unix
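To see what the syslog-host regex captures, the following sketch applies it to two made-up syslog lines. The sample lines and the helper name syslog_host are hypothetical, and Python's re module is used here as an assumed stand-in for Splunk's regex engine.

```python
import re

# Same pattern as REGEX in the [syslog-host] stanza above.
SYSLOG_HOST = re.compile(
    r":\d\d\s+(?:\d+\s+|(?:user|daemon|local.?)\.\w+\s+)*\[?(\w[\w\.\-]+)\]?\s"
)

# Hypothetical sample events, for illustration only.
samples = [
    "Sep 12 15:09:01 webhead-1 sshd[4242]: session opened",
    "Sep 12 15:09:02 local0.info webhead-2 app: started",
]

def syslog_host(event):
    """Return the captured host, or None if the pattern does not match."""
    match = SYSLOG_HOST.search(event)
    return match.group(1) if match else None

for line in samples:
    print(syslog_host(line))
```

The pattern anchors on the seconds field of the timestamp, skips an optional run of numeric tokens or facility.priority tokens such as local0.info, and captures the next token as the host.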

Source Types

How source types work

Sources and source types

A source is a file, stream, or other data input from which Splunk indexes events. For data coming from files and directories, the value of source is the full path, such as /archive/server1/var/log/messages.0 or /var/log/. The value of source for network-based data sources is the protocol and port, such as UDP:514. source:: is a core field that is indexed and stored with every event.

Source type refers to any common format of data produced by a group of sources. Sourcetype:: is also a core field that is indexed and stored with every event. It provides an easy way to find all sources that have the same kind of data but may have different source values because the data was accessed using a different input method, or because an application uses a different configuration or logging path on different systems. For example, you might look for all data of sourcetype::weblogic_stdout even though weblogic might be logging from two different domains in both source::/opt/bea92/user_projects/domains/petstore/servers/myserver/logs/myserver.log and source::/opt/bea92/user_projects/domains/invoicing/servers/myserver/logs/myserver.log.

Similarly, you may collect Linux syslog by tailing source::/var/log/messages on some servers while receiving direct syslog input from udp::514 on others, and still want to find both by searching for sourcetype::linux_syslog.

How sourcetype:: values are set

Automatic source type classification

During indexing, Splunk will try to classify source types automatically by calculating signatures for patterns in the first few thousand lines of any file or stream of network input. These signatures will pick up things like repeating patterns of words, punctuation patterns, line length, etc.

Once Splunk has calculated a signature, it compares the signature to previously seen signatures - if it's a radically new pattern, Splunk will establish a new name for the source type. These sourcetypes will be based on the filename of the first source of that pattern, prefixed by an underscore. Hence a file named error_log that appears to be a significantly new format to Splunk will be sourcetyped sourcetype::_error_log. If another file named error_log comes along that is radically new again and different from the log that Splunk sourcetyped as sourcetype::_error_log, Splunk will qualify the new one as _error_log_u1.

However, when the next file or stream comes along that has a signature that Splunk sees as being very close to a signature it's already seen, it will re-use the sourcetype name it previously established, regardless of what the filename is. So if a file comes along called foobar it will be sourcetyped sourcetype::_error_log if Splunk determines the pattern is very similar to the first file named error_log.
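The naming behavior described above can be modeled with a small Python sketch. This is a toy model of the naming rules only, not Splunk's classifier. The class SourcetypeNamer is hypothetical, signatures are plain strings, the similarity test is a caller-supplied placeholder, and continuing the suffix as _u2, _u3 and so on is an assumption.

```python
class SourcetypeNamer:
    """Toy model: reuse a name for similar signatures, else derive one from the filename."""

    def __init__(self, is_similar):
        self.is_similar = is_similar  # placeholder for Splunk's signature comparison
        self.known = []               # list of (signature, sourcetype name)

    def classify(self, filename, signature):
        # A signature close to one already seen reuses that name,
        # regardless of the new file's name.
        for seen_signature, name in self.known:
            if self.is_similar(signature, seen_signature):
                return name
        # A radically new pattern is named after the file, prefixed by "_".
        base = "_" + filename
        taken = {name for _, name in self.known}
        name, counter = base, 0
        while name in taken:
            counter += 1
            name = f"{base}_u{counter}"  # assumption: suffix numbering keeps counting
        self.known.append((signature, name))
        return name

# Stand-in similarity: exact equality of made-up signature strings.
namer = SourcetypeNamer(lambda a, b: a == b)
print(namer.classify("error_log", "sigA"))  # _error_log
print(namer.classify("error_log", "sigB"))  # _error_log_u1
print(namer.classify("foobar", "sigA"))     # _error_log
```

The third call shows why a file named foobar can end up with the sourcetype _error_log: the name follows the first source that produced the pattern, not the file currently being indexed.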

If you want to configure your own automatic source-type recognition, you can use Splunk's rule-based source type feature. Rule-based source types are automatically assigned based on regular expressions you specify in props.conf. Learn more about how to configure rule-based source types.

Alias source types

Generally, Splunk's automatic source type classification will err on the side of being too granular rather than too generic. So it is likely that you will end up with some source types named sourcetype::_access_log from a farm of web servers with one web application and sourcetype::_access_log_u1 from a farm of web servers with another application, because the URL patterns common to each application vary. You may want to collapse both into something like sourcetype::access_common to reflect that they are all access logs in NCSA common format.

This is when you will want to create an alias for a source type. However, note that aliasing source types is merely cosmetic: it ensures that events are grouped correctly in SplunkWeb and that users can search for sourcetype:: values that make sense. If you have sourcetype-specific indexing properties or extracted fields (both discussed further down this page), you'll need to be sure that the actual stored sourcetype value is set correctly at index time through either training or manual source type assignment.

Train the sourcetype auto-classifier

Splunk's auto-classifier can be trained with a set of representative example files so that it is pre-loaded with sourcetype names that are more descriptive than sourcetype::_error_log-u13. If you train it with a wide enough range of files that you'd like to see given the same sourcetype name, it will both learn good rules and unlearn bad ones, so a larger proportion of new files of that sourcetype will be recognized. Pre-training is how Splunk ships with the ability to assign sourcetype::syslog to most syslog files. While you can bypass Splunk's auto-classification, skip the training step, and simply hardcode a sourcetype for each data input, training may still be more effective if you plan to have Splunk index entire directories of mixed sourcetypes such as /var/log.

If Splunk fails to recognize a common format, or worse yet misclassifies it, we encourage you to report the problem and send us a sample file so we can improve the product. You can anonymize your file using Splunk's built-in anonymizer tool.

Learn how to train Splunk to recognize sourcetypes.

Hard-coded source type assignment

You can bypass automatic source type classification entirely and simply set a source type yourself when you configure a data input. See setting sourcetype for an input. However, this method is not very granular -- all data from the same host or source will be assigned the same source type name.

If you need to give different sources within a single directory input different names, you can try setting source type for a source.

Custom indexing linked to source types

You can tie custom indexing properties to any source type via props.conf. Just set the source type as the <spec> above a props.conf stanza.

Here are a few things you can do:

Tweak default processing

When Splunk indexes a data source, it automatically breaks the input into distinct events and extracts a host and timestamp for the event. The event boundaries, host, and timestamps are important for analysis. If Splunk is not setting the event boundaries or extracting timestamps and hosts correctly, you can easily modify these settings. See timestamp recognition, how host is assigned, and how events are recognized for more information.

Mask sensitive data

Your logs may contain sensitive personal data. For example, there may be social security numbers or passwords in your data that you may wish to cover up. You can create event configuration that masks sensitive data as it is being processed on input.

Change indexing density

When Splunk indexes data, it segments events via major and minor breakers. To save storage space on the indexer, you can change Splunk's default segmentation settings. For example, web proxy logs may contain lengthy URLs that Splunk breaks into many different minor segments. You may wish to change this setting to eliminate unnecessary overhead.

Eliminate processing steps

Certain processing steps can be eliminated to provide faster indexing and better throughput. For example, if you don't need Splunk to search for timestamps within events, you can turn off timestamp extraction. You can also tune down or eliminate event type auto-discovery.

Field extraction linked to source types

You can also associate extracted field rules with source types. Like custom indexing properties, field extraction rules are based on the stored sourcetype:: value set at index time, so aliasing a source type won't cause it to pick up new rules. You'll need to either hardcode a correct sourcetype:: value for the source or input, or train Splunk.

Configuration files for source types

Source type can be set in inputs.conf. Custom indexing properties and rule-based associations of source types are configured through props.conf. Before manually modifying any configuration file, please read about bundle files.

Set source type for an input

Use these instructions to explicitly set a sourcetype for all data coming in via a specific configured input.

If you have a directory input (such as tailing /var/log/), this method assigns the