Documentation: 3.1
This page last updated: 11/06/07 11:11am

How Splunk Works

This section provides an overview of how Splunk works as an orientation for new administrators of Splunk deployments. Later sections of this manual provide step-by-step instructions for various configuration and management tasks.

About Splunk

Splunk is an IT Search engine. It is software that indexes any format of IT data from any source in real time, including logs, configurations, scripts, code, messages, traps, alerts, activity reports, stack traces and metrics from all of your applications, servers and devices. Splunk lets you search, navigate, alert and report on all your IT data in real time using an AJAX web interface.

Indexing

Splunk indexes data in real time. It accesses data using a variety of input methods, applies universal processing techniques to handle any format of IT data, and persists the original raw data along with indexes and additional fields added during processing. Data is persisted in one or more Splunk datastores using the file system.

Data access

Splunk can access data in a variety of ways:

  • tailing live files in specified paths or directories
  • watching for new or changed files in specified paths or directories
  • listening to any UDP or TCP network port for data (e.g. syslog)
  • executing any script and capturing the stdout
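
As a rough illustration of the last input method, the following Python sketch runs a command and captures its stdout as lines. It is not Splunk's implementation; the function name `run_scripted_input` and the stand-in command are assumptions made for this example.

```python
import subprocess
import sys

def run_scripted_input(command):
    """Run a command and return whatever it wrote to stdout, one item per line."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()

# Stand-in for a real diagnostic script: a one-liner that prints two lines.
lines = run_scripted_input(
    [sys.executable, "-c", "print('cpu 12'); print('mem 48')"]
)
```

A real scripted input would typically be run on a schedule, with each run's output handed to the same processing pipeline as any other input.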

Read more about configuring Splunk for a variety of inputs.

SplunkBase offers several scripted inputs already developed by Splunk including access to remote databases over DBI, access to Checkpoint and other firewalls via the OPSEC LEA protocol, and operating system diagnostic commands such as vmstat and iostat. Read more about configuring scripted inputs.

Source and host identification

The input processors capture both the raw data and the name of the source and host where the data originated. The data is then placed in a queue for processing. The default host is the host where the data was accessed, but Splunk can be configured to extract a host from a filename or path when it is indexing files that have been copied or mounted from other hosts.
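
The idea of taking a host from a path can be sketched in Python as follows. The directory layout `/mnt/logs/<host>/...`, the regular expression and the function name are all hypothetical; the sketch only shows the fallback from an extracted host to a default host.

```python
import re
import socket

# Hypothetical layout: files copied from other hosts live under /mnt/logs/<host>/...
HOST_IN_PATH = re.compile(r"^/mnt/logs/(?P<host>[^/]+)/")

def host_for_source(path, default_host=None):
    """Return the host segment of the path, or a default host if none is found."""
    match = HOST_IN_PATH.match(path)
    if match:
        return match.group("host")
    # No host in the path: fall back to the host where the data was accessed.
    return default_host if default_host is not None else socket.gethostname()
```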

Data processing

All data, regardless of input method, then passes through the same series of processors to enrich and organize the data prior to persisting it in the index. These processors are designed to learn to handle new source types automatically and on-the-fly, so that there is no need to create any type of adapter for particular log formats.

Source classification

First, Splunk classifies each source by comparing its pattern-based signature to sources Splunk has seen before. If the signature is close enough, Splunk will assign the source the same source type as very similar previously seen sources. If the signature is very different, Splunk will assign a new source type name. Source types are used both to enable users to search for all similar sources, and to allow Splunk to remember how to process specific sources in order to be more efficient when the same source is seen again.

Read more about sourcetypes, including how to set a sourcetype for a specific input.

Timestamp recognition

Next, Splunk tries to find timestamps in each line of data. It can automatically recognize timestamps in virtually any format. Lines without timestamps are assigned the timestamps of preceding lines. Sources without timestamps at all use the time of processing. Splunk records a normalized timestamp with each event.
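
The fallback rules above can be sketched in Python. This sketch assumes a single timestamp format (`YYYY-MM-DD HH:MM:SS` at the start of a line), whereas Splunk recognizes many formats automatically. It also assumes that lines appearing before the first timestamp receive the processing time.

```python
import re
from datetime import datetime

# Assumption: only one format, "YYYY-MM-DD HH:MM:SS", at the start of a line.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def assign_timestamps(lines, processing_time):
    """Pair each line with a timestamp, following the fallback rules above."""
    last_seen = None
    stamped = []
    for line in lines:
        match = TS.match(line)
        if match:
            last_seen = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        # No timestamp on this line: reuse the preceding one, or the
        # processing time if no timestamp has been seen yet (an assumption).
        stamped.append((last_seen or processing_time, line))
    return stamped
```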

Read more about timestamps, including how to configure Splunk to recognize custom timestamps.

Multiline handling

Splunk then looks for repeating patterns of lines within each source to determine whether the source is single-line (each event is one line) or multiline. If the source is multiline, Splunk finds the likely event boundaries and treats all lines between two boundaries as a single event.
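
To make the notion of an event boundary concrete, the Python sketch below groups lines into events using a much simpler rule than the pattern learning described above: a line that starts with a date begins a new event, and every other line is appended to the current event. The rule and the date format are assumptions chosen for illustration.

```python
import re

# Assumption: an event starts on a line that begins with "YYYY-MM-DD ".
STARTS_WITH_DATE = re.compile(r"^\d{4}-\d{2}-\d{2} ")

def group_events(lines):
    """Merge continuation lines (for example stack trace lines) into events."""
    events = []
    for line in lines:
        if STARTS_WITH_DATE.match(line) or not events:
            events.append([line])      # boundary: start a new event
        else:
            events[-1].append(line)    # continuation of the current event
    return ["\n".join(event) for event in events]
```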

Punctuation and other attributes

Splunk extracts the event's punctuation pattern, to enable users to later find all events of similar structure. It also captures other attributes of the event including minute, hour, day of week, day of month, linecount and many other characteristics that will then be indexed and searchable.
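
One simple way to picture a punctuation pattern is to drop letters and digits from an event and keep what remains. The Python sketch below does this and shows each run of whitespace as an underscore. The exact rules Splunk applies are not given here, so treat this as an approximation.

```python
import re

def punctuation_pattern(event):
    """Drop letters and digits, and show each run of whitespace as one underscore."""
    without_alnum = re.sub(r"[A-Za-z0-9]+", "", event)
    return re.sub(r"\s+", "_", without_alnum)

a = punctuation_pattern("10.0.0.1 - - [06/Nov/2007:11:11:00] GET /index.html")
b = punctuation_pattern("192.168.1.5 - - [07/Dec/2007:01:02:03] GET /a.css")
```

Both sample events reduce to the same pattern, which is what makes the pattern useful for finding events of similar structure.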

Event type discovery

Splunk watches all events as they are being indexed. For each event, it mines combinations of terms and punctuation patterns that are likely to be effective classifications for helping users find like events and break down large search result sets. For example, Splunk will likely discover that the punctuation pattern for an event from a specific firewall, plus the word "deny," makes one good event type; and that the same punctuation pattern plus the word "accept" makes another. You can also set your own rules for event type recognition.

Segmentation

The final stage prior to indexing is to segment the event into searchable terms based on breaking characters. Segmentation can be configured depending on your data.
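
Segmentation on breaking characters can be sketched in Python as a split on a character class. The set of breaking characters below is an illustrative assumption, not Splunk's default set, and lowercasing the terms is likewise a choice made for this example.

```python
import re

# Assumption: an illustrative set of breaking characters, not Splunk's defaults.
BREAKERS = re.compile(r"[\s,;=\[\]\"()]+")

def segment(event):
    """Split an event into lowercase terms at runs of breaking characters."""
    return [term.lower() for term in BREAKERS.split(event) if term]
```

With this set, `segment("action=deny src=10.0.0.1 (fw1)")` yields the terms `action`, `deny`, `src`, `10.0.0.1` and `fw1`. Adding `.` to the breaking characters would split the IP address into separate terms, which is the kind of trade-off segmentation settings control.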

Custom processing

All of the preceding processing steps are applied to events automatically, based on default settings and learning that work for virtually any dataset. Additionally, Splunk can be configured for custom processing, as well as to disable or change the behavior of most default processing.

Any custom processing properties can be set to apply to particular hosts, sources or source types.

Host extraction

For data sources that centralize events from many different hosts or devices prior to being picked up by Splunk, the host where Splunk accessed the event may not be the true originating host. In that case, you can configure Splunk to extract a host value based on a pattern match in each event and use that to override the value of the host field prior to indexing. Splunk is preconfigured to do this for the syslog sourcetype.
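
A pattern-based host override can be pictured with the Python sketch below. It assumes classic BSD-style syslog lines (`Mon DD HH:MM:SS host message`); the regular expression and function name are written for this example and are not Splunk's preconfigured rule.

```python
import re

# Assumption: classic BSD-style syslog lines, "Mon DD HH:MM:SS host message".
SYSLOG_HOST = re.compile(
    r"^[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2} (?P<host>\S+) "
)

def effective_host(event, accessed_host):
    """Use the host named inside the event if present, else the accessed host."""
    match = SYSLOG_HOST.match(event)
    return match.group("host") if match else accessed_host
```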

Meta-events

Splunk lets you navigate and summarize related events by time and keywords on the fly. However, sometimes it's convenient to index all events of a transaction as a single searchable event. In that case, you can configure Splunk to create meta-events at index time based on timing and shared transaction ids.
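
The Python sketch below shows one way to group events into meta-events by shared transaction id and timing. The tuple layout, the 30-second gap and the function name are assumptions for illustration; the actual index-time rules are configured in Splunk.

```python
def build_meta_events(events, max_gap_seconds=30):
    """Group events that share a transaction id and arrive close together.

    events: iterable of (epoch_seconds, txn_id, text), sorted by time.
    Returns a list of groups; each group is a list of the original tuples.
    """
    open_groups = {}   # txn_id -> events collected so far for that id
    finished = []
    for ts, txn, text in events:
        group = open_groups.get(txn)
        if group and ts - group[-1][0] > max_gap_seconds:
            finished.append(group)     # gap too long: close the old group
            group = None
        if group is None:
            group = []
            open_groups[txn] = group
        group.append((ts, txn, text))
    finished.extend(open_groups.values())
    return finished
```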

Timezone offsets

Splunk will automatically normalize timestamps when there is a timezone in the timestamp of the event itself. If you are indexing a directory that has a mix of events forwarded outside of Splunk from a mix of hosts with different timezones, you can configure timezone offsets to be applied on a per host basis.
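
Per-host offsets amount to a lookup followed by a conversion to a common timezone. In the Python sketch below, the host names and offsets are hypothetical and the normalized form is assumed to be UTC.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-host UTC offsets, in hours.
HOST_UTC_OFFSET_HOURS = {"nyc-web01": -5, "lon-web01": 0}

def normalize(local_naive, host):
    """Interpret a zone-less timestamp using the host's offset; return UTC."""
    hours = HOST_UTC_OFFSET_HOURS.get(host, 0)   # unknown hosts: assume UTC
    offset = timezone(timedelta(hours=hours))
    return local_naive.replace(tzinfo=offset).astimezone(timezone.utc)
```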

Custom indexed fields

Splunk indexes all of the terms in the entire event so you don't need to define any fields in order to be able to do freeform searches. However, if you want to be able to search for the value of a particular field by name and avoid false matches of the same string appearing elsewhere in events, you can match and index custom fields. For example, you may define a pattern to differentiate sourceip and destip values.

Custom indexed fields are also useful to implement granular access controls.
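
The sourceip and destip example can be sketched in Python with named groups. The `src=` and `dst=` line format is a hypothetical firewall layout chosen for this illustration.

```python
import re

# Hypothetical firewall line format: "... src=<ip> dst=<ip> ..."
FIELDS = re.compile(
    r"src=(?P<sourceip>\d+(?:\.\d+){3})\s+dst=(?P<destip>\d+(?:\.\d+){3})"
)

def extract_fields(event):
    """Return named field values from the event, or an empty dict if no match."""
    match = FIELDS.search(event)
    return match.groupdict() if match else {}
```

Searching by field name then distinguishes an address seen as a source from the same address seen as a destination, which a freeform search for the string alone cannot do.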

Filtering & masking

You may have some events that are not useful. You can configure Splunk to match and discard events based on patterns. You may also have elements within your events that are confidential or private and not necessary to Splunk users, such as social security numbers. You can configure Splunk to recognize and replace patterns within events.
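
Discarding and masking can both be expressed as pattern operations. In the Python sketch below, the `DEBUG` discard rule and the replacement text are hypothetical; the social security number pattern covers only the common `NNN-NN-NNNN` form.

```python
import re

DISCARD = re.compile(r"\bDEBUG\b")           # hypothetical "not useful" pattern
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")   # NNN-NN-NNNN only

def filter_and_mask(events):
    """Drop events matching DISCARD and mask SSN-like strings in the rest."""
    kept = []
    for event in events:
        if DISCARD.search(event):
            continue                          # discard the whole event
        kept.append(SSN.sub("XXX-XX-XXXX", event))
    return kept
```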

Overriding automated processing

While Splunk's default behavior provides appropriate processing for most kinds of data, you may want to override this behavior to correct mistakes or in order to get faster indexing or more effective search for a particular dataset. For example, you may override the boundaries that Splunk determines for multiline events, or you may configure Splunk to index less densely to save space and increase indexing performance.

Data persistence

Once events are processed, Splunk builds a dense index on all of the terms in the raw events plus additional fields created during processing, including timestamp, host, source, sourcetype and punct (short for punctuation). These indexes are stored in the filesystem along with compressed files containing the original, raw data. Policies control data retention, archiving and retirement based on age or disk utilization. The filesystem-based datastore is compatible with write-once, read-many storage solutions as well as local disk and SAN / NAS storage.

User interaction

Search, navigation & reporting

Users can perform freeform searches with booleans and wildcards using both the SplunkWeb and command line (CLI) interfaces. They can also summarize and analyze search results by piping them through advanced search language operators and by using SplunkWeb's interactive filtering, reporting and charting features.

Scheduling & alerting

Any search or report can be scheduled, and rules can be defined to alert based on the number and content of scheduled search results. Results and alerts can be sent via email, RSS, SMS, SNMP and syslog. Administrators can extend alerting options with custom alerting scripts.

Knowledge capture & sharing

Splunk is designed to capture the knowledge of its users on-the-fly and facilitate pooling of knowledge among different users. All of Splunk's knowledge capture features are designed to work at search time, and enrich already indexed data.

Saving searches, alerts & reports

Users can save any search or report and define alerts based on any saved search. They can choose to keep saved searches and reports to themselves, or share them with all other users.

Dashboards

Users can pull any saved search or report as well as other data onto an unlimited number of custom dashboards. Administrators can define starting dashboards for each user role.

Event types

Users can rename, tag or delete auto-discovered event types in order to make them more descriptive of their own data. They can also save any search as an event type definition. These changes reclassify already indexed data on-the-fly to make events more informative, and let users easily find all events sharing common characteristics.

Event types are shared amongst all users of a given Splunk server, so novices naturally benefit from the contributions of experts in particular datasources.

Different event typing and tagging schemes can co-exist harmoniously, so one team-member can classify events from a security perspective while another classifies the same events differently from an operational perspective.

Source type aliasing

Automatically assigned source type names can be aliased in the search interface to more friendly or accurate names. Aliases can also mask excessively granular source typing for users.

Host tagging

Hosts can be tagged with one or more words describing their function or type, enabling users to easily search for all activity on a group of similar servers.

Custom extracted fields

Splunk automatically extracts fields that are obviously discoverable in events and lets users search and filter on them. These extracted fields are found entirely at search time, so they don't require accurate parsing into a brittle schema. Administrators can extend this capability with custom extracted fields based on patterns they define. Because custom extracted fields are also applied at search time, data doesn't need to be re-indexed to benefit from new field definitions.

Users and access controls

Splunk provides its own native authentication as well as integration with LDAP servers including Active Directory, OpenLDAP and Novell eDirectory. By default it has three standard roles - user, power and admin - and all users can search all data in a given Splunk server. Splunk 3.0 includes granular access controls, where different LDAP roles can be mapped to subsets of data based on values in indexed fields.

Distributed deployments

Splunk has been designed to support a variety of distributed deployment models to enable real-time centralization of data across hundreds or thousands of source hosts, applications and devices as well as to scale linearly and provide a centralized view across different data silos. There are two primary ways in which different Splunk servers communicate.

Data distribution

Splunk servers can route data to one another (as well as to other systems) in real time during indexing, so that data can be accessed on one host and sent to another host for indexing and search. Splunk servers can route data to load-balanced groups of other Splunk servers, to enable horizontal scaling via clustered indexing. Splunk servers can also clone data to multiple load-balanced groups of other Splunk servers to provide for data redundancy in high availability environments.
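
The combination of load balancing and cloning can be sketched in Python as follows: each event goes to one server in every group, and servers within a group take turns. Round-robin selection, the class name and the server names are assumptions; the text above does not say how Splunk chooses a server within a group.

```python
from itertools import cycle

class Router:
    """Clone each event to every group; rotate through servers within a group."""

    def __init__(self, groups):
        # One round-robin iterator per load-balanced group (an assumption).
        self._groups = [cycle(group) for group in groups]

    def route(self, event):
        """Return (server, event) pairs: one server per group gets a copy."""
        return [(next(group), event) for group in self._groups]

router = Router([["idx1", "idx2"], ["dr1"]])
first = router.route("e1")
second = router.route("e2")
```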

Distributed search

Splunk servers can be configured to distribute search requests to other Splunk servers and merge the results back to the user. This combines with load-balanced indexing to provide horizontal scaling to search and index hundreds of gigabytes or terabytes per day.
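
Merging results from several servers can be pictured as a merge of already-sorted lists. The Python sketch below assumes each server returns `(epoch_seconds, event)` pairs sorted newest first; that result shape is an assumption for this example.

```python
import heapq

def merge_results(*per_server_results):
    """Merge per-server result lists, each sorted newest first, into one list."""
    return list(
        heapq.merge(*per_server_results, key=lambda r: r[0], reverse=True)
    )
```

Because each input is already sorted, the merge never needs to re-sort the combined result set.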

Distributed search also provides a good way to let select users correlate data across different data silos, even loosely coupling different deployments of Splunk that originated in different teams in the same organization.

Configuration

Splunk uses a modular, file-based configuration model that provides for easy upgrades, migration of configuration amongst different Splunk servers, and sharing of solutions between different deployments. It is designed to work well with third-party configuration management tools, while also offering its own robust central configuration management capabilities.

Configuration files

Splunk stores all of its configurations in files consisting of simple attribute-value pairs. This includes saved searches, alerts, reports and event types created by users, input and indexing configurations, distributed search configurations, data cloning and routing configurations, SplunkWeb interface settings and user preferences.

Bundles

All of these configuration files exist within directories that are known as bundles. The configurations in every bundle are merged to build the overall configuration for a given Splunk server. Bundles can be migrated between installations, and can update default configurations non-destructively.
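
Merging attribute-value pairs from several bundles can be sketched in Python as a layered dictionary update. The grouping of attributes under named sections, the attribute names and the rule that later bundles take precedence are all assumptions for illustration; the actual precedence rules are not described here.

```python
def merge_bundles(bundles):
    """Merge bundles given lowest precedence first.

    bundles: list of {section: {attribute: value}} dictionaries.
    Later bundles override individual attributes without touching the
    dictionaries of earlier bundles.
    """
    merged = {}
    for bundle in bundles:
        for section, attrs in bundle.items():
            merged.setdefault(section, {}).update(attrs)
    return merged

# Hypothetical attribute names, for illustration only.
default = {"syslog": {"TZ": "UTC", "enabled": "true"}}
local = {"syslog": {"TZ": "EST"}}
merged = merge_bundles([default, local])
```

The default bundle is left unmodified, which mirrors the non-destructive update described above.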

Deployment server

In a distributed deployment, any Splunk server can be configured as a deployment server and can control the configuration of all other Splunk servers. Splunk servers that act as deployment clients can poll their deployment server periodically for new configuration, or IP multicast can be used. When new configuration is detected, they can pull the configuration from the deployment server. Servers with common configurations can be controlled as a single class, such as Splunk servers that are accessing the same files across hundreds of identical application server hosts.

Configuration interfaces

The SplunkWeb interface includes an admin section for basic configuration of data inputs, distributed search, data routing, users and saved searches. The Splunk CLI provides command-based configuration with additional capabilities to control deployment servers and clients as well as index management.

Splunk 3.0 introduces a new, more powerful way of viewing and defining configuration details by enabling you to search your configuration files from the same search interface as you search indexed data. SplunkWeb allows administrators to edit configurations inline with search results. This approach is currently limited to select configuration elements but will be expanded in future releases to become the primary means of configuring Splunk.

SplunkBase

SplunkBase is a hosted service that is designed to be used by all members of the Splunk community in conjunction with their Splunk software to enable them to share and find solutions and pool their knowledge of IT data.

SplunkBase integrates with the Splunk software in the following ways:

  • Users can share, browse and download add-ons containing useful reports, dashboards, fields, searches, alerts and other useful content.
  • Users can look up events in their search results and find descriptions for event types that match their events. These descriptions link into a hierarchical knowledge base of information about different IT data sources and troubleshooting topics.

Developer APIs and integration features

Splunk's software is designed to be extensible by users and easy to integrate with other tools in the environment. A browser toolbar and permalink URLs let you launch Splunk from within other interfaces with contextually relevant searches. Search and management APIs let developers build their own interface or integrate Splunk searches into existing interfaces. A modules API lets developers extend Splunk's processing with custom C/C++ processors. The SplunkWeb interface is designed to be easily skinnable, customizable and embeddable. The Developer Guide provides details on using developer APIs and integration features.

Licensing

Splunk is a single software download that can run with different license levels. You can use Splunk for up to 7 days without a license. After that you can add either a free or enterprise license. Enterprise licenses are sold on a perpetual basis. We also offer 30-day trial Enterprise licenses.

Some features are limited to the enterprise license, including access controls, distributed search, receiving data routed from other Splunk servers and deployment server.

Licenses also include limits on the volume that can be indexed daily. When a license expires, or if there is an excessive number of license volume violations, searches will be blocked and an error message will appear. However, we won't stop indexing your data, as we don't want you to lose critical data due to an unanticipated change in your environment. You can resume searching once your indexing volume has returned to within your license limit, or once you apply a new license with a higher volume level or a new expiration date.
