
If you have ever used Splunk, you can probably come up with a number of reasons to use a Splunk forwarder whenever possible to send data to Splunk. To quickly illustrate some of the benefits: a Splunk forwarder maintains an internal index of where it left off when sending data, so if the Splunk indexer has to be taken offline for some reason, the forwarder can resume its task once the indexer is brought back up. Additionally, a forwarder can automatically load balance traffic across multiple Splunk indexers. There’s already a Splunk blog post devoted to getting data into Splunk that highlights a forwarder’s benefits, and I encourage you to review it.
But what if using a Splunk forwarder is not an option for political or other reasons? Well, let’s consider such a use case and its challenges.
Use Case
Company XYZ has already invested time and resources deploying Apache Flume throughout its infrastructure. Apache Flume is currently in use by many teams, and they all claim it’s great at collecting and moving log data. Your boss therefore insists on leveraging Flume to send the SMTP log to Splunk for analysis and visualization. But that’s not without its challenges.
Challenges
A Splunk sink does not ship with Apache Flume by default. The development team tells you that custom components for an Apache Flume agent can certainly be written. However, coding skills are limited within your team, and the development team cannot take on more requests for now. Sound familiar? Sigh.
Flume Data Flow
So you’re on your own again, having to design a solution that involves Flume. Before we get to that solution, a short description of a Flume agent and its data flow is warranted.
An Apache Flume agent manages data flow from a point of origin to a destination. It’s a JVM process with a source component that consumes events in a specific format, a channel that holds those events, and a sink that forwards them to the next hop or destination. To learn more about Flume, you can visit the Apache Flume documentation, but this should be enough for this post.
Now, it’s time to explore our solution.
Solution
Do not panic: a solution that does not require any development exists. Splunk is a universal machine data platform and does not impose hard requirements the way other point solutions do. We will therefore leverage Flume’s built-in Thrift sink to send our SMTP log to Splunk after consuming it. The diagram below illustrates the process.
Configuration
Our configuration will involve three steps:
1) Configuring the Splunk input
A TCP input will be configured to accept data from the Flume agent. Instructions for doing so can be found in the “Getting Data In” Splunk guide.
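As a minimal sketch, assuming the input is defined in inputs.conf on the indexer rather than through Splunk Web, a raw TCP input listening on port 1997 (the port used by the Flume sink below) could look like the following; the sourcetype name smtp_maillog is only an illustrative choice:
# inputs.conf on the Splunk indexer (sketch; sourcetype name is illustrative)
[tcp://:1997]
sourcetype = smtp_maillog
connection_host = ip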
2) Creating a Flume agent configuration file
The sample configuration file provided below continuously reads the SMTP log file and sends new events via TCP to the Splunk indexer. In our example, the Flume agent is called "splunk".
##################################################
# Flume Configuration Example
# Sources, channels and sinks are defined per agent.
# In this example the Flume agent is called "splunk",
# the sink "indexer", and the source is a tail of the SMTP log
# Defining sources, channels and sinks
# We will be reading a log file as source
# Our channel si-1 will hold data for the Splunk Indexer
splunk.sources = reader
splunk.channels = si-1
splunk.sinks = indexer
# For each one of the sources defined above, the type is defined
# The mail log will be read using the tail command
splunk.sources.reader.type = exec
splunk.sources.reader.command = tail -f /var/log/maillog
# Error is simply discarded, unless logStdErr=true
splunk.sources.reader.logStdErr = true
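# Restart the tail command if it dies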
splunk.sources.reader.restart = true
# Memory channel si-1 will be used.
splunk.sources.reader.channels = si-1
# Each sink’s type must be defined with the following:
# Type: the default thrift sink is used
# Hostname or IP address of the sink, our Splunk indexer
# The IP address is 10.0.0.153
# The TCP port the Splunk indexer is configured to listen on
# in this case, port 1997
splunk.sinks.indexer.type = thrift
splunk.sinks.indexer.hostname = 10.0.0.153
splunk.sinks.indexer.port = 1997
# Specify the channel(s) the sink will use
splunk.sinks.indexer.channel = si-1
# Each channel’s type is defined
splunk.channels.si-1.type = memory
# Specify the capacity of the memory channel
# The maximum number of events stored in the channel
splunk.channels.si-1.capacity = 100
##################################################
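Once the configuration file is saved (splunk.conf is just an assumed file name here), the agent can be started with the standard flume-ng launcher, naming the agent "splunk" to match the configuration above:
bin/flume-ng agent --conf conf --conf-file splunk.conf --name splunk -Dflume.root.logger=INFO,console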
3) Configuring line breaking in Splunk
Because some of the SMTP log events will come across merged together, we need to configure line breaking in Splunk to split the events properly. In our use case, this is easily handled with the regex LINE_BREAKER = (\\x\w{2})+. We also set the appropriate host value at input creation time, as Splunk offers several options for setting such parameters. Below is a screenshot of the SMTP events after being ingested and indexed in Splunk.
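For reference, the same line-breaking rule can also be expressed in props.conf on the indexer. The sketch below assumes the stanza name matches the sourcetype assigned to the TCP input in step 1 (smtp_maillog is only an illustrative name):
# props.conf on the Splunk indexer (sketch)
[smtp_maillog]
SHOULD_LINEMERGE = false
LINE_BREAKER = (\\x\w{2})+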
Conclusion
If you have a choice between using a Splunk forwarder and another method to get data into Splunk, always pick the forwarder for its ease of use, flexibility and other benefits. However, Splunk also lets you implement other ingestion methods, allowing you to conform to corporate policies and requirements without an army of developers and consultants, and the associated costs. So the next time you’re faced with a mandate to use Flume or another ingestion method instead of a forwarder, do not despair.
Happy Splunking!
----------------------------------------------------
Thanks!
Julian Andre