Turbocharging Modular Inputs with the HEC (HTTP Event Collector) Input

HTTP Event Collector (HEC)

Splunk 6.3 introduces a new high-performance data input option that lets developers send event data directly to Splunk over HTTP(s). This is called the HTTP Event Collector (HEC).

In a nutshell, the key features of HEC are:

  • Send data to Splunk via HTTP/HTTPS
  • Token-based authentication
  • JSON payload grammar
  • Acknowledgment of sent events
  • Support for sending batches of events
  • Keep-alive connections

A typical use case for HEC is a developer who wants to send application events to Splunk directly from their code in a manner that is highly performant and scalable, without having to write to a file that is monitored by a Universal Forwarder.

But I have another use case that involves using HEC to boost the performance and throughput capabilities of your Modular Inputs.

How Modular Inputs output data to Splunk

The default mechanism by which Modular Inputs get data into Splunk is to write it out to STDOUT, from which Splunk reads it.

The data can be streamed out to Splunk in 2 modes, Simple or XML.

More detailed information on this output streaming is available in the Splunk documentation for Modular Inputs.
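To make the STDOUT mechanism concrete, here is a minimal sketch of what writing one event in XML mode might look like. The class and method names are hypothetical and only the basic `<stream>`, `<event>` and `<data>` elements are shown, so treat it as an illustration rather than the code from any of my Modular Inputs.

```java
import java.io.PrintStream;

public class XmlStreamSketch {

    // Escape the characters that cannot appear raw in XML text content.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wrap one event's text in the <event><data> elements used by XML mode.
    static String toXmlEvent(String data) {
        return "<event><data>" + escapeXml(data) + "</data></event>";
    }

    public static void main(String[] args) {
        PrintStream out = System.out;
        // The stream element is opened once and events are written inside it.
        out.println("<stream>");
        out.println(toXmlEvent("some test data"));
        out.println("</stream>");
        out.flush();
    }
}
```

Every event goes through this single pipe between the Modular Input process and Splunk, which is where the throughput ceiling comes from.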

However, for high volume / high throughput Modular Inputs, this approach simply does not perform well, and typically you have to resort to deploying (n) Modular Inputs across (n) Forwarders in order to achieve your desired scale.

HEC to the rescue

But what if we were to bypass the XML / STDOUT mechanism altogether and plug in a better performing output such as HEC?

Well, it turns out that it is actually very simple to do this. It just requires a little bit of code and wiring up some Modular Input configuration parameters for the HEC settings.

Which Modular Inputs have I chosen

I have currently added HEC output options for the following Modular Inputs.

Why these ones? Well, these Modular Inputs are the ones that see the highest loads and require the most performant solution to deal with high volumes of events and daily data ingestion. Ergo, first cabs off the rank.

My implementation in a nutshell

HTTP implementation

As these are Java based Modular Inputs, I utilized Apache HttpComponents to facilitate the HTTP communications logic for the HEC connection and event transport to Splunk. The particular component implemented is HttpAsyncClient.

This facilitates the following core functionality:

  • connection pooling of HTTP(s) connections to Splunk
  • non-blocking asynchronous HTTP(s) operations
  • HTTP(s) Keep Alive on connections
  • HTTP and HTTPS (although HTTP is recommended for better performance)

The JSON payload that gets POSTed to Splunk looks like:

{"event":"some test data","index":"main","source":"foo","sourcetype":"goo"}
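As a rough sketch of how such a payload could be POSTed asynchronously, the example below uses the JDK's built-in `java.net.http.HttpClient` (Java 11+) as a stand-in for HttpAsyncClient, so it is not the code from my Modular Inputs. The host, the port (8088 is the HEC default) and the all-zero token are placeholder assumptions, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;

public class HecPostSketch {

    // Build a POST to the HEC endpoint; the token travels in the Authorization header.
    static HttpRequest buildRequest(String host, int port, boolean https, String token, String json) {
        String scheme = https ? "https" : "http";
        URI uri = URI.create(scheme + "://" + host + ":" + port + "/services/collector");
        return HttpRequest.newBuilder(uri)
                .header("Authorization", "Splunk " + token)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(json, StandardCharsets.UTF_8))
                .build();
    }

    // Send without blocking the caller; the future completes with the HTTP status code.
    static CompletableFuture<Integer> sendAsync(HttpClient client, HttpRequest request) {
        return client.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::statusCode);
    }

    public static void main(String[] args) {
        String json = "{\"event\":\"some test data\",\"index\":\"main\",\"source\":\"foo\",\"sourcetype\":\"goo\"}";
        // Placeholder host, port and token: substitute the values from your own HEC input.
        HttpRequest request = buildRequest("localhost", 8088, false,
                "00000000-0000-0000-0000-000000000000", json);
        System.out.println(request.method() + " " + request.uri());
        // With a reachable HEC input, one shared client would be reused for every send:
        // HttpClient client = HttpClient.newHttpClient();
        // sendAsync(client, request).thenAccept(s -> System.out.println("HEC status " + s));
    }
}
```

Reusing a single client instance is what lets connections be pooled and kept alive between sends.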

Source, Sourcetype and Index Override

When you set up a HEC input in Splunk, you can specify the source, sourcetype and index for events received via HEC. However, when setting up your Modular Input, either by directly editing inputs.conf or via a setup page in SplunkWeb, you can also specify the source, sourcetype and index. I want the values from the Modular Input's configuration to take precedence. Luckily, these fields get passed to the Modular Input when it instantiates, so I am able to incorporate them into the JSON event, as you can see above, and they will be applied when the data is indexed in Splunk.
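A minimal sketch of that idea, assuming the three values have already been read from the Modular Input's configuration: the hypothetical helper below builds the JSON event by hand and simply leaves out any field that is not set, so that the defaults on the HEC input would apply instead. It illustrates the approach and is not the actual implementation.

```java
public class HecEventSketch {

    // Escape a value so it is safe inside a JSON string literal.
    static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Append an optional metadata field; unset values are left out of the payload.
    static void appendField(StringBuilder sb, String name, String value) {
        if (value != null && !value.isEmpty()) {
            sb.append(",\"").append(name).append("\":\"").append(jsonEscape(value)).append('"');
        }
    }

    // Build one HEC event carrying the index, source and sourcetype overrides.
    static String toHecJson(String event, String index, String source, String sourcetype) {
        StringBuilder sb = new StringBuilder("{\"event\":\"").append(jsonEscape(event)).append('"');
        appendField(sb, "index", index);
        appendField(sb, "source", source);
        appendField(sb, "sourcetype", sourcetype);
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(toHecJson("some test data", "main", "foo", "goo"));
    }
}
```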

Enabling HEC for the Modular Input

This can be done via a manager setup page in SplunkWeb for the Modular Input stanza. The example below is taken from the Kafka Modular Input.

[Screenshot: HEC settings on the Kafka Modular Input setup page]

Or you can set up HEC by editing inputs.conf directly if you do not have a SplunkWeb UI, i.e. when deploying on a Universal Forwarder.

The token, port and https settings are whatever you specified when you set up your HEC input in Splunk.

Batch Mode

HEC also supports sending batches of events to Splunk, which is recommended for superior performance. So I have also implemented this and provided the necessary knobs and dials to allow you to configure it for your environment. A batch is flushed when the first of 3 possible thresholds is met:

  • Data Size: send the batch of collected events to Splunk when this data size is reached
  • Event Count: send the batch of collected events to Splunk when this number of events is reached
  • Inactivity Time: send the batch of collected events to Splunk after this time period of seeing no events arrive

The JSON payload for sending batches of events to Splunk is just single events concatenated together.

{"event":"some test data","index":"main","source":"foo","sourcetype":"goo"}{"event":"some test data","index":"main","source":"foo","sourcetype":"goo"}
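The three thresholds and the concatenated payload can be sketched roughly as follows. This is an illustrative buffer, not the code from my Modular Inputs: the class name, the constructor parameters and the idea of passing the current time in explicitly (which keeps the example easy to test) are all assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class HecBatchSketch {
    private final long maxBytes;
    private final int maxEvents;
    private final long maxInactiveMillis;
    private final Consumer<String> sender;   // receives the concatenated batch payload

    private final StringBuilder buffer = new StringBuilder();
    private long bufferedBytes = 0;
    private int bufferedEvents = 0;
    private long lastEventMillis = 0;

    public HecBatchSketch(long maxBytes, int maxEvents, long maxInactiveMillis, Consumer<String> sender) {
        this.maxBytes = maxBytes;
        this.maxEvents = maxEvents;
        this.maxInactiveMillis = maxInactiveMillis;
        this.sender = sender;
    }

    // Add one JSON event; flush if the data size or event count threshold is reached.
    public synchronized void add(String jsonEvent, long nowMillis) {
        buffer.append(jsonEvent);   // events are concatenated with no delimiter
        bufferedBytes += jsonEvent.getBytes(StandardCharsets.UTF_8).length;
        bufferedEvents++;
        lastEventMillis = nowMillis;
        if (bufferedBytes >= maxBytes || bufferedEvents >= maxEvents) {
            flush();
        }
    }

    // Meant to be called periodically by a timer; flush if no event arrived recently.
    public synchronized void checkInactivity(long nowMillis) {
        if (bufferedEvents > 0 && nowMillis - lastEventMillis >= maxInactiveMillis) {
            flush();
        }
    }

    // Hand the buffered payload to the sender and reset the counters.
    public synchronized void flush() {
        if (bufferedEvents == 0) {
            return;
        }
        sender.accept(buffer.toString());
        buffer.setLength(0);
        bufferedBytes = 0;
        bufferedEvents = 0;
    }

    public static void main(String[] args) {
        List<String> sent = new ArrayList<>();
        HecBatchSketch batch = new HecBatchSketch(1_000_000, 2, 1000, sent::add);
        String event = "{\"event\":\"some test data\"}";
        batch.add(event, 0);
        batch.add(event, 10);   // the second event reaches the event count threshold
        System.out.println(sent);
    }
}
```

In a real input the `sender` would POST the payload to HEC, and a scheduled task would call `checkInactivity` with the current time.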

More information

Using the HTTP Event Collector
HEC Developer Home
HEC Logging libraries: Java, .NET, JavaScript
Modular Inputs code on Github
Developer Guidance

Posted by Damien Dallimore