
AWS Firehose to Splunk - Two Easy Ways to Recover Those Failed Events

With Kinesis Firehose being Splunk’s preferred option for collecting logs at scale from AWS CloudWatch Logs, we’ve seen plenty of posts on setting it up, automating it, and transforming event content. But what about when things go wrong?

When Kinesis Firehose fails to write to Splunk via HEC (due to a connection timeout, HEC token issues or other connectivity issues), it will eventually write its logs into a “splashback” S3 bucket to ensure that no data is lost. However, if you want to retry sending those logs to Splunk, you will find that the objects Firehose writes to the “splashback” bucket are wrapped in JSON with additional information about the failure, and that the original message is base64 encoded.

This makes re-ingesting these “failed” logs a little more complex than simply pointing the Splunk Add-On for AWS at the bucket, for instance, because the message contents cannot be decoded directly on their way into Splunk. Also, note that Firehose cannot ingest directly from S3.
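To make the wrapping concrete, here is a minimal Python sketch of unwrapping one failed record. The field names (`rawData`, `errorCode`, `errorMessage`, `attemptsMade`) and the sample values are assumptions made for illustration, not taken from this post, so check a real object in your own bucket before relying on them.

```python
import base64
import json

# Illustrative failed-delivery record. Field names and values are assumptions.
sample_line = json.dumps({
    "attemptsMade": 4,
    "errorCode": "Splunk.ConnectionTimeout",
    "errorMessage": "The connection to Splunk timed out.",
    "rawData": base64.b64encode(
        b'{"event": "user login", "sourcetype": "aws:cloudwatchlogs"}'
    ).decode("ascii"),
})


def decode_failed_record(line: str) -> str:
    """Return the original payload that Firehose tried to send via HEC."""
    record = json.loads(line)
    # The original message is carried base64 encoded inside the JSON wrapper.
    return base64.b64decode(record["rawData"]).decode("utf-8")


original = decode_failed_record(sample_line)
print(original)
```

The decoded string is what would have reached HEC had the delivery succeeded. The failure details around it are only useful for troubleshooting.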

This blog describes two simple options for re-ingesting these logs using Lambda functions:

  1. writing the decoded messages back into S3 and ingesting them with the Splunk Add-On for AWS
  2. sending the decoded messages back into a Firehose data stream

These solutions work with both Splunk Enterprise (on-premises or in your own cloud) and Splunk Cloud.

The Splunk Add-On for AWS route

Splunk Add-On for AWS

The main component of this solution is a simple Lambda function that makes the failed logs ingestible by the Add-On. Once set up, the function is triggered whenever Firehose writes an object containing failed logs to the S3 bucket. The function reads the contents of the object, extracts and decodes the “raw content” that Firehose attempted to send via HEC, and writes the output back into S3. The output goes to the same bucket, but as an object prefixed with SplashbackRawFailed/.

These objects can then be ingested by the Splunk Add-On for AWS using the standard inputs and configuration for S3 ingest - we would recommend using the SQS-based S3 input.

The flow of data for a “failed” scenario, as shown in the above diagram, is as follows:

  1. Initial logs generated and written to a CloudWatch log group. Firehose, with a subscription filter, pulls this into Kinesis. Optionally, (1b), a Lambda function does some processing/transformation on the log events. 
  2. Firehose attempts to write a batch of events to Splunk via HEC. For this example, there’s a failure to connect.
  3. After a retry and timeout period, the failed events are written to the “splashback” S3 bucket.
  4. An Object “Put” notification from S3 triggers the Lambda function, which reads the failed events from the object, decodes the content and writes the events back into S3 in the original format.
  5. S3 Object “Put” notification is sent to SNS and subsequently into an SQS subscription.
  6. The SQS-based S3 input on the Add-On for AWS reads the logs from the S3 object and writes them to Splunk. (The Add-On would usually run either on a Heavy Forwarder or on the Inputs Data Manager in Splunk Cloud.)
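Step 4 above could be sketched in Python roughly as follows. This is not the sample function from the repository linked at the end of this post. It assumes each line of the object is one JSON record whose base64 payload sits in a field named `rawData`, and the optional `s3` parameter is a hypothetical hook added so the logic can be exercised without AWS.

```python
import base64
import json
from urllib.parse import unquote_plus

OUTPUT_PREFIX = "SplashbackRawFailed/"  # prefix named in this post


def decode_object_body(body: str) -> str:
    """Decode every failed record in an object body, one payload per line."""
    payloads = []
    for line in body.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # "rawData" is an assumed field name for the base64-encoded payload.
        payloads.append(base64.b64decode(record["rawData"]).decode("utf-8"))
    return "\n".join(payloads)


def handler(event, context, s3=None):
    """S3 "Put" trigger: decode failed events and write them back to the bucket."""
    if s3 is None:
        import boto3  # imported lazily so the pure logic can run offline
        s3 = boto3.client("s3")
    written = []
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        key = unquote_plus(rec["s3"]["object"]["key"])  # keys arrive URL-encoded
        if key.startswith(OUTPUT_PREFIX):
            continue  # skip our own output so the function does not re-trigger
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        out_key = OUTPUT_PREFIX + key
        s3.put_object(
            Bucket=bucket,
            Key=out_key,
            Body=decode_object_body(body).encode("utf-8"),
        )
        written.append(out_key)
    return written
```

Because the output lands in the same bucket that triggers the function, the prefix check (or an equivalent prefix filter on the S3 notification) matters: without it, each decoded object would invoke the function again.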

The Firehose Re-ingest route

This solution is very similar to the previous method and uses a Lambda function to read from the S3 “splashback” bucket. However, rather than writing the output into S3, the function writes back into a Kinesis Firehose data stream. The advantage of this method over the first is that the data collection method into Splunk doesn't change, and no Add-On configuration is required. 

Firehose re-ingest

For this method, although it would technically be possible to re-ingest back into the same Firehose, a separate dedicated “re-ingest” Firehose data stream is recommended. This has two advantages: it adds the option of sending the events to a separate Splunk HEC token input (or even a separate instance), and it can provide a “generic” retry capability for any Firehose. (Note that the sample code takes this generic approach.)

The flow of data for a “failed” scenario, as shown in the above diagram, is as follows:

  1. Initial logs generated and written to a CloudWatch log group. Firehose, with a subscription filter, pulls this into Kinesis. Optionally, (1b), a Lambda function does some processing/transformation on the log events. 
  2. Firehose attempts to write a batch of events to Splunk via HEC. For this example, there’s a failure to connect.
  3. After a retry and timeout period, the failed events are written to the “splashback” S3 bucket.
  4. An Object “Put” notification from S3 triggers the Lambda function, which reads the failed events from the object and decodes the content.
  5. The function writes the events back into a Retry Firehose (a separate Firehose data stream, not shown on the diagram).
  6. By this point, Firehose connectivity to Splunk has hopefully recovered. If not, the events loop back into the Retry Firehose, following the same process again until the number of retry attempts has been exceeded. (This will eventually result in the messages being sent to the original S3 bucket with a prefix of SplashbackRawFailed/, as in the first solution.)
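As a rough illustration of step 5, the sketch below groups decoded payloads into batches and sends them with the Firehose `PutRecordBatch` API, which accepts at most 500 records and 4 MiB per call. The function names and the `firehose` parameter (a hook for passing in a client) are hypothetical, and the sample code in the repository may be organised differently.

```python
from typing import Iterable, Iterator, List

MAX_RECORDS = 500             # PutRecordBatch limit: records per call
MAX_BYTES = 4 * 1024 * 1024   # PutRecordBatch limit: bytes per call


def batch_records(payloads: Iterable[bytes]) -> Iterator[List[dict]]:
    """Group payloads into batches that respect the PutRecordBatch limits."""
    batch, size = [], 0
    for data in payloads:
        # Start a new batch when adding this payload would exceed either limit.
        if batch and (len(batch) == MAX_RECORDS or size + len(data) > MAX_BYTES):
            yield batch
            batch, size = [], 0
        batch.append({"Data": data})
        size += len(data)
    if batch:
        yield batch


def send_to_retry_firehose(payloads, stream_name, firehose=None) -> int:
    """Send payloads to the retry stream; return how many records were rejected."""
    if firehose is None:
        import boto3  # imported lazily so the batching logic can run offline
        firehose = boto3.client("firehose")
    failed = 0
    for batch in batch_records(payloads):
        resp = firehose.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        failed += resp.get("FailedPutCount", 0)
    return failed
```

A non-zero return value means some records were not accepted by the retry stream. A fuller implementation would resend those records rather than just counting them.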

This solution is the recommended option. Note, however, that if Firehose and Splunk HEC are disconnected for a very prolonged period, the volume of re-ingested data on the retry Firehose may become significant and exceed a single Firehose’s capacity. This is unlikely in most cases, as disconnects (especially to Splunk Cloud) rarely last long. The example function provides a “timeout” mechanism for looping retries (a maximum of 9 attempts, which could take up to 18 hours); this prevents continuous looping when there is a total loss of connectivity to Splunk. In the event of a full timeout, the events are eventually written (not encoded) to S3 in the same way as in the first option.
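One way to picture the retry cap is a counter that travels with each event. The sketch below is an assumption about how such a cap could work, not a description of the sample function: it presumes the payload is a JSON object and uses a hypothetical `reingest_count` field to decide whether an event goes back to the retry Firehose or falls through to S3.

```python
import json

MAX_ATTEMPTS = 9  # cap mentioned in this post


def route_failed_event(payload: str) -> tuple:
    """Return ("firehose", updated_payload), or ("s3", payload) once the cap is hit."""
    event = json.loads(payload)
    # "reingest_count" is a hypothetical field used only for this illustration.
    attempts = int(event.get("reingest_count", 0))
    if attempts >= MAX_ATTEMPTS:
        return ("s3", payload)  # give up: write the decoded event to S3 instead
    event["reingest_count"] = attempts + 1
    return ("firehose", json.dumps(event))


destination, updated = route_failed_event('{"event": "user login"}')
print(destination, updated)
```

Carrying the count inside the event keeps the function stateless, at the cost of adding a field that then has to be handled or ignored on the Splunk side.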

Full details of the setup instructions and the source code for the sample Lambda functions can be found here: https://github.com/pauld-splunk/aws-splunk-firehose-error-reingest

Happy Splunking!

Posted by Paul Davies

Paul is an Architect in EMEA, responsible for working closely with Splunk customers and partners to help them deliver solutions relating to ingesting data or running Splunk in the cloud. Previously, Paul worked at Cisco as a BDM for big data solutions, an Enterprise Architect at Oracle, and Consultant at Hitachi.
