The Daily Telegraf: Getting Started with Telegraf and Splunk

In this blog post, we discuss using Telegraf as your core metrics collection platform with the Splunk App for Infrastructure (SAI) version 2.0, the latest version of Splunk’s infrastructure monitoring app that was recently announced at Splunk .conf19.

This blog post assumes you already have some familiarity with Telegraf and Splunk. We provide steps and examples to make sense of everything along the way, and there are also links to resources for more advanced workflows and considerations.

What is Telegraf?

Telegraf is a metrics collection engine that runs on virtually any platform. It can collect metrics from virtually any source, and more inputs are being added pretty regularly. Most importantly, as of version 1.8.0, Telegraf can send metrics directly to your Splunk platform deployment.

Telegraf is a modular system that allows you to define inputs, processors, aggregators, serializers, and outputs. Inputs, as you would expect, are the sources of metrics. Processors and aggregators are internal methods that allow you to rename things, build internal aggregations, and define almost as many other user-defined customizations as you want. Serializers and outputs are where the magic happens: they define the format of the output data, and where and how to send it.

Version 1.8.0 includes a splunkmetric serializer. Configure the serializer to take metrics data from the Telegraf internal structure and format everything so it’s compatible with Splunk’s metrics format. You define the serializer in each output stanza (for example, [[outputs.file]] or [[outputs.http]]), which lets you format your metrics in different ways for different destinations.

There are two ways to send metrics data from Telegraf to Splunk:

  • Use [[outputs.file]] and configure a Splunk Universal Forwarder
  • Use [[outputs.http]] and write to a Splunk HTTP Event Collector (HEC)
     

Configure the splunkmetric Serializer

Before we talk about configuring outputs, we need to configure the splunkmetric serializer to properly format the metrics data before sending everything to Splunk.

To enable the splunkmetric serializer on a supported output configuration, set the following: data_format = "splunkmetric"

This configuration tells Telegraf that all metrics data the output sends will be in a Splunk-compatible format. The data format looks like this:

{
    "_value": 0.6,
    "cpu": "cpu0",
    "dc": "mobile",
    "metric_name": "cpu.usage_user",
    "user": "ronnocol",
    "time": 1529708430
}
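
As a rough illustration of how a Telegraf metric maps onto this format, the sketch below flattens one measurement with several fields into one record per field, naming each record `<measurement>.<field>`. The function name `to_splunkmetric` and the `usage_idle` sample value are assumptions made for this example; this is not the serializer's actual code.

```python
import json

def to_splunkmetric(name, tags, fields, timestamp):
    """Illustrative only: emit one flat record per field of a metric."""
    records = []
    for field, value in fields.items():
        record = {"_value": value}
        record.update(tags)                        # tags become dimensions
        record["metric_name"] = f"{name}.{field}"  # e.g. cpu.usage_user
        record["time"] = timestamp
        records.append(record)
    return records

records = to_splunkmetric(
    name="cpu",
    tags={"cpu": "cpu0", "dc": "mobile", "user": "ronnocol"},
    fields={"usage_user": 0.6, "usage_idle": 98.1},  # hypothetical values
    timestamp=1529708430,
)
for record in records:
    print(json.dumps(record))
```

Each printed line is a standalone JSON object, which is the shape a file-monitoring forwarder expects to read.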


Specifying the data_format works great for sending everything to either a Splunk Universal Forwarder or Heavy Forwarder. If you decide to send data to Splunk by writing to the HEC, you need to wrap the event in a bit of metadata. To tell Telegraf you want to output a format that’s compatible with the HEC, set the following: splunkmetric_hec_routing = true

This setting modifies the JSON so that important fields such as time and host are in a wrapper around the event itself. The resulting data looks like this:

{
  "time": 1529708430,
  "event": "metric",
  "host": "patas-mbp",
  "fields": {
    "_value": 0.6,
    "cpu": "cpu0",
    "dc": "mobile",
    "metric_name": "cpu.usage_user",
    "user": "ronnocol"
  }
}
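
To make the difference between the two shapes concrete, here is a minimal sketch that wraps a flat record in an HEC-style envelope. The helper name `wrap_for_hec` is hypothetical, and passing the host in as an argument is an assumption for illustration; Telegraf does this work internally when splunkmetric_hec_routing is enabled.

```python
import json

def wrap_for_hec(flat_metric, host):
    """Illustrative only: move time and host into an envelope around the fields."""
    fields = {key: value for key, value in flat_metric.items() if key != "time"}
    return {
        "time": flat_metric["time"],
        "event": "metric",
        "host": host,
        "fields": fields,
    }

flat = {
    "_value": 0.6,
    "cpu": "cpu0",
    "dc": "mobile",
    "metric_name": "cpu.usage_user",
    "user": "ronnocol",
    "time": 1529708430,
}
wrapped = wrap_for_hec(flat, host="patas-mbp")
print(json.dumps(wrapped, indent=2))
```

The metric name, value, and dimensions all end up under "fields", while the timestamp and host sit at the top level where the HEC looks for them.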


Now that we know how to enable the splunkmetric serializer for either output, and what the outputs look like for each configuration, let’s configure the output.

Use [[outputs.file]] to Send Data to Splunk

This is the output that Telegraf uses to write metrics data to a file. Configure your Splunk Universal Forwarder to monitor that file. This works just like monitoring a system log file.

The output stanza looks something like this:

[[outputs.file]]
   ## Files to write to, "stdout" is a specially handled file.
   files = ["/tmp/metrics.out"]
   ## Data format to output.
   data_format = "splunkmetric"


You’ll need to make sure to associate the output file with a metrics source type wherever you’re indexing your Splunk data. Create a props.conf stanza on your Splunk Indexer or Splunk Universal Forwarder, wherever you send the metrics data to first:

[telegraf]
category = Metrics
description = Telegraf Metrics
pulldown_type = 1
DATETIME_CONFIG =
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = true
disabled = false
INDEXED_EXTRACTIONS = json
KV_MODE = none
TIMESTAMP_FIELDS = time


Next, set the source type to telegraf in the inputs.conf stanza on the Splunk Universal Forwarder. If you’re going to use this type of configuration, you should also set up an appropriate log rotation policy to prevent your disks from filling up.

You could have an inputs.conf stanza that looks like this to process the metrics data file from Telegraf:

[monitor:///tmp/metrics.out]
index = telegraf_metrics
sourcetype = telegraf
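
Before pointing a forwarder at the file, it can help to confirm that each line is a standalone JSON object with the keys shown in the sample record. The following is a minimal sketch; the helper name `check_metric_line` and the set of required keys are assumptions based on the sample record in this post, not an official schema.

```python
import json

REQUIRED_KEYS = {"metric_name", "_value", "time"}  # assumed from the sample record

def check_metric_line(line):
    """Return a list of problems found in one line of serializer output."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["not a JSON object"]
    missing = sorted(REQUIRED_KEYS - record.keys())
    return [f"missing key: {key}" for key in missing]

good = '{"_value": 0.6, "cpu": "cpu0", "metric_name": "cpu.usage_user", "time": 1529708430}'
bad = '{"cpu": "cpu0", "time": 1529708430}'
print(check_metric_line(good))
print(check_metric_line(bad))
```

To check a real file, you could call the helper on each line of /tmp/metrics.out and report any line that returns a non-empty list.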


Another way to use this output is to have it write to stdout when launching Telegraf as a scripted input. This is what is done at TiVo. We’ll describe this in more detail in a future blog post.

Use [[outputs.http]] to Send Data to Splunk

This is the output that Telegraf uses to write metrics data to HEC. Configuring Telegraf to output directly to the HEC is not quite as straightforward as using the file-based outputs configuration because you have to deal with authentication using HEC tokens. Fortunately, the Telegraf HTTP output gives us the tools we need to make this work.

Before starting down this road, you’re going to need a couple pieces of information from your Splunk administrator:

  • FQDN for the HEC endpoint
  • HEC token
     

This is what an [[outputs.http]] stanza should look like:

[[outputs.http]]
   url = "https://localhost:8088/services/collector"
   # insecure_skip_verify = false
   ## Data format to output.
   data_format = "splunkmetric"
   ## Provides time, index, source overrides for the HEC
   splunkmetric_hec_routing = true
   ## Additional HTTP headers
   [outputs.http.headers]
      Content-Type = "application/json"
      Authorization = "Splunk xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      X-Splunk-Request-Channel = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"


We removed most of the comments from the stanza so we could focus on the important parts. The comments in the default Telegraf configuration file are exhaustive: they cover HTTP basic authentication and should help you configure settings for any type of security requirement you may have. Configure the HEC endpoint in the url setting. Because we’re sending metrics events, make sure you’re not using the raw endpoint.

You’ll need to set the following: data_format = "splunkmetric"

And then enable the HEC format by setting this: splunkmetric_hec_routing = true

The other important info you got from your Splunk administrator is the HEC token. In the example above, you would replace the strings of x’s with that HEC token. You’ll need to set this information in the [outputs.http.headers] stanza so Telegraf knows to attach this data in the headers of every event it sends to the HEC.
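
If you want to see roughly what such a request looks like outside of Telegraf, the sketch below builds one with Python's standard library. It only constructs the request and does not send it. The helper name `build_hec_request` is hypothetical, the token is the same placeholder used above, and the event is a trimmed copy of the sample record.

```python
import json
import urllib.request

def build_hec_request(url, token, event):
    """Build (but do not send) a POST request for one HEC-wrapped metric."""
    body = json.dumps(event).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Splunk {token}",  # HEC token goes in this header
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

event = {
    "time": 1529708430,
    "event": "metric",
    "host": "patas-mbp",
    "fields": {"_value": 0.6, "metric_name": "cpu.usage_user"},
}
request = build_hec_request(
    "https://localhost:8088/services/collector",
    "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # placeholder, not a real token
    event,
)
print(request.get_method(), request.full_url)
print(request.get_header("Authorization"))
```

Passing the request to urllib.request.urlopen would actually send it, which requires a reachable HEC endpoint and a valid token. This can be a handy way to test a token before rolling out a Telegraf configuration.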

One of the nice things about outputting directly to the HEC is you can now perform metrics collection on systems where you don’t already or can’t run a Splunk Universal Forwarder, such as systems Splunk doesn’t support or small form factor computers like a Raspberry Pi.

Query Your New Metrics

Now that you have your data flowing into Splunk with either the HEC or a Splunk Universal Forwarder, you’ll want to be able to turn those metrics into usable eye candy.

When Splunk introduced the metrics store, it also added two SPL commands to help you access the metrics data: mstats and mcatalog. While I don’t plan on making this post an exhaustive lesson on these commands, this example shows that drawing a CPU usage graph takes only a little SPL:

| mstats sum(cpu.usage_idle) as usage_idle, sum(cpu.usage_iowait) as usage_iowait, sum(cpu.usage_irq) as usage_irq, sum(cpu.usage_nice) as usage_nice, sum(cpu.usage_softirq) as usage_softirq, sum(cpu.usage_steal) as usage_steal,  sum(cpu.usage_system) as usage_system, sum(cpu.usage_user) as usage_user WHERE cpu!="cpu-total" AND (index="telegraf" OR index="metrics") host=ronnocol.tivo.com span=30s
| timechart minspan=30s bins=2000 partial=f avg(usage_idle) as Idle, avg(usage_nice) as Nice, avg(usage_user) as User, avg(usage_irq) as Irq, avg(usage_softirq) as SoftIrq, avg(usage_iowait) as IoWait, avg(usage_steal) as Steal, avg(usage_system) as System
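
The mstats clause above follows a regular pattern: one sum(...) per CPU field, then the filters. If you generate searches for many hosts, a small helper can assemble that string. This is only a sketch; the function name `build_cpu_mstats` is hypothetical, and the host and index names are taken from the example above.

```python
CPU_FIELDS = ["idle", "iowait", "irq", "nice", "softirq", "steal", "system", "user"]

def build_cpu_mstats(host, indexes, span="30s"):
    """Assemble an mstats search string for the per-CPU usage metrics."""
    aggregations = ", ".join(
        f"sum(cpu.usage_{name}) as usage_{name}" for name in CPU_FIELDS
    )
    index_filter = " OR ".join(f'index="{index}"' for index in indexes)
    return (
        f"| mstats {aggregations} "
        f'WHERE cpu!="cpu-total" AND ({index_filter}) host={host} span={span}'
    )

query = build_cpu_mstats("ronnocol.tivo.com", ["telegraf", "metrics"])
print(query)
```

The generated string covers only the mstats stage; the timechart stage would be appended the same way.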


Monitor Metrics with Splunk App for Infrastructure (SAI)

The newest version of SAI, version 2.0.0, which Splunk announced at .conf19, includes Telegraf-specific entity discovery and dashboards. Telegraf is treated the same as other metrics collectors (e.g. collectd) in SAI. Entities are auto-discovered, appropriate graphs are drawn in the Entity Overview, and potentially interesting graphs are pre-populated in the Analysis Workspace. You can set alerts, groups, etc. with your Telegraf-based nodes just like you can with any of the other SAI-supported collection engines.

There’s only one modification that’s required to monitor Telegraf metrics with SAI: prefix all of the Telegraf metric names with "telegraf."

Set the following in telegraf.conf:

[[processors.override]]
  name_prefix = "telegraf."


This allows SAI to know that the source of the metrics is Telegraf and to configure entity discovery and out-of-the-box dashboards accordingly.
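
One practical consequence is that the metric names you search for change as well. The sketch below shows the effect on a single name; both helper functions are hypothetical and exist only to illustrate the renaming, not to mirror Telegraf's internals.

```python
def add_name_prefix(measurement, prefix="telegraf."):
    """Illustrative only: prepend a prefix to a measurement name."""
    return prefix + measurement

def metric_name(measurement, field):
    """Metric name as it appears in the Splunk-formatted record."""
    return f"{measurement}.{field}"

prefixed = add_name_prefix("cpu")
print(metric_name(prefixed, "usage_user"))  # telegraf.cpu.usage_user
```

Any searches or dashboards written against the unprefixed names would need to be updated to use the prefixed ones.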

That’s it, that’s all there is to it. Now that Telegraf is prefixing all of the metric names with telegraf., your devices will show up in the SAI entities list. Those two lines give you SAI’s prebuilt charts for your Telegraf hosts.


Sample Configuration

Here’s a sample config in use at TiVo to collect machine metrics with Telegraf and send them to Splunk for monitoring in SAI:

[global_tags]
 telegraf-profile = "sai-default"
[agent]
 interval = "30s"
 round_interval = true
 metric_batch_size = 1000
 metric_buffer_limit = 10000
 collection_jitter = "0s"
 flush_interval = "10s"
 flush_jitter = "0s"
 precision = ""
 debug = false
 quiet = true
 logfile = ""
 hostname = ""
 omit_hostname = false
[[outputs.file]]
 files = ["stdout"]
 data_format = "splunkmetric"
[[processors.override]]
 name_prefix = "telegraf."
[[inputs.cpu]]
 percpu = true
 totalcpu = true
 collect_cpu_time = false
 report_active = false
 fieldpass = ["usage_idle","usage_iowait","usage_irq","usage_nice","usage_softirq","usage_steal","usage_system","usage_user"]
[[inputs.disk]]
 ignore_fs = ["tmpfs", "devtmpfs", "devfs", "overlay", "aufs", "squashfs"]
 fielddrop = ["inodes*"]
[[inputs.diskio]]
[[inputs.kernel]]
 fielddrop = ["boot_time"]
[[inputs.mem]]
 fielddrop = ["high*","low*","huge_page*","commit*","dirty","inactive","wired"]
[[inputs.processes]]
[[inputs.swap]]
[[inputs.system]]
[[inputs.kernel_vmstat]]
 fieldpass = ["pgpgin", "pgpgout", "pswpin", "pswpout", "pgfault"]
[[inputs.net]]
 ignore_protocol_stats = true
[[inputs.netstat]]


Conclusion

Telegraf is a highly configurable metrics collector that runs on a variety of platforms, collects metrics from a variety of sources, and allows you to use that data in Splunk. With the release of SAI 2.0.0, you can even get all of the great functionality that SAI provides for every other integration on your Telegraf nodes.

For further information, check out the following resources:

Help and Support

The Telegraf integration with Splunk App for Infrastructure is supported as part of the open source Splunk metrics serializer project. For questions regarding setup and management of Telegraf for sending data to Splunk, please see the metrics serializer section of the Telegraf project.

You can also ask any questions in the splunk-usergroups Slack workspace. Information about signing up can be found here. Look for the #it-infra-monitoring channel.


This post was written primarily by Lance O'Connor, Principal Architect at TiVo, and Nick Tankersley, Principal Product Manager at Splunk, tagged along for the ride.

Posted by Nick Tankersley

Nick Tankersley is a riddle inside a mystery wrapped in an enigma surrounded by a sudoku that didn't look that hard but has taken most of the flight even though the person next to you finished it in, like, 10 minutes. In addition to that nonsense he is also a Product Manager at Splunk responsible for IT solutions for Infrastructure, Cloud and New Stack Monitoring.
