Splunk Metrics via Telegraf

The Splunk Metrics Store offers users a highly scalable, blazingly fast way to ingest and search metrics across their environments. There are many ways of generating metrics and sending them to Splunk, including both the collectd and statsd agents, but this post will focus on Telegraf as a means to achieve this. For more information on the Splunk Metrics Store and why you should be using it, check out "Metrics to the Max! Dramatic Performance Improvements for Monitoring and Alerting on Metrics Data."

Telegraf is a widely used agent, written in Go, for collecting, processing, aggregating, and writing metrics. It's awesome, entirely plugin-driven, and supports inputs for everything from SQL Server to Minecraft. It's also platform agnostic, capable of running on most common operating systems. This post will focus on running Telegraf on *nix variants; a follow-up blog will cover running it on Windows.

The following is based on the amazing work from the team at TiVo, especially Lance O'Connor. See this page on GitHub for much more information.

The design goals for Telegraf are to have a minimal memory footprint with a plugin system so that developers in the community can easily add support for collecting metrics.

Telegraf is plugin-driven and has the concept of four distinct plugin types (a minimal config showing all four together follows the list):

  1. Input Plugins collect metrics from the system, services, or 3rd party APIs
  2. Processor Plugins transform, decorate, and/or filter metrics
  3. Aggregator Plugins create aggregate metrics (e.g. mean, min, max, quantiles, etc.)
  4. Output Plugins write metrics to various destinations
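A minimal telegraf.conf sketch showing all four plugin types in one place; the basicstats aggregator and the /tmp/metrics.out path are illustrative choices here, not requirements:

# Input: collect CPU usage
[[inputs.cpu]]
   percpu = false
   totalcpu = true

# Processor: rename a field as it passes through
[[processors.rename]]
   [[processors.rename.replace]]
      field = "usage_user"
      dest = "user"

# Aggregator: emit mean/min/max of each metric over a 30s period
[[aggregators.basicstats]]
   period = "30s"
   stats = ["mean", "min", "max"]

# Output: write metrics to a file in Splunk's metrics format
[[outputs.file]]
   files = ["/tmp/metrics.out"]
   data_format = "splunkmetric"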

There are many benefits to Telegraf: plugins are integrated into the core (so there are no competing plugins for the same technology), dimensions/tags are supported natively, Docker support is good, and the community around it is active.

To install Telegraf on a *nix host, follow these simple steps:

yum install go
yum install dep
yum install git
go get -d github.com/influxdata/telegraf    # fetch the source (assumes GOPATH defaults to $HOME/go)
cd "$HOME/go/src/github.com/influxdata/telegraf"
make

Then, from "$HOME/go/src/github.com/influxdata/telegraf":

./telegraf config > telegraf.conf            # generate a config
./telegraf --config telegraf.conf --test     # test it out
./telegraf --config telegraf.conf            # run it

# Run with only the cpu and mem inputs and the http output:
./telegraf --config splunk.conf --input-filter cpu:mem --output-filter http

# Run all configured inputs, filtering only the output:
./telegraf --config splunk.conf --output-filter http

The Splunk Serializer!

This serializer formats metric data so it can be consumed by a Splunk metrics index. It can be used to write to a file via the file output, or to send metrics to an HTTP Event Collector (HEC) endpoint using the standard Telegraf HTTP output.

If you're using the HTTP output, this serializer knows how to batch the metrics, so you don't end up with one HTTP POST per metric.
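A batched POST body is simply concatenated HEC events, one JSON object per metric; the values below are illustrative:

{"time": 1529708430, "event": "metric", "host": "patas-mbp", "fields": {"metric_name": "cpu.usage_user", "_value": 0.6, "cpu": "cpu0"}}
{"time": 1529708430, "event": "metric", "host": "patas-mbp", "fields": {"metric_name": "cpu.usage_system", "_value": 0.3, "cpu": "cpu0"}}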

An example config to shoot metrics to the HTTP Event Collector would look like this:

[[outputs.http]]
   ## URL is the address to send metrics to
   url = "https://x.x.x.x:8088/services/collector"
   ## Timeout for HTTP message
   # timeout = "5s"
   ## HTTP method, one of: "POST" or "PUT"
   # method = "POST"
   ## HTTP Basic Auth credentials
   # username = "username"
   # password = "pa$$word"
   ## Optional TLS Config
   # tls_ca = "/etc/telegraf/ca.pem"
   # tls_cert = "/etc/telegraf/cert.pem"
   # tls_key = "/etc/telegraf/key.pem"
   ## Use TLS but skip chain & host verification
   # insecure_skip_verify = false
   ## Data format to output.
   ## Each data format has its own unique set of configuration options, read
   ## more about them here:
   ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
   data_format = "splunkmetric"
   ## Provides time, index, source overrides for the HEC
   splunkmetric_hec_routing = true
   ## Additional HTTP headers
   [outputs.http.headers]
      # Should be set manually to "application/json" for json data_format
      Content-Type = "application/json"
      Authorization = "Splunk f8xxxxd3-4xx1-4xx2-aeda-86xxxxxb36c"
      X-Splunk-Request-Channel = "f8xxxxx3-4xx1-4xx2-aeda-8xxxxxxx6c"
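Before pointing Telegraf at HEC, it's worth verifying the endpoint and token by hand. A quick sketch with curl (the address and token are placeholders, and the token's target index must be a metrics index):

curl -k https://x.x.x.x:8088/services/collector \
   -H "Authorization: Splunk <your-hec-token>" \
   -d '{"event": "metric", "fields": {"metric_name": "test.metric", "_value": 1}}'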

Then, customize the bits that differ from the global Telegraf settings, such as the index you’d like a given metric sent to:

[[inputs.cpu]]
   percpu = false
   totalcpu = true
   [inputs.cpu.tags]
      index = "cpu_metrics"

This setup will result in metrics that look like:

{
  "time": 1529708430,
  "event": "metric",
  "host": "patas-mbp",
  "fields": {
    "_value": 0.6,
    "cpu": "cpu0",
    "dc": "mobile",
    "metric_name": "cpu.usage_user",
    "user": "ronnocol"
  }
}

In this example, cpu, dc, and user are dimensions of the single metric cpu.usage_user.
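Once indexed, you can split on those dimensions with mstats; a sketch, assuming the cpu_metrics index from the example above:

| mstats avg(_value) WHERE index=cpu_metrics AND metric_name="cpu.usage_user" span=1m BY cpu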

An alternative to using HEC is to output Telegraf metrics to file, using an output configuration such as:

# Send telegraf metrics to file(s)
[[outputs.file]]
   ## Files to write to, "stdout" is a specially handled file.
   files = ["/tmp/metrics.out"]
   ## Data format to output.
   ## Each data format has its own unique set of configuration options, read
   ## more about them here:
   ## https://github.com/influxdata/telegraf/blob/master/docs/DATA_FORMATS_OUTPUT.md
   data_format = "splunkmetric"
   splunkmetric_hec_routing = false

Then, using the Splunk Universal Forwarder, you'll be able to tail this file and send the metrics directly to an indexer. A sample event using this configuration looks like this:

{
    "_value": 0.6,
    "cpu": "cpu0",
    "dc": "mobile",
    "metric_name": "cpu.usage_user",
    "user": "ronnocol",
    "time": 1529708430
}
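To pick the file up, a Universal Forwarder monitor stanza along these lines would work; the telegraf_metrics index name is an assumption and must exist as a metrics index on your indexers:

# inputs.conf on the Universal Forwarder (index name is hypothetical)
[monitor:///tmp/metrics.out]
sourcetype = telegraf
index = telegraf_metrics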

Use this example props.conf to format the metrics correctly:

[telegraf]
category = Metrics
description = Telegraf Metrics
pulldown_type = 1
DATETIME_CONFIG =
NO_BINARY_CHECK = true
SHOULD_LINEMERGE = true
disabled = false
INDEXED_EXTRACTIONS = json
KV_MODE = none
TIMESTAMP_FIELDS = time
TIME_FORMAT = %s.%3N
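With the forwarder and props in place, a quick way to confirm metrics are arriving is to list the metric names landing in the index (again assuming the hypothetical telegraf_metrics index):

| mcatalog values(metric_name) WHERE index=telegraf_metrics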

If you are looking to leverage Telegraf with the Splunk App for Infrastructure, the following updates to your telegraf.conf file will make your system metrics 100% compatible with the app.

[global_tags]
[agent]
  interval = "10s"
  round_interval = true
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  collection_jitter = "0s"
  flush_interval = "10s"
  flush_jitter = "0s"
  precision = ""
  debug = false
  quiet = false
  logfile = ""
  hostname = ""
  omit_hostname = false
[[outputs.file]]
  files = ["stdout", "/tmp/metrics.out"]
  data_format = "splunkmetric"
[[processors.rename]]
  [[processors.rename.replace]]
    field = "usage_idle"
    dest = "idle"
  [[processors.rename.replace]]
    field = "usage_interrupt"
    dest = "interrupt"
  [[processors.rename.replace]]
    field = "usage_nice"
    dest = "nice"
  [[processors.rename.replace]]
    field = "usage_softirq"
    dest = "softirq"
  [[processors.rename.replace]]
    field = "usage_steal"
    dest = "steal"
  [[processors.rename.replace]]
    field = "usage_system"
    dest = "system"
  [[processors.rename.replace]]
    field = "usage_user"
    dest = "user"
  [[processors.rename.replace]]
    field = "usage_wait"
    dest = "wait"
  [[processors.rename.replace]]
    field = "usage_guest"
    dest = "guest"
  [[processors.rename.replace]]
    field = "usage_guest_nice"
    dest = "guest_nice"
  [[processors.rename.replace]]
    field = "usage_iowait"
    dest = "wait"
  [[processors.rename.replace]]
    field = "usage_irq"
    dest = "interrupt"
  [[processors.rename.replace]]
    field = "io_time"
    dest = "io_time.io_time"
  [[processors.rename.replace]]
    field = "weighted_io_time"
    dest = "io_time.weighted_io_time"
  [[processors.rename.replace]]
    field = "read_time"
    dest = "time.read"
  [[processors.rename.replace]]
    field = "write_time"
    dest = "time.wrie"
  [[processors.rename.replace]]
    field = "reads"
    dest = "ops.read"
  [[processors.rename.replace]]
    field = "write"
    dest = "ops.write"
  [[processors.rename.replace]]
    field = "iops_in_progress"
    dest = "pending_operations"
  [[processors.rename.replace]]
    field = "read_bytes"
    dest = "octets.read"
  [[processors.rename.replace]]
    field = "write_bytes"
    dest = "octets.write"
  [[processors.rename.replace]]
    field = "bytes_recv"
    dest = "octets.rx"
  [[processors.rename.replace]]
    field = "bytes_sent"
    dest = "octets.tx"
  [[processors.rename.replace]]
    field = "drop_in"
    dest = "dropped.rx"
  [[processors.rename.replace]]
    field = "drop_out"
    dest = "dropped.tx"
  [[processors.rename.replace]]
    field = "err_in"
    dest = "errors.rx"
  [[processors.rename.replace]]
    field = "err_out"
    dest = "errors.tx"
  [[processors.rename.replace]]
    field = "packets_recv"
    dest = "packets.rx"
  [[processors.rename.replace]]
    field = "packets_sent"
    dest = "packets.tx"
  [[processors.rename.replace]]
    field = "load1"
    dest = "shortterm"
  [[processors.rename.replace]]
    field = "load5"
    dest = "midterm"
  [[processors.rename.replace]]
    field = "load15"
    dest = "longterm"
[[inputs.cpu]]
  percpu = true
[[inputs.disk]]
  name_override = "df"
[[inputs.diskio]]
  name_override = "disk"
[[inputs.mem]]
  name_override="memory"
[[inputs.system]]
  name_override="load"

Check out the Splunk App for Infrastructure, and a shout-out to Splunker Nick Tankersley for providing the renames.

Of course, you should also check out the new logs to metrics interface in Splunk Enterprise 7.2, as well as some of the other new capabilities to search metrics via the Metrics Workbench!

Posted by Simon O'Brien

I am a passionate Splunker, traveller, family man, cook, basketballer, social advocate and security professional. I have the best job in the world, and live in the best place in the world.