
Why the SignalFx Metric Proxy is Written in Go

 

Hi, my name is Jack Lindamood and I’m an engineer at SignalFx working on our open source integrations and ingest pipeline. SignalFx is an advanced monitoring platform for modern applications, built on streaming analytics and alerting for time series data such as application metrics (“How many requests did my application get?”) and system-level metrics (“How much network traffic is my Linux machine using?”). The data we consume from our users is both high-volume and high-resolution.

An early capability we wanted to offer was letting users add SignalFx to their existing metrics infrastructure with as little friction and disruption to their workflows as possible. To accomplish this, we created a metric proxy in Go.

What is the Metric Proxy?

[Figure 1: Metric Proxy in Go]

The metricproxy is an application that users can install in their environment to consume and direct the flow of metrics. It can pretend to be a metrics endpoint for Graphite, for example, and mirror metrics to both an existing Graphite installation and SignalFx. This allows Graphite users to try our product with minimal modification to their existing code.

The proxy is split into “readers” that consume data and “writers” that write out data. We don’t want a single misbehaving writer to impact the others. In the Graphite example above, if the SignalFx writer has trouble reaching us through the user’s firewall, we wouldn’t want that to impact the flow of metrics to Graphite.

Besides allowing users to keep their existing metrics infrastructure and pipeline, the proxy also serves as a way to batch metrics for transport. Metrics are generally reported from many (hundreds, thousands, even tens of thousands of) sources simultaneously, and some users use the proxy to avoid having each source establish its own connection to SignalFx’s ingest API to ship data. We want to strike the right balance between writing metrics out as fast as possible and bulk uploading them, so that writers make as few HTTP requests as possible.

Why Go? Channels!

The nature of this problem made it a great fit for Go, thanks to channels. As explained in Go By Example:

Channels are the pipes that connect concurrent goroutines. You can send values into channels from one goroutine and receive those values into another goroutine.

In the metric proxy, every metric becomes a message on a channel for each destination. A channel’s bounded size serves as a natural way to limit memory explosion due to misbehaving writers. Each writer uses goroutines to handle requests independently, so writers don’t impact each other. We use the default case of a select statement on a channel as an easy way to drain the channel in bulk without waiting for a full buffer, as shown in this example code.

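package main

import (
    "fmt"
    "math/rand"
    "time"
)

type Metric string

// BulkUploadMessages batches up to 10 messages from ch and sends
// them to upload(). Rather than block until all 10 messages arrive, it
// calls upload() with however many Metrics it has as soon as ch is empty.
func BulkUploadMessages(ch <-chan Metric) {
    const maximumItemsPerPost = 10
    bulkPost := make([]Metric, 0, maximumItemsPerPost)
    // Block until at least one metric is available.
    for metric := range ch {
        // Start a new batch, reusing the same backing array.
        bulkPost = append(bulkPost[:0], metric)
    outer:
        for len(bulkPost) < maximumItemsPerPost {
            select {
            case metric, ok := <-ch:
                if !ok {
                    break outer // channel closed: send what we have
                }
                bulkPost = append(bulkPost, metric)
            default:
                break outer // channel empty: send what we have
            }
        }
        upload(bulkPost)
    }
}

// upload is an illustrative stand-in for the real HTTP post.
func upload(metrics []Metric) {
    fmt.Printf("Uploading %d metrics\n", len(metrics))
    time.Sleep(time.Duration(rand.Intn(10)) * time.Millisecond)
}

// main is an illustrative driver: it queues 100 metrics and waits
// for BulkUploadMessages to drain them.
func main() {
    ch := make(chan Metric, 100)
    done := make(chan struct{})
    go func() {
        BulkUploadMessages(ch)
        close(done)
    }()
    for i := 0; i < 100; i++ {
        ch <- Metric(fmt.Sprintf("metric-%d", i))
    }
    close(ch)
    <-done
}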
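The idea of one bounded channel per destination, described before the example, might be sketched as follows. This is a minimal illustration and not the proxy's actual code: the Writer type, NewWriter, Forward, and the choice to drop a metric when a queue is full are all assumptions made for the sketch.

```go
package main

import "fmt"

// Metric mirrors the type used in the example above.
type Metric string

// Writer is a hypothetical destination with its own bounded queue.
type Writer struct {
	Name    string
	Queue   chan Metric
	Dropped int // metrics discarded because Queue was full
}

// NewWriter creates a writer whose queue holds at most size metrics.
func NewWriter(name string, size int) *Writer {
	return &Writer{Name: name, Queue: make(chan Metric, size)}
}

// Forward offers m to every writer without blocking. A writer whose
// queue is full loses the metric (an assumed policy) rather than
// stalling the reader or the other writers.
func Forward(m Metric, writers []*Writer) {
	for _, w := range writers {
		select {
		case w.Queue <- m:
		default:
			w.Dropped++
		}
	}
}

func main() {
	graphite := NewWriter("graphite", 100)
	signalfx := NewWriter("signalfx", 2) // pretend nothing drains this queue
	writers := []*Writer{graphite, signalfx}
	for i := 0; i < 5; i++ {
		Forward(Metric(fmt.Sprintf("requests.count %d", i)), writers)
	}
	fmt.Println(len(graphite.Queue), graphite.Dropped) // prints "5 0"
	fmt.Println(len(signalfx.Queue), signalfx.Dropped) // prints "2 3"
}
```

Because each queue has a fixed capacity, a stuck writer can hold at most that many metrics in memory, and the healthy writer still receives everything.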

 

Some other benefits to using Go:

  • Because we vendor (copy) dependencies in Go, everything needed to compile the proxy is hosted in a single location for users to install themselves.
  • Since Go can compile to static binaries, an executable for the proxy can be easily shared.
  • When we build Docker containers for our production environment, Go’s static builds give us lean containers FROM scratch.

The Result

The proxy has been reliably deployed with users for over a year. After the success our customers had with the proxy, we took the same code and started using it inside SignalFx to replace the Java code that routed and processed metrics received by our ingest API. Today, all metrics coming into SignalFx (many billions a day) are processed by Go. The initial rewrite into Go had our instances running at 74% idle CPU, compared with 43% under Java. After tuning memory allocations, we are now running at 83% idle CPU on the same instances with significantly increased load.

[Figure 2: Metric Proxy in Go]

Advice

Early implementations of our code did not use vendoring, which made deployments fragile due to the lack of repeatable builds. Implementing vendoring in your binary repositories will save future headaches.

Go’s built-in pprof package is very easy to use and a great place to start when thinking about code improvements. Especially useful for us was the -alloc_objects option of pprof, which let us dissect where memory allocations were occurring. Most allocation improvements involved reusing bytes.Buffer objects during string processing and using sync.Pool in a hot spot that created a Protocol Buffer during message encoding.
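As a rough illustration of the buffer reuse described above, here is a minimal sketch that combines sync.Pool with bytes.Buffer. The FormatLine function and the Graphite-style line format are hypothetical and chosen only for the example; this is not the proxy's own encoding code.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
	"sync"
)

// bufPool hands out reusable buffers so the hot path does not
// allocate a new bytes.Buffer for every metric line.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// FormatLine renders "name value timestamp" using a pooled buffer.
// The function and its line format are illustrative assumptions.
func FormatLine(name string, value, ts int64) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold old contents
	defer bufPool.Put(buf)

	buf.WriteString(name)
	buf.WriteByte(' ')
	buf.WriteString(strconv.FormatInt(value, 10))
	buf.WriteByte(' ')
	buf.WriteString(strconv.FormatInt(ts, 10))
	return buf.String() // copies the bytes, so returning buf to the pool is safe
}

func main() {
	fmt.Println(FormatLine("requests.count", 42, 1435000000))
	// prints "requests.count 42 1435000000"
}
```

The returned string is still one allocation per call; the saving is in not growing a fresh buffer each time.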

Taking advantage of static analysis tools in our build pipeline has helped maintain code quality as the code base has matured. We currently use go fmt, go vet, golint, and gocyclo. Developers can commit a list of ignores for golint and gocyclo to the codebase for explicit cases where the tool’s advice doesn’t make sense.

When running unit tests, we’ve discovered otherwise hidden race conditions by running our tests with go test -race -cpu 1,4. The same tests often execute differently on a single core than on multiple cores, and running both ways bubbles up interesting race conditions.
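To make the kind of bug concrete, here is a small sketch of shared state that the race detector would flag if the mutex were removed. The Counter type and IncrementConcurrently helper are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Counter is a hypothetical shared counter. Without mu, concurrent
// calls to Inc would be a data race that the -race flag reports.
type Counter struct {
	mu sync.Mutex
	n  int
}

// Inc adds one to the counter under the lock.
func (c *Counter) Inc() {
	c.mu.Lock()
	c.n++
	c.mu.Unlock()
}

// Value returns the current count under the lock.
func (c *Counter) Value() int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// IncrementConcurrently starts workers goroutines that each call
// Inc perWorker times, and waits for all of them to finish.
func IncrementConcurrently(c *Counter, workers, perWorker int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				c.Inc()
			}
		}()
	}
	wg.Wait()
}

func main() {
	c := &Counter{}
	IncrementConcurrently(c, 4, 1000)
	fmt.Println(c.Value()) // prints "4000"
}
```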

And finally, to monitor Go we use the built-in runtime package to collect memory information and goroutine counts. The most important Go stats that we’ve found to monitor during a code push are runtime.NumGoroutine(), MemStats.Alloc, MemStats.TotalAlloc, and MemStats.PauseTotalNs.
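Collecting those four values takes only a few lines with the runtime package. The sketch below is illustrative: the GoStats struct and CollectGoStats function are hypothetical names, and how the values are reported to a monitoring system is left out.

```go
package main

import (
	"fmt"
	"runtime"
)

// GoStats holds the runtime values named above. The struct is a
// hypothetical container for this example.
type GoStats struct {
	NumGoroutine int    // goroutines that currently exist
	Alloc        uint64 // bytes of allocated heap objects
	TotalAlloc   uint64 // cumulative bytes allocated for heap objects
	PauseTotalNs uint64 // cumulative GC stop-the-world pause time, in ns
}

// CollectGoStats takes one snapshot of the runtime's counters.
func CollectGoStats() GoStats {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return GoStats{
		NumGoroutine: runtime.NumGoroutine(),
		Alloc:        m.Alloc,
		TotalAlloc:   m.TotalAlloc,
		PauseTotalNs: m.PauseTotalNs,
	}
}

func main() {
	s := CollectGoStats()
	fmt.Printf("goroutines=%d alloc=%d total_alloc=%d gc_pause_ns=%d\n",
		s.NumGoroutine, s.Alloc, s.TotalAlloc, s.PauseTotalNs)
}
```

Since runtime.ReadMemStats briefly stops the world, a snapshot like this is better taken on a timer than on every request.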

[Figure 3: Metric Proxy in Go]

 

Interested in working on these kinds of problems? We’re hiring engineers for every part of SignalFx! 

 

 
