IT

A Pattern for Optimizing Go

As a follow up to my post on writing the SignalFx metric proxy in Go , this blog post will look at how we go about optimizing Go code through a real life example with our production ingest service. This is a generalizable pattern that we follow to optimize various parts of our codebase.

SignalFx is an advanced monitoring platform built for modern applications, built on streaming analytics and alerting for time series data such as application metrics (“How many requests did my application get?”) and system level metrics (“How much network traffic is my Linux machine using?”). The data we consume from our users is both high-volume and high-resolution, going through our ingest service before hitting other parts of SignalFx.

gopherswrench

Image credit: From the Go docs

Step 1: Enable pprof

What is pprof?
Pprof is the standard way to profile go applications that comes built in to go. The pprof package can expose via HTTP both memory and CPU usage across your application, as well as the running command line and contention information.

How to enable pprof:
You can enable pprof by adding the line import _ “net/http/pprof” into your application and starting the default http server.  Another option, which is what we do at SignalFx in order to explicitly control where the handler is attached, is to enable specific pprof endpoints explicitly in your code with something like the following.

import "github.com/gorilla/mux"
import "net/http/pprof"
var handler *mux.Router
// ...
handler.PathPrefix("/debug/pprof/profile").HandlerFunc(pprof.Profile)
handler.PathPrefix("/debug/pprof/heap").HandlerFunc(pprof.Heap)


Step 2: Find Something to Optimize 

What to execute:

curl http://ingest58:6060/debug/pprof/profile > /tmp/ingest.profile
go tool pprof ingest /tmp/ingest.profile
(pprof) top7


What this does:
Go includes a local pprof tool to visualize the output of pprof. The endpoint /debug/pprof/profile will collect a 30 second profile. What I do above is first save that output to a local file, then run pprof on the saved output. Note that the last parameter of pprof can be a URL directly, rather than a local file. The command top7 will show the 7 functions that consume the most CPU.

Result:

12910ms of 24020ms total (53.75%)
 Dropped 481 nodes (cum <= 120.10ms)
 Showing top 30 nodes out of 275 (cum >= 160ms)
      flat  flat%   sum%        cum   cum%
    1110ms  4.62%  4.62%     2360ms  9.83%  runtime.mallocgc
     940ms  3.91%  8.53%     1450ms  6.04%  runtime.scanobject
     830ms  3.46% 11.99%      830ms  3.46%  runtime.futex
     800ms  3.33% 15.32%      800ms  3.33%  runtime.mSpan_Sweep.func1
     750ms  3.12% 18.44%      750ms  3.12%  runtime.cmpbody
     720ms  3.00% 21.44%      720ms  3.00%  runtime.xchg
     580ms  2.41% 23.86%      580ms  2.41%  runtime._ExternalCode


What this means:
None of these functions are directly my own code. However, mallocgc, scanobject, and mSpan_Sweep all point to CPU usage during garbage collection. I could go deeper into each of these functions, but rather than optimize Go’s garbage collector itself, it’s a better use of time to instead optimize my use of Go’s garbage collector. In this case, creation of objects on the heap.

Step 3: Investigate the Causes of GC

What to execute:

curl http://ingest58:6060/debug/pprof/heap > /tmp/heap.profile
go tool pprof -alloc_objects /tmp/ingest /tmp/heap.profile
(pprof) top3


What this does:
Note that the endpoint I download is a similar URL to the profiler above, but instead ends in /heap. This will give me a summary of heap usage on the machine. I again save it to a file to later compare. The parameter -alloc_objects will visualize the number of allocated objects across the lifetime of the application.

Result:

4964437929 of 7534904879 total (65.89%)
 Dropped 541 nodes (cum <= 37674524)
 Showing top 10 nodes out of 133 (cum >= 321426216)
      flat  flat%   sum%        cum   cum%
 853721355 11.33% 11.33%  859078341 11.40%  github.com/signalfuse/sfxgo/ingest/tsidcache/tsiddiskcache.(*DiskKey).EncodeOld
 702927011  9.33% 20.66%  702927011  9.33%  reflect.unsafe_New
 624715067  8.29% 28.95%  624715067  8.29%  github.com/signalfuse/sfxgo/ingest/bus/rawbus.(*Partitioner).Partition


What this means:
This means 11.33% of my object allocations happened during the function EncodeOld of the object DiskKey, which was expected behavior. However, it was very unexpected that Partition() would take 8.29% of my total allocations since it should just be doing basic math, so I will focus my search there.

Step 4: Discover Why Partitioner is Using so Much Memory

What to execute:

(pprof) list Partitioner.*Partition


What this does:
This prints the source lines of the function I’m interested in, along with counts inside the function of which statements caused heap allocations. This is one of many commands you can execute inside pprof. Another very useful one is peek to see callers and callees of functions. You can get a full list by typing “help” and experimenting with what you see.

Result:

Total: 11323262665
 ROUTINE ======================== github.com/signalfuse/sfxgo/ingest/bus/rawbus.(*Partitioner).Partition in /opt/jenkins/workspace/ingest/gopath/src/github.com/signalfuse/sfxgo/ingest/bus/rawbus/partitioner.go
 927405893  927405893 (flat, cum)  8.19% of Total
         .          .     64: if ringSize == 0 {
         .          .     65: return 0, ErrUnsetRingSize
         .          .     66: }
         .          .     67: var b [8]byte
         .          .     68: binary.LittleEndian.PutUint64(b[:], uint64(message.Key.(*partitionPickingKey).tsid))
 239971917  239971917     69: logherd.Debug2(log, "key", message.Key, "numP", numPartitions, "Partitioning")
         .          .     70: murmHash := murmur3.Sum32(b[:])
         .          .     71:
         .          .     72: // 34026 => 66
         .          .     73: setBits := uint(16)
         .          .     74: setSize := uint32(1 << setBits)
         .          .     75: shortHash := murmHash & (setSize - 1)
         .          .     76: smallIndex := int32(shortHash) * int32(k.ringSize) / int32(setSize)
 687433976  687433976     77: logherd.Debug3(log, "smallIndex", smallIndex, "murmHash", murmHash, "shortHash", shortHash, "Sending to partition")
         .          .     78: return smallIndex, nil
         .          .     79:}
         .          .     80:


What this means:
This shows calling debug logging is causing variables to escape from the stack to the heap.  Since this debug logging isn’t strictly needed, I could just remove those lines. But first, let’s confirm this hypothesis. The function “logherd.Debug2” is a wrapper that looks like the following and exists to remove the heap overhead of logrus WithField objects in the event debug isn’t even enabled.

// Debug2 to logger 2 key/value pairs and message.  Intended to save the mem alloc that WithField creates
func Debug2(l *logrus.Logger, key string, val interface{}, key2 string, val2 interface{}, msg string) {
    if l.Level >= logrus.DebugLevel {
        l.WithField(key, val).WithField(key2, val2).Debug(msg)
    }
}

It appears from this pprof inspection that passing the integers into Debug2 is causing allocations, but let’s confirm this hypothesis.

Step 5: Discover Why Log Statements are Causing so Many Allocations

What to execute:

go build -gcflags='-m' . 2>&1 | grep partitioner.go


What this does:
The -m parameter to build will print to stderr optimization decisions made by the compiler. This includes if the compiler is able to allocate a variable on the stack or if it must move it to the heap. Go will move variables to the heap if the compiler can’t guarantee that a pointer to the variable won’t live somewhere else.

Result:

./partitioner.go:63: &k.ringSize escapes to heap
./partitioner.go:62: leaking param: k
./partitioner.go:70: message.Key escapes to heap
./partitioner.go:62: leaking param content: message
./partitioner.go:70: numPartitions escapes to heap
./partitioner.go:77: smallIndex escapes to heap
./partitioner.go:77: murmHash escapes to heap
./partitioner.go:77: shortHash escapes to heap
./partitioner.go:68: (*Partitioner).Partition b does not escape
./partitioner.go:71: (*Partitioner).Partition b does not escape


What this means:
The important parts here are on line 77 showing that smallIndex, murmHash, and shortHash all escape to the heap. Because the compiler is doing short lived heap allocations, I’m creating more objects on the heap than I really need.

Step 6: Benchmark Partition Function

What to write:

func BenchmarkPartition(b *testing.B) {
     r := rand.New(rand.NewSource(0))
     k := partitionPickingKey{}
     msg := sarama.ProducerMessage {
          Key: &k,
     }
     p := Partitioner{
          ringSize: 1024,
          ringName: "quantizer.ring",
     }
     num_partitions := int32(1024)
     for i := 0; i < b.N; i++ {
          k.tsid = r.Int63()
          part, err := p.Partition(&msg, num_partitions)
          if err != nil {
               panic("Error benchmarking")
          }
          if part < 0 || part >= num_partitions {
               panic("Bench failure")
          }
     }
}


What this does:
This benchmark simply creates B.N objects and does a basic check on the return value to make sure it’s not simply optimized away. Writing benchmarks are a recommended “before” step when programmers go on optimization hunts, to ensure you’re progressing in the correct direction.

Step 7: Benchmark Memory Allocations in Partition Function

What to execute:

go test -v -bench . -run=_NONE_ -benchmem BenchmarkPartition


What this does:
The bench argument to test runs benchmarks that match the regex “.” and -benchmem will also have benchmarks track heap usage on average per iteration.  By passing _NONE_ to -run, I am saving myself a bit of time by telling test to only run unit tests with the “_NONE_” string inside them. In other words, run none of the unit tests but all of the benchmarks.

Result:

PASS
BenchmarkPartition-8 10000000       202 ns/op      64 B/op       4 allocs/op


What this means:
Each loop of my benchmark took on average 202ns and, more importantly, allocated 4 objects per operation.

Step 8: Remove the Logging Statements

What to write:

@@ -66,7 +65,6 @@ func (k *Partitioner) Partition(message *sarama.ProducerMessage, numPartitions i
       }
       var b [8]byte
       binary.LittleEndian.PutUint64(b[:], uint64(message.Key.(*partitionPickingKey).tsid))
-       logherd.Debug2(log, "key", message.Key, "numP", numPartitions, "Partitioning")
       murmHash := murmur3.Sum32(b[:])
       // 34026 => 66
@@ -74,7 +72,6 @@ func (k *Partitioner) Partition(message *sarama.ProducerMessage, numPartitions i
       setSize := uint32(1 << setBits)
       shortHash := murmHash & (setSize - 1)
       smallIndex := int32(shortHash) * int32(k.ringSize) / int32(setSize)
-       logherd.Debug3(log, "smallIndex", smallIndex, "murmHash", murmHash, "shortHash", shortHash, "Sending to partition")
       return smallIndex, nil
 }


What this does:
My fix here was to remove the log lines. I had added them during testing/debugging and they never got removed. In this case, just removing the log lines is a simple enough solution.

Step 9: Rerun Build to Evaluate if Variables Escape to Heap

What to execute: 

go build -gcflags='-m' . 2>&1 | grep partitioner.go

Result:

./partitioner.go:62: &k.ringSize escapes to heap
./partitioner.go:61: leaking param: k
./partitioner.go:61: (*Partitioner).Partition message does not escape
./partitioner.go:67: (*Partitioner).Partition b does not escape
./partitioner.go:68: (*Partitioner).Partition b does not escape


What this means:
Notice that smallIndex, murmHash, and shortHash no longer give messages about escaping to the heap.

Step 10: Rerun Benchmark to Investigate New Allocs Per Operation

What to execute:

go test -v -bench . -run=_NONE_ -benchmem BenchmarkPartition


Result:

PASS
BenchmarkPartition-8 30000000        40.5 ns/op       0 B/op       0 allocs/op
ok   github.com/signalfuse/sfxgo/ingest/bus/rawbus 1.267s


What this means:
Notice that now each operation takes only 40ns, but more importantly, there are no allocations per operation. Since it was heap usage I was trying to optimize, this is the important part for me.

Concluding Remarks

Pprof is a very useful tool to dissect performance issuing with go code. By combining this exploration with go’s built in benchmarks, you can get real numbers about how much code changes help or hurt. Unfortunately, performance regressions tend to creep up over time. This makes a next step, and exercise for the reader, to store benchmark results in a database so you can view performance per commit over time.

Update Oct 23, 2015– Jack presented a talk on this topic at the GoSF meetup. Slides embedded below.

 

References for Leaning More

Thanks,
Jack Lindamood

Splunk
Posted by

Splunk

TAGS

A Pattern for Optimizing Go

Show All Tags
Show Less Tags

Join the Discussion