At Tapjoy we have a number of systems in place to monitor production deployments. Over time we’ve improvised a number of interesting applications out of necessity; most recently we leveraged our monitoring system to implement ‘deployed revision’ checks for our Go services.
Monitoring ‘deployed revision,’ or the revision of code running on any particular server at a given time, is critical to our deploy process. For example, this past June our engineering team completed 420 successful deploys across more than a dozen services. With an average of 19 deploys going out per working day, it’s incredibly important to be able to verify functionality in a rigorous and efficient way. When a deploy is in its canary stage, developers need to be able to decide whether to ‘roll forward’ with the deploy (i.e. release to production) or to ‘roll back’ to the previous working revision. Being able to accurately determine the state of the servers running a particular application or facet is crucial to keeping the deployment experience running smoothly for engineers.
Tapjoy Deployment Process
When code has been reviewed and passed CI (Continuous Integration), we create a deploy in an internal Ruby on Rails ‘deployboard’ application. That deploy uses Chore, our open-sourced Ruby job system, to manage the completion of several deploy-related tasks. The most important of these for getting code onto the canaries is the slug build process. We use an internally built Ruby tool, Slugforge, for constructing slugs that can be deployed to production boxes and unpacked to install a given application. For Ruby on Rails apps that means the application code and any vendored dependencies; for Go, that means the application binary.
Once the slug is constructed it is deployed to the canary, a single server from the targeted cluster or facet for the deploy (we use server lookup groups, configured via Tass, to locate target servers). The goal of each deploy is to have the code be verifiable on the canary prior to rolling wide across the entire cluster. This allows us to have confidence that when code is deployed across our entire infrastructure it will work as intended.
The Case for Monitoring
Most of the time this system works very well. There are circumstances, however, where it becomes helpful to be able to tell exactly what version of code is running on a target server. Some scenarios in which a deploy might fail include network partitions, ssh failures (typically encountered during initial server bootstrap, prior to deployment keys being installed), or even subtle bugs that prevent application startup (applicable when using unicorn or grace, for example).
Until very recently, we used sensu to do all of our deployed revision monitoring. While the presentation of the information was not incredibly helpful, it was useful for letting our operations team know when there was a rogue server running old code. An example of the information provided is below:
Simply having this information and the host was not enough for making a decision, however. This presentation leaves it up to the reader to determine exactly what ‘81031262df’ means in comparison with ‘22c036dd98,’ and whether or not it is ‘correct’ for this server to be running the former revision.
When we deployed our first high-traffic production Go application, we had to start over again with the deployed revision checks. The Go http package didn’t come with a built-in ‘info’ route which we could use to determine the code version, so we had to improvise. Eventually we settled on leveraging a tool we use for all kinds of business, server, and DevOps metric collection: SignalFx.
SignalFx as a Deployed Revision Library
One of the things all of our engineers love about SignalFx is how much more context a graph can provide compared to point-in-time information. Our goal was to take the deployed revision information and present it in such a way that it would be immediately obvious to the person viewing the chart if any action was necessary, or if the deployed revision difference was expected as part of a normal process (e.g. a canary). In short, we wanted something that would look like this:
At first glance, it might be tempting to report one metric per revision and chart them all with a wildcard match. If we reported metrics named like `app.version.a632f0`, we could match that SHA, and all future SHAs, with `app.version.*`.
Revision as a Dimension
We aim to provide data that is as rich as possible using SignalFx’s multi-dimensional data model (and to avoid the obfuscation that plain metric wildcards sometimes bring with them). If we instead report the SHA as a dimension on a version metric, the metric name is always `app.version` and the revision is carried in the `sha` dimension. The code we ended up implementing looked something like this:
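A minimal sketch of the idea in Go, not Tapjoy's actual implementation: the service periodically writes a gauge named `app.version` with the revision attached as a `sha` dimension. The names `revision`, `gaugeLine`, and `reportRevision` are hypothetical, the address assumes a local statsd agent on its default UDP port, and the bracketed `[sha=...]` encoding is an assumption that depends on how your statsd-to-SignalFx bridge parses dimensions.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"time"
)

// revision is assumed to be stamped in at build time, for example:
//   go build -ldflags "-X main.revision=$(git rev-parse --short HEAD)"
var revision = "unknown"

// gaugeLine formats a statsd gauge whose name carries the SHA as a dimension.
// The "[sha=...]" suffix is an assumed encoding; adjust it to whatever your
// metrics pipeline expects.
func gaugeLine(metric, sha string, value int) string {
	return fmt.Sprintf("%s[sha=%s]:%d|g", metric, sha, value)
}

// reportRevision writes the gauge once per interval until stop is closed.
// Re-sending the gauge keeps the current revision "alive" so that gauges for
// old revisions can expire once nothing writes to them.
func reportRevision(addr string, interval time.Duration, stop <-chan struct{}) error {
	conn, err := net.Dial("udp", addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if _, err := fmt.Fprint(conn, gaugeLine("app.version", revision, 1)); err != nil {
			log.Printf("revision gauge not sent: %v", err)
		}
		select {
		case <-ticker.C:
		case <-stop:
			return nil
		}
	}
}

func main() {
	// In a real service this would run in a goroutine for the life of the process:
	//   go reportRevision("127.0.0.1:8125", 10*time.Second, stop)
	fmt.Println(gaugeLine("app.version", revision, 1))
}
```

Because the metric name never changes, a chart can group `app.version` by the `sha` dimension and show one line per revision currently running.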
This of course required using the recommended statsd settings so that old gauges are deleted when they are no longer written to, but it results in a graph like this during a normal deploy:
Our operations team (and we!) loved the new visualization. Even on a particularly busy day, it’s very easy to correlate any deployed revision mismatches with deploys to determine that all of them were normal:
But how well does our solution hold up when things aren’t going as planned? Here’s an example graph from a day when a few servers were scaled in on an old revision (no deploys were going out during the ‘zoomed in’ time):
With this new method of displaying deployed revision information, it’s possible, at a glance, to determine if further investigation is necessary or if the current state is expected.
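The same judgment a person makes from the chart can be sketched in code. The example below is purely illustrative and is not part of Tapjoy's tooling: given a hypothetical map of host name to reported SHA, the hypothetical helper `staleHosts` treats the most widely deployed revision as current and lists every host running something else. Whether such a host is a canary or a rogue server still depends on whether a deploy is in progress.

```go
package main

import (
	"fmt"
	"sort"
)

// staleHosts returns the SHA reported by the most hosts, plus a sorted list
// of hosts reporting any other SHA. Ties are broken by SHA string order so
// the result is deterministic.
func staleHosts(revisions map[string]string) (current string, stale []string) {
	counts := make(map[string]int)
	for _, sha := range revisions {
		counts[sha]++
	}
	for sha, n := range counts {
		if n > counts[current] || (n == counts[current] && sha < current) {
			current = sha
		}
	}
	for host, sha := range revisions {
		if sha != current {
			stale = append(stale, host)
		}
	}
	sort.Strings(stale)
	return current, stale
}

func main() {
	// Host names are made up; the SHAs are sample values only.
	current, stale := staleHosts(map[string]string{
		"web-1": "22c036dd98",
		"web-2": "22c036dd98",
		"web-3": "81031262df",
	})
	fmt.Println("current revision:", current)
	fmt.Println("hosts on another revision:", stale)
}
```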
Our amazing operations team has already extracted the Go portion of this instrumentation into a package that is now going to be standard across all future Tapjoy Go applications. As we continue to instrument our code and send metrics with rich dimension data off to SignalFx for graphing, we are excited about the opportunity to be creative with the way we present information to end users, whether they are engineers looking at performance metrics or product managers doing funnel analysis. We are looking forward to discovering what new visualizations our broader organization is able to implement using SignalFx.
This is a guest post by Dean Dieker of Tapjoy, originally published on the Tapjoy engineering blog.