In our previous post, we highlighted the challenges our team faced in running two monitoring systems and how we came to the decision that Nagios was insufficient for monitoring our environment. In this post, we’ll cover the key issues we addressed before consolidating our monitoring systems and how we gained confidence in the alerts coming from our production environment.
Replacing Nagios Checks
The first step in the process towards consolidating our systems was the audit of alerts in Nagios and SignalFx. In many cases, we found that SignalFx alerts triggered much faster and more reliably than existing Nagios checks. This was the primary reason we had not noticed the duplicate alerts – we focused on the SignalFx alerts we received and addressed those conditions immediately, and therefore forgot these specific Nagios alerts existed until they arrived after the fact.
We did identify two types of alerts that so far were solely the province of Nagios:
- Ping checks to identify unresponsive hosts. Nagios pinged hosts in our system to verify that they were reachable through the network.
- Service checks to identify downed services. Nagios performed customized checks on configured services to verify that they were up and running properly, often based on plugins we wrote ourselves.
At first, these seemed difficult to replace. We had to shift our thinking from an active model, where an external service accounted for individual hosts and services, to a passive model, where those hosts and services accounted for themselves.
In other words, since we rely on collectd plugins to report multiple metrics from individual hosts and services to our system, we needed to trust that self-reporting was an accurate replacement for a centralized system – itself a single point of failure – that checked multiple services and hosts.
To make that conclusion with confidence, we needed to determine which self-reporting metrics were similar to the host ping checks and individual service checks from Nagios.
The easiest way to replicate host ping checks was to identify a “heartbeat” metric that each host reported on a regular basis. If this “heartbeat” stopped reporting for a certain amount of time, we could assume there was an issue with the host. While we couldn’t confidently verify that the host was dead (since just because it stopped reporting doesn’t mean it’s down), we could conclude that further investigation was needed.
We decided that cpu.idle would be a good heartbeat reporter:
- It was a metric every host was guaranteed to have, since it’s a basic CPU metric.
- It was easy to configure, since it’s reported by a built-in collectd plugin.
- It reports at whatever interval you set (in our case, we relied on the default of every 10 seconds).
- It doesn’t require much thought or analytics to interpret.
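As a sketch, the collectd side of this amounts to very little configuration (the interval is shown explicitly here, though 10 seconds is collectd’s default):

```
# collectd.conf (sketch) – the cpu plugin supplies cpu.idle,
# and Interval controls how often every plugin reports.
Interval 10
LoadPlugin cpu
```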
Accordingly, we set up a detector that would alert if a host stopped reporting cpu.idle for a defined amount of time.
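The detector logic itself is simple enough to sketch in plain Python (a hypothetical in-memory illustration, not the actual SignalFx detector API):

```python
import time

# Illustrative sketch of the heartbeat detector's logic: a host is
# flagged when its last cpu.idle datapoint is older than the alert
# threshold (we settled on 9 minutes of non-reporting).

ALERT_THRESHOLD_SECONDS = 9 * 60

def stale_hosts(last_seen, now=None, threshold=ALERT_THRESHOLD_SECONDS):
    """Return hosts whose cpu.idle heartbeat has stopped reporting.

    last_seen maps host name -> unix timestamp of its latest datapoint.
    """
    now = time.time() if now is None else now
    return sorted(h for h, ts in last_seen.items() if now - ts > threshold)

# Example: hostB last reported 10 minutes ago, so it is flagged.
now = 1_000_000
reports = {"hostA": now - 30, "hostB": now - 600}
print(stale_hosts(reports, now=now))  # → ['hostB']
```

A flagged host isn’t necessarily dead – as noted above, it just means further investigation is needed.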
An example of an alert fired for a host that did not report for 9 minutes
However, to adequately replace Nagios, we also needed to account for a host being intentionally shut down or terminated.
Accounting For Intentional Downtime
One of the nicest features of Nagios is the ability to mute alerts for a defined period of time. Unfortunately, we didn’t do the best job of leveraging this feature since we relied on the UI rather than spending the time to learn the API and factor it into our infrastructure/developer tools. Consequently, we’d frequently forget to unmute the alert (or even set it in the first place) – we relied on people remembering this extra manual step outside of the regular maintenance process. It was critical not only to replicate this Nagios feature, but also figure out a better way to implement it within SignalFx.
We figured we could take advantage of SignalFx’s ability to tag metric dimensions via the API and filter out those tags in the detector charts we set up. In other words, if a host dimension was tagged with a marker like “host_terminated”, we excluded it from the alert. Since all our interaction with AWS hosts is via wrapper tools we wrote around AWS APIs, we figured we could just build calls to the SignalFx APIs into the wrapper tools to apply these tags.
Since we sync AWS’s own tags into SignalFx, we could also rely on AWS markers to help us filter intentionally downed hosts. This solved the intentional downtime dilemma by building the work of putting a host in downtime into our actual operational practice. There were no extra steps required – we put a host in downtime simply by running the tool we’d always run when we wanted to stop or terminate a host.
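The filtering side can be sketched as follows (tag names here are illustrative; in our setup the tags were applied by our AWS wrapper tools):

```python
# Sketch of the downtime filter: before alerting, drop any host whose
# dimension carries a downtime marker such as "host_terminated".
# The tag names below are hypothetical examples.

DOWNTIME_TAGS = {"host_terminated", "host_stopped"}

def alertable_hosts(silent_hosts, host_tags):
    """silent_hosts: hosts whose heartbeat stopped reporting.

    host_tags maps host name -> set of tags on its host dimension,
    as synced from AWS or applied by our wrapper tools.
    """
    return [h for h in silent_hosts
            if not (host_tags.get(h, set()) & DOWNTIME_TAGS)]

tags = {"web-1": set(), "web-2": {"host_terminated"}}
print(alertable_hosts(["web-1", "web-2"], tags))  # → ['web-1']
```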
This is the rule for the alert as we set it up – fortunately it turned out to be simple!
Based on experience, we determined that 9 minutes of non-reporting was a good threshold for triggering an alert. We ran this in parallel with Nagios to see whether it was an adequate replacement for Nagios ping checks.
It turned out that not only was it adequate, it actually did a better job than Nagios ping checks. We received alerts for unintentionally downed hosts faster than Nagios ping checks reported them. The alerts were also consistently accurate, because intentionally downed hosts were filtered out automatically.
In addition, these alerts also let us know when a host dropped out of the system. Since they trigger if the host’s reporting mechanism (i.e., collectd) is down, not simply when the host itself is down, they also alert us to situations that could hurt the accuracy of our analytics. So this alert ended up serving as more than a ping check.
Service Health Checks
Coming up with a parallel to Nagios service health checks was a bit more complex than coming up with a parallel to the ping checks. We needed to:
- Identify a “health heartbeat” metric similar to cpu.idle for each service
- Write customized collectd plugins, just as we had written so many customized plugins for Nagios
- Build tagging of metric dimensions to indicate intentional downtime to our service orchestration tools, just as we did with our AWS wrapper tools
This ended up taking a bit more configuration and thought than replacing the ping checks.
Customizing Plugins for Service Health Metrics
First, we had to identify a health metric for each particular service. For some services, we could rely on a simple http return metric. For others, we could rely on parsing json results retrieved via an http port – for example, we publish health json via jolokia on all of our own services developed in Java.
Initially we tried using the collectd curl and curl_json plugins for these two types of metrics. However, we ended up writing our own collectd plugin that would both check the http return and parse json results. We wanted the ability to check for non-numerical json results and return a metric value based on their value, and liked the idea of combining these capabilities into one plugin. Eventually, we expanded this plugin to also include TCP port checks. The plugin knows which type of check to perform based on the values given in each service’s collectd plugin configuration file.
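The three check types can be sketched like this (function names and the healthy-value mapping are illustrative, not our actual plugin’s interface):

```python
import json
import socket
import urllib.request

# Hypothetical sketch of the combined health check. The check type comes
# from each service's collectd plugin configuration: "http" checks the
# HTTP status code, "json" maps a non-numerical JSON field to a numeric
# metric value, and "tcp" checks that a port accepts connections.

def http_check(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 1 if resp.status == 200 else 0
    except OSError:
        return 0

def json_check(payload, field, healthy_values=("ok", "green")):
    # Map a non-numerical JSON result (e.g. "status": "green") to 1/0.
    return 1 if json.loads(payload).get(field) in healthy_values else 0

def tcp_check(host, port, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 1
    except OSError:
        return 0

print(json_check('{"status": "green"}', "status"))  # → 1
```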
Other services had their own health gauge, such as Zookeeper’s ‘ruok’ command. For these, we customized existing plugins to also execute this command and report a corresponding health metric for each possible result. (See here for how we gather the ‘ruok’-based metric.) For some services, like Cassandra, we determined our own health indicator (basically a simple Cassandra db query) and wrote a plugin that would gather and report a corresponding metric. For all of these, we created simple charts to track each metric.
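The Zookeeper case is a good concrete example: ‘ruok’ is one of Zookeeper’s four-letter-word commands, and a healthy server answers “imok”. A minimal sketch of that check (our real plugin reports a metric per possible result; this collapses it to healthy/unhealthy):

```python
import socket

# Sketch of a Zookeeper 'ruok' check: send the four-letter command over
# the client port and expect "imok" back. Returns 1 for healthy,
# 0 for any other response or a connection failure.

def zk_ruok(host, port=2181, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"ruok")
            s.shutdown(socket.SHUT_WR)
            return 1 if s.recv(16) == b"imok" else 0
    except OSError:
        return 0
```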
An example of health metrics gathered across the environment for HTTP/JSON/TCP status
Intentional Downtime to Our Service Orchestration Tools
The complexity here was in figuring out what identifier to use so that our service orchestration system, MaestroNG, could tag a unique metric dimension to indicate a service was intentionally down. (More details on audit.py) Sometimes we run multiple services on multiple Docker containers on a single host. How could we accurately indicate that only one service was down on a host that was running two or more? Also, how could we identify which service on which host was down in cases where we were running, say, five separate instances of a service on five separate hosts?
We settled on creating a plugin_instance dimension for each service instance whose name took the form:
e.g., an Elasticsearch instance running on host1.signalfx.com would have the
We added API calls from MaestroNG via an auditor script that would tag this plugin_instance dimension with a tag indicating that a service was stopped or started.
With that, we were able to successfully integrate intentional downtime notification and removal of downtime into our existing tools and didn’t have to even think about accounting for an external system – it’s just part of our regular development and operational process.
We wrote a detector chart that tracked whether a service health metric stopped reporting for 9 minutes. It accounted only for services intentionally started and filtered out services intentionally stopped (we also filtered out services running on terminated hosts, to catch anything that might have gone down without being shut down by our orchestration system). For example, here is our Zookeeper service detector:
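The filtering logic of that detector can be sketched as follows (the tag names are illustrative; in our setup they were applied via the MaestroNG auditor script and our AWS wrapper tools):

```python
# Sketch of the service detector's filtering: flag a (host, service)
# pair only if its health metric went silent, it was not intentionally
# stopped, and its host was not terminated. Tag names are hypothetical.

def alertable_services(silent, service_tags, terminated_hosts):
    """silent: (host, service) pairs whose health metric stopped reporting.

    service_tags maps (host, service) -> tags on its plugin_instance
    dimension; terminated_hosts is the set of intentionally downed hosts.
    """
    return [(h, s) for h, s in silent
            if "service_stopped" not in service_tags.get((h, s), set())
            and h not in terminated_hosts]

silent = [("h1", "zookeeper"), ("h2", "zookeeper"), ("h3", "zookeeper")]
tags = {("h2", "zookeeper"): {"service_stopped"}}  # stopped on purpose
print(alertable_services(silent, tags, {"h3"}))  # → [('h1', 'zookeeper')]
```

Note how this handles the multi-instance cases above: keying on the (host, service) pair means one stopped service on a shared host, or one instance out of five, is silenced without masking the others.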
Just as with the ping check replacement, we ran these service health check replacements in parallel with Nagios, and we got faster, more accurate results.
From Two To One
Since we were able to prove that our replacement self-reporting health metrics were equal to (and in fact better than!) Nagios checks, we felt confident disconnecting Nagios. Now we maintain and improve only one monitoring system.
In the next post, we’ll share the lessons learned from this process and the unexpected benefits of moving to one, consolidated monitoring system.