
The Old Way: Element Managers & Health Checks
Prior to the rise of the cloud, infrastructure health was primarily understood through simplistic IT health checks. Tools like Nagios and HP OpenView would pull status updates from the various machines and devices across the network. They’d report when any server or switch failed to respond to a ping or behaved out of the ordinary, and the infrastructure team would respond accordingly.
As the hardware and software stack became more complex, operations teams relied on additional information to build a more complete view of the state of their IT. In addition to health status, specialized application testing, network management, and server monitoring tools helped with performance engineering and analysis at each layer of the stack.
While a collection of element managers could provide specific insight into events at the database or storage layer, for example, so-called Manager-of-Managers technologies like IBM Tivoli, BMC PATROL, and CA Unicenter became necessary to capture, correlate, and make sense of the abundance of operations data.
Operations teams used a Manager-of-Managers to determine when a problem was significant enough to page someone in the middle of the night. However, as infrastructure and applications shifted to elastic, distributed cloud environments, traditional element and systems managers began to fail under the increased variety of data and the complexity of performance requirements. While pinpointing a down server was the top priority for the infrastructure team under the old regime, the ephemeral nature of modern infrastructure demands a more service-wide view of availability.
In an elastic environment, a series of alerts from a systems manager about host unavailability may be pure noise, whether because of a normal scale-down during low-traffic periods or because the service can tolerate individual node failures. Despite being an ill fit for monitoring cloud environments, traditional monitoring remains one of the largest categories of spend in the systems management space.
Although element managers can send events and generate alerts when individual hosts encounter errors, they weren’t built to provide a service-wide view of the patterns and trends that determine performance. Without analytics that aggregate metrics and provide a more dynamic view of performance relative to meaningful thresholds, even Manager-of-Managers systems only monitor the surface of an environment. They don’t address the service-level monitoring required to operate more sophisticated architectures made up of open-source stateful services like Elasticsearch, message buses like Kafka, Docker containers, orchestration tools like Mesos, and AWS infrastructure.
“Got Nagios? Get rid of it. The problem is that the level of usability and sophistication of the product is pretty much zero... Nagios instances don’t auto-configure themselves, they don’t detect application instances properly or consistently, and configuration of checks is painful.”
- Jonah Kowall, Gartner Blog Network, 22 February 2013
The New Way: Metrics Aggregation & Intelligent Alerting
Analytics on time series data underlie a modern approach to infrastructure monitoring and are key to ensuring availability of today’s distributed, elastic environments in production. Analytics help aggregate service-level metrics, providing a better way to explore performance than a component-level view alone.
Rather than just waiting to pull simple events or consolidating and analyzing alerts from a variety of noisy element managers (as alert aggregation tools do), a more effective solution provides real-time alerts on the metrics that actually matter to your specific architecture. By computing and visualizing rates of change, percentiles, moving averages, or variance relative to historical benchmarks, you can isolate a pattern, measure its severity, and correlate the root cause with the trend you’re observing to prevent an issue before it affects availability. A rough sketch of this kind of computation appears below.
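As a minimal illustration (not SignalFx’s implementation), the Python sketch below smooths a metric stream with a moving average and flags large deviations from a historical benchmark; the window size, baseline values, and three-sigma rule are assumptions made for the example.

```python
# Minimal sketch: rolling statistics over a metric stream, flagging points
# whose moving average deviates sharply from a historical baseline.
from collections import deque
from statistics import mean

WINDOW = 60  # number of recent samples to keep, e.g. one per second


def make_detector(baseline_mean, baseline_stdev, sigmas=3.0):
    """Return a function that ingests samples and reports anomalies."""
    window = deque(maxlen=WINDOW)

    def observe(value):
        window.append(value)
        moving_avg = mean(window)
        # Compare the smoothed value against the historical benchmark.
        deviation = abs(moving_avg - baseline_mean)
        is_anomaly = deviation > sigmas * baseline_stdev
        return moving_avg, is_anomaly

    return observe


# Example: latency samples, with an assumed historical norm of 120ms +/- 15ms.
observe = make_detector(baseline_mean=120.0, baseline_stdev=15.0)
for sample in (118, 125, 122, 340, 355, 360):
    avg, alarm = observe(sample)
    if alarm:
        print(f"latency anomaly: moving average {avg:.1f}ms")
```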
By aggregating metrics and comparing them against dynamic thresholds (rather than the static limits used by element managers), you can troubleshoot and triage problems at any level of the stack in real time. Dynamic thresholds let you compare metrics against a chosen benchmark that may change over time, such as the historical norm for a given time of day and day of week. The ability to spot and fix even a subtle change in latency, load, or throughput as it emerges is key to proactively operating modern applications in the cloud. For the first time, you can distinguish a normal change from an anomaly or a threatening pattern, get alerted, and address issues before they turn into emergencies that affect the end-user experience.
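As a hypothetical example of a dynamic threshold, the sketch below derives the acceptable range for a metric from its historical norm for the same hour of day and day of week; the norms and the three-sigma band are made-up values for illustration only.

```python
# Sketch of a dynamic threshold keyed to time of day and day of week,
# rather than a single static limit. All numbers are illustrative.
from datetime import datetime

# Hypothetical historical norms: (day_of_week, hour) -> (mean, stdev)
historical_norms = {
    (0, 9): (4200.0, 300.0),   # Monday 09:00, requests/min
    (0, 3): (800.0, 120.0),    # Monday 03:00, low-traffic period
}


def dynamic_threshold(ts: datetime, sigmas: float = 3.0):
    norm = historical_norms.get((ts.weekday(), ts.hour))
    if norm is None:
        return None  # no baseline yet for this time slot
    norm_mean, norm_stdev = norm
    return norm_mean - sigmas * norm_stdev, norm_mean + sigmas * norm_stdev


def breaches(value: float, ts: datetime) -> bool:
    bounds = dynamic_threshold(ts)
    if bounds is None:
        return False
    low, high = bounds
    return not (low <= value <= high)


# 800 requests/min is alarming at Monday 09:00 but normal at Monday 03:00.
print(breaches(800.0, datetime(2015, 6, 1, 9)))   # True
print(breaches(800.0, datetime(2015, 6, 1, 3)))   # False
```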
Infrastructure monitoring built on analytics also helps eliminate the false alarms and alert fatigue that can result from simplistic health checks. With a push model, where metrics and their corresponding metadata are reported at a regular cadence to an analytics system, an administrator can build an alert based on a dynamic query (e.g., alert any time a machine reporting itself as part of the login service has a CPU anomaly). Unlike other monitoring and management tools that require reconfiguration every time you change your environment, charts and alert rules created through dynamic queries automatically survive any and all updates.
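The sketch below illustrates the push model and a query-based alert rule in generic Python; the metric names, dimensions, and the fixed 90% CPU threshold are hypothetical assumptions for the example, not a specific SignalFx API.

```python
# Minimal sketch of the push model and a dynamic, query-based alert rule.
# Metric names, dimensions, and thresholds are illustrative, not a real API.
import json
import time


def datapoint(metric, value, **dimensions):
    """A datapoint carries its own metadata, so the backend can group by it."""
    return {"metric": metric, "value": value,
            "timestamp": int(time.time()), "dimensions": dimensions}


# Each host pushes its own metrics at a regular cadence; no central inventory
# has to be reconfigured when hosts come and go.
reported = [
    datapoint("cpu.utilization", 97.0, host="i-0a1b", service="login"),
    datapoint("cpu.utilization", 41.0, host="i-0c2d", service="checkout"),
]


# Alert rule expressed as a query over dimensions, not a list of hostnames:
# "any machine reporting itself as part of the login service with CPU > 90%".
def login_cpu_alert(points, threshold=90.0):
    return [p for p in points
            if p["dimensions"].get("service") == "login"
            and p["metric"] == "cpu.utilization"
            and p["value"] > threshold]


for p in login_cpu_alert(reported):
    print("alert:", json.dumps(p))
```

Because the rule matches on the service dimension rather than specific hosts, new machines that report themselves as part of the login service are covered automatically, which is the sense in which such alerts survive environment changes.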
With the real-time insight introduced by modern infrastructure monitoring, application developers, infrastructure engineers, and operations teams can collaborate across the entire application lifecycle for the first time.
For today’s distributed, elastic environments, infrastructure monitoring complements services like APM and log management by filling a large gap not previously addressed: intelligent and timely alerting on service-wide issues and trends within your production environment. Check out our new ebook to learn more about how SignalFx comes together with APM and logs for the most complete view of the application lifecycle in production.
Thanks,
Ryan Goldman