For every modern application, infrastructure monitoring that aggregates metrics and focuses on time series analytics is essential to ensuring availability and performance in production. Infrastructure monitoring fills a large gap not previously addressed by logs (or APM): intelligent and timely alerting on service-wide issues and trends across the environment (whether in the cloud or on-prem, or a mix of legacy and new architectures).
Ultimately, the best DevOps strategy requires full visibility not only up and down the stack, but also across all stages of the application lifecycle. Logs alone do not provide the real-time insights and alerts that are required to ensure availability and performance when you don’t have the luxury of time.
To evaluate an issue in production, log management tools like Splunk and the Elastic Stack help operations teams explore all the details of an event and determine root cause after the fact. But the massive detail that logs provide can't realistically be processed quickly enough to deliver the meaningful, proactive, and timely alerts that are needed to operate today's distributed, scale-out environments.
The immense volume of log data generated by modern infrastructure offers operations teams deep insight into the root cause of a system problem. Logs are primarily unstructured data, typically in the form of message streams, emitted from applications as a detailed record of events. By auditing and exploring log data in the context of the application that created it, engineers can troubleshoot code or system bugs and evaluate time-sequenced issues in depth.
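To make the idea of unstructured message streams concrete, here is a minimal sketch of how an engineer might pull structured fields out of a raw log line during troubleshooting. The log line, pattern, and field names are all illustrative assumptions, not the format of any particular application or tool:

```python
import re

# A hypothetical unstructured log line, as an application might emit it.
log_line = '2024-03-07T14:22:05Z ERROR payment-service: checkout failed for order 8812 (timeout after 30s)'

# A sketch of extracting structured fields from the message stream so a
# time-sequenced view of an incident can be reconstructed after the fact.
LOG_PATTERN = re.compile(
    r'(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>[\w-]+):\s+(?P<message>.*)'
)

match = LOG_PATTERN.match(log_line)
if match:
    event = match.groupdict()
    print(event['timestamp'], event['level'], event['service'])
```

The point of the sketch is the cost it implies: every line must be scanned and pattern-matched before it yields anything queryable, which is exactly why log analysis shines in post-hoc investigation rather than real-time alerting.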
Analysts can also use logs to enrich other data sets to gain intelligence into machine and user interactions. Log data is used not only for server, network, and software troubleshooting, but also for regulatory and security compliance, forensics, and incident investigation.
However, logs are not particularly useful for alerting on real-time infrastructure issues across distributed environments. At the time of an emergency, an infrastructure monitoring solution provides the necessary service-level details to triage and remediate the issue.
Metrics are the best first line of defense when dealing with a problem. Streamed into an analytics-based monitoring solution, they help teams quickly narrow an issue down to the offending service or application. Even more effectively, modern infrastructure monitoring can generate proactive alerts on patterns that foretell a mounting concern, providing the runway to isolate, assess, and address the underlying issue before it affects the end user.
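A proactive alert on a mounting trend can be sketched in a few lines. This is a minimal illustration, not any vendor's alerting engine; the window size, threshold, and sample data are all assumed values chosen for the example:

```python
from collections import deque

def trend_alert(stream, window=5, threshold=0.92):
    """Flag a metric stream once its rolling average crosses a threshold.

    A minimal sketch: real monitoring systems use far richer detection
    (rate of change, seasonality, outlier-aware baselines).
    """
    recent = deque(maxlen=window)
    alerts = []
    for ts, value in stream:
        recent.append(value)
        if len(recent) == window and sum(recent) / window > threshold:
            alerts.append((ts, sum(recent) / window))
    return alerts

# Hypothetical CPU-utilization samples (timestamp, fraction busy): a slow
# climb that an average-based alert catches before utilization pins at 100%.
samples = [(t, 0.5 + 0.05 * t) for t in range(12)]
print(trend_alert(samples))
```

Because each point is a small numeric value, this kind of rolling evaluation runs continuously and in constant memory per series, which is what makes metric streams suitable for real-time alerting where log scans are not.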
Because logs are primarily unstructured data, they are well suited to batch analysis of a discrete event. But that same big data approach makes them poorly suited to the real-time search and stream processing required for timely alerts. The heavy disk I/O and network load of log exploration align with post-hoc analysis, not with the high-throughput ingestion and querying that a time series database delivers for infrastructure monitoring.
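The structural contrast is worth making concrete. A metric data point is a small, fully structured record, so a stream of points can be rolled up incrementally without scanning free text. The field names and rollup key below are illustrative assumptions, not a specific product's wire format:

```python
from collections import defaultdict

# A hypothetical metric data point: a name, a timestamp, a numeric value,
# and a few dimensions (tags). Every point in a series has the same shape.
point = {
    'metric': 'http.requests',
    'timestamp': 1709822525,
    'value': 1.0,
    'tags': {'host': 'web-03', 'status': '500'},
}

def rollup(points):
    """Aggregate points by (metric, status) with constant memory per key.

    A sketch of the kind of streaming rollup a time series database
    performs at ingest time -- no full-text scan is ever needed.
    """
    totals = defaultdict(float)
    for p in points:
        key = (p['metric'], p['tags'].get('status'))
        totals[key] += p['value']
    return dict(totals)
```

Contrast this with the log-parsing cost described above: aggregating structured points is a hash-table update per sample, which is why metric pipelines can sustain the throughput that real-time alerting demands.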
For cloud environments, where the goal is to scale infrastructure elastically, you need a purpose-built system focused on metrics and analytics. Real-time aggregation is not a job for batch analytics, because alerting demands much faster, more flexible insight. Log analysis for deeper exploration and investigation is ultimately a great complement to an infrastructure monitoring solution that handles real-time analytics and alerting on time series data.
With the real-time insight introduced by modern infrastructure monitoring, application developers, infrastructure engineers, and operations teams can collaborate across the entire application lifecycle for the first time, from pre-production performance engineering through real-time service-level monitoring in production to post-mortem investigation of past issues.
Join our live weekly demo on cloud monitoring »