
How to Combine Log Data and Metrics for a Holistic System View

By Mike Mackrory

As engineering professionals, we support complex and intricate systems. Each application or service that we deploy becomes part of a much broader ecosystem. Traditionally, we’ve monitored our applications and our infrastructure through different methods, and observed the results via disparate views. 

Our software applications and their underlying infrastructure exist in a tightly integrated but always changing environment. Monitoring their performance in a consolidated dashboard can enable us to glean more meaningful insights and afford access to more actionable data. In this blog, we’re going to consider why and how to integrate application logs and infrastructure metrics for optimal system observability and support.

Log Analytics and Application Performance

When we consider how best to monitor our applications, we have two distinct sources of information. The first source is the logs that each application generates. Ideally, we want to transmit the logs from the host to a central log aggregator. Aggregating the logs is especially important in distributed systems, where multiple instances of the same application may be in service concurrently. The host instances may also be ephemeral, existing only as long as the system's demands require. Fortunately, we can tap into services like Kubernetes' logging facilities, AWS CloudWatch, Fluent Bit, and other mechanisms to ensure that logs can be readily consumed.

Log records give us insight into the events taking place within our application: errors and warnings, and the types and contents of the requests that the application receives and processes. Log analytics also provides a window into the system's state and contextual information, such as timestamps and consumer identifiers.
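As a concrete illustration, here is a minimal sketch of structured (JSON) logging in Python, using only the standard library. The service name and field names (consumer_id, event) are hypothetical, but emitting one JSON object per line with a timestamp is a format that shippers like Fluent Bit can parse directly.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line with a UTC timestamp."""
    def format(self, record):
        payload = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge any structured context passed via the `extra` argument.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Contextual fields (consumer ID, event type) ride along with the message.
logger.info("order received",
            extra={"context": {"consumer_id": "c-1042", "event": "order.created"}})
```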

Application Performance Monitoring (APM) metrics contain information specific to the application’s performance. APM metrics include:

  • Error counts categorized by type

  • Request counts and response durations

  • Security-related data

APM data is typically collected and displayed to visualize performance over time. Establishing baseline readings allows the APM system and operations personnel to generate trends and identify anomalies that require further analysis or response.
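To make the baseline-and-anomaly idea concrete, here is a minimal Python sketch (with illustrative numbers, not real APM data) that flags any reading deviating from the mean of the preceding window by more than a few standard deviations. Production APM systems use far more sophisticated models, but the principle is the same.

```python
import statistics

def find_anomalies(readings, window=10, threshold=3.0):
    """Flag readings more than `threshold` standard deviations away
    from the mean of the preceding `window` readings."""
    anomalies = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mean = statistics.mean(baseline)
        stdev = statistics.stdev(baseline)
        if stdev and abs(readings[i] - mean) > threshold * stdev:
            anomalies.append((i, readings[i]))
    return anomalies

# Illustrative response times in ms: steady around 120, with one spike.
response_times = [118, 122, 119, 121, 120, 117, 123, 119, 121, 120, 118, 450, 121]
print(find_anomalies(response_times))  # -> [(11, 450)]
```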

Infrastructure Observability and Monitoring

Applications require infrastructure on which to execute. Depending on how you deploy your application, this could be an on-prem server, a virtual instance in the cloud, a Docker container, or a serverless compute service. The application consumes processing power, memory, and network bandwidth, among other resources.

Understanding the availability of those resources and the levels at which your application uses them is critical to ensure optimal performance.

Gathering these metrics and transmitting them all to one place allows you to view all your infrastructure through a unified pane, rapidly identifying infrastructure in danger of exceeding available resources or not efficiently using those resources. As with APM metrics, establishing baseline measurements and identifying anomalies is highly valuable with infrastructure metrics.
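As a sketch of what gathering host metrics can look like, here is a minimal Python example using the third-party psutil library (an assumption on my part; in production, an agent such as the OpenTelemetry Collector typically does this for you) to sample CPU, memory, and network counters at a point in time.

```python
import time

import psutil  # third-party: pip install psutil

def sample_host_metrics():
    """Take a point-in-time sample of basic host resource usage."""
    net = psutil.net_io_counters()
    return {
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),  # averaged over 1s
        "memory_percent": psutil.virtual_memory().percent,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }

print(sample_host_metrics())
```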

A Holistic Approach

None of the metrics and analytics we've listed so far exist in a vacuum. When your systems experience performance problems or consumers report issues, you need to look at each of those data sources holistically to identify the problem's root cause. If, for example, your consumers report sluggish performance, the problem could be:

  • Insufficient available infrastructure resources.

  • Degraded performance in the underlying infrastructure, such as databases or dependent services.

  • Memory leaks, under-provisioned thread pools, a bug in a new software release, or other application-related problems.

  • Security concerns such as bot activity or DDoS attacks.

  • Non-compliant requests or data-parsing problems.

The problem source could also be something entirely different, or a composite problem consisting of multiple factors. If you want to dramatically increase your ability to identify these types of issues and thus reduce your MTTD (Mean Time to Detection), you need a way to bring all of these metrics together.

The one common factor in each of these metrics is time. Log messages contain timestamps, systems measure and report APM metrics at points in time, and your infrastructure emits metrics over time. If you use a tool like Splunk Log Observer, you can collect each of these data streams in one place and then search, analyze, and visualize the different components of the data set.
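To illustrate how time serves as the join key, here is a minimal Python sketch, using hypothetical log events and CPU samples, that buckets both streams into one-minute windows so error counts and resource usage can be read side by side.

```python
from collections import defaultdict

BUCKET = 60  # one-minute buckets, in seconds

def bucket_of(ts):
    """Map a Unix timestamp to the start of its one-minute window."""
    return int(ts) - int(ts) % BUCKET

# Hypothetical streams: (timestamp, value) pairs from logs and metrics.
error_events = [(1700000005, "HTTP 500"), (1700000012, "HTTP 500"),
                (1700000130, "HTTP 400")]
cpu_samples = [(1700000010, 85.0), (1700000070, 40.0), (1700000135, 92.0)]

errors_per_bucket = defaultdict(int)
for ts, _ in error_events:
    errors_per_bucket[bucket_of(ts)] += 1

cpu_per_bucket = defaultdict(list)
for ts, pct in cpu_samples:
    cpu_per_bucket[bucket_of(ts)].append(pct)

# Correlate: for each window, show the error count next to average CPU.
for window in sorted(set(errors_per_bucket) | set(cpu_per_bucket)):
    cpus = cpu_per_bucket.get(window, [])
    avg_cpu = sum(cpus) / len(cpus) if cpus else None
    print(window, errors_per_bucket.get(window, 0), avg_cpu)
```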

Finally, you might find yourself in a position where you can see the value of instrumenting your applications. However, you might find yourself blocked by financial constraints or worried about choosing the right approach and avoiding vendor lock-in. If this is the case, I'd recommend looking into OpenTelemetry. OpenTelemetry is a popular and rapidly growing open-source project that supports distributed tracing and other instrumentation across a wide range of languages and libraries. The project minimizes the work required by your engineers and supports multiple formats, eliminating concerns about vendor lock-in via proprietary agents.
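As a taste of how little code instrumentation can require, here is a minimal sketch using the OpenTelemetry Python SDK (this assumes the opentelemetry-sdk package is installed; the tracer and attribute values are hypothetical). It creates a span around a unit of work and exports it to the console; in production, you would swap in an exporter that sends spans to your observability backend.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Wire up a tracer provider that exports finished spans to stdout.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("demo-service")  # hypothetical instrumentation name

# Wrap a unit of work in a span; attributes add searchable context.
with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("http.method", "GET")
    span.set_attribute("http.route", "/orders")  # hypothetical route
```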

Learning More

This all sounds good in theory, and implementing such a solution doesn't need to be complicated in practice. If you're not already using a tool like Splunk Observability Cloud, you can find out more about this integrated user experience by watching this demo video.

Once the tool is ingesting your logs, application performance metrics, and infrastructure metrics, you'll want to invest some time setting up the process to convert your event logs to metric data points. Using trace IDs, machine identifiers, and timestamps, you can build queries and extract data aggregations from your logs, APM, and infrastructure metrics. Simply point and click to simplify log investigation in your DevOps practice.
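Conceptually, the log-to-metrics conversion is a grouped aggregation. Here is a minimal Python sketch, with hypothetical JSON log records, that rolls events up into error counts keyed by machine identifier and one-minute window; a tool like Log Observer builds this kind of aggregation for you through its point-and-click interface.

```python
import json
from collections import Counter

# Hypothetical JSON log lines, as shipped by a log forwarder.
raw_lines = [
    '{"timestamp": 1700000005, "level": "ERROR", "host": "web-1", "trace_id": "abc"}',
    '{"timestamp": 1700000009, "level": "INFO",  "host": "web-1", "trace_id": "def"}',
    '{"timestamp": 1700000015, "level": "ERROR", "host": "web-2", "trace_id": "ghi"}',
]

def to_metric_points(lines, bucket_seconds=60):
    """Convert event logs into metric data points:
    error counts per (host, one-minute window)."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        if record["level"] == "ERROR":
            window = record["timestamp"] // bucket_seconds * bucket_seconds
            counts[(record["host"], window)] += 1
    return counts

for (host, window), count in sorted(to_metric_points(raw_lines).items()):
    print(host, window, count)
```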

With Splunk Log Observer, you can take your queries further, using them as the basis for new dashboards and for alerts that forewarn your teams when potential problems begin to arise, all without having to learn a new query language. Some examples of cases that may require an automated notification include the following (a sketch of one such check appears after the list):

  • An increase in the count of specific error types, such as HTTP 400 and HTTP 500 responses.

  • Dramatic changes in request rates, whether an increase or a decrease.

  • Response times deviating by a defined percentage from an established baseline.
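As a sketch of that last case, assuming a baseline response time already established from historical data, the core alert condition can be as simple as a percentage-deviation check (all names and thresholds here are illustrative):

```python
def response_time_alert(current_ms, baseline_ms, max_deviation_pct=25.0):
    """Return True when the current response time deviates from the
    established baseline by more than `max_deviation_pct` percent."""
    deviation_pct = abs(current_ms - baseline_ms) / baseline_ms * 100.0
    return deviation_pct > max_deviation_pct

# Baseline of 120 ms established from historical data (illustrative).
print(response_time_alert(200.0, 120.0))  # True: ~67% above baseline
print(response_time_alert(130.0, 120.0))  # False: ~8% above baseline
```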

Sort and explore your data based on what's important, filter for critical logs, and watch them unfold in real time through the live tail feature. Start your free trial to bring the power of Splunk logging to SREs, DevOps engineers, and developers with Splunk Log Observer.

Learn more about Splunk Observability Cloud