Key takeaways
Troubleshooting Kubernetes without the right metrics is like flying blind. Metrics, as quantifiable measurements or Key Performance Indicators (KPIs), are one of the four core telemetry types (metrics, logs, traces, and events) that form the backbone of Kubernetes observability.
Metrics provide essential visibility, enabling engineers to detect, diagnose, and prevent issues by offering measurable signals, instead of leaving you guessing and reacting only to symptoms.
This guide will break down the key Kubernetes metrics that matter for troubleshooting and demonstrate how to interpret them effectively to troubleshoot faster, avoid unnecessary downtime, and optimize your Kubernetes workloads.
| Metric name | What it measures | Why it matters for troubleshooting |
| --- | --- | --- |
| API server latency | Time taken for control plane API calls | Detects control plane slowdowns affecting cluster responsiveness |
| Node status | Node health and availability | Prevents workload disruption by detecting failing nodes early |
| Pod restart count | Number of restarts due to failures | Identifies unstable workloads or misconfigurations |
| Cross-cluster network latency | Delay in inter-service communication | Highlights networking issues impacting performance |
| CPU/memory utilization | Resource usage by pods and nodes | Prevents throttling, evictions, and resource starvation |
| HPA desired vs. actual replicas | Difference between desired and running pod count | Diagnoses autoscaling inefficiencies or resource starvation |
| Pod eviction count | Pods removed due to capacity or policy | Surfaces resource pressure and scheduling problems |
| Service discovery errors | Failures in workload-to-workload communication | Identifies misconfigurations or dependency failures |
Kubernetes environments are highly dynamic: pods are ephemeral, workloads shift between nodes, and autoscaling can change the shape of your cluster in seconds. All of this results in an overwhelming amount of data, much of it noise. Without the right metrics, you’re relying on instinct instead of evidence to find the signals that matter.
Grouping metrics into logical categories helps you cut through the noise and puts related signals side by side, so symptoms can be traced to causes quickly. For example:
When users report slow application response times, a quick glance at Network and Service Performance metrics might show increased latency, while Resource Utilization metrics reveal a sudden spike in CPU on a particular node.
This correlation, facilitated by grouping, quickly points to a potential "noisy neighbor" (another workload on the same node that’s using excessive resources) or a runaway process, allowing you to prioritize the fix and avoid sifting through unrelated data.
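To make that kind of correlation concrete, here is a minimal sketch that ranks the pods on a suspect node by live CPU usage. It assumes the official kubernetes Python client, a reachable kubeconfig, and a metrics-server serving the metrics.k8s.io API; the node name is a placeholder.

```python
# Minimal sketch: rank pods on one node by CPU usage to spot a "noisy neighbor".
# Assumes the official `kubernetes` Python client, a reachable kubeconfig, and
# that the metrics-server (metrics.k8s.io) is installed in the cluster.
from kubernetes import client, config

NODE_NAME = "worker-node-1"  # hypothetical node name; replace with your own

def cpu_to_millicores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity ('250m', '1', '123456789n') to millicores."""
    if quantity.endswith("n"):
        return float(quantity[:-1]) / 1_000_000
    if quantity.endswith("u"):
        return float(quantity[:-1]) / 1_000
    if quantity.endswith("m"):
        return float(quantity[:-1])
    return float(quantity) * 1000

config.load_kube_config()
core = client.CoreV1Api()
custom = client.CustomObjectsApi()

# Which pods are scheduled on the suspect node?
pods_on_node = {
    (p.metadata.namespace, p.metadata.name)
    for p in core.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={NODE_NAME}"
    ).items
}

# Current CPU usage per pod, as reported by metrics.k8s.io.
pod_metrics = custom.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "pods")
usage = []
for item in pod_metrics["items"]:
    key = (item["metadata"]["namespace"], item["metadata"]["name"])
    if key in pods_on_node:
        millicores = sum(cpu_to_millicores(c["usage"]["cpu"]) for c in item["containers"])
        usage.append((millicores, key))

# The top entries are the most likely noisy neighbors.
for millicores, (namespace, name) in sorted(usage, reverse=True)[:5]:
    print(f"{namespace}/{name}: {millicores:.0f}m CPU")
```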
With the value of grouping metrics in mind, this guide organizes key metrics into four foundational pillars:
1. Cluster and control plane health
2. Pod and workload health
3. Resource utilization
4. Network and service performance
Understanding these categories provides a structured approach to troubleshooting, effectively diagnosing issues, and maintaining a healthy cluster.
1. Cluster and control plane health

What they measure: These metrics provide a high-level view of the Kubernetes control plane's health and the readiness of your cluster's nodes. This includes API server availability, latency, and error rates, as well as node readiness and overall health.
Why they matter for troubleshooting: The control plane is the brain of your Kubernetes cluster; if it's unhealthy, every other part of the system is affected. Monitoring these metrics helps you quickly identify foundational issues that could impact all workloads.
How to use them for troubleshooting:
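As a quick first pass, a script can probe API server responsiveness and flag nodes that are not Ready or are under pressure. The sketch below is illustrative only: it assumes the official kubernetes Python client and a reachable kubeconfig, and the timing probe is a rough stand-in for proper API server request-duration metrics.

```python
# Minimal sketch: check node readiness and probe API server responsiveness.
# Assumes the official `kubernetes` Python client and a reachable kubeconfig;
# the latency probe is a rough proxy, not a replacement for apiserver metrics.
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# 1. Crude control-plane latency probe: time a lightweight API call.
start = time.monotonic()
client.VersionApi().get_code()
print(f"API server round trip: {(time.monotonic() - start) * 1000:.0f} ms")

# 2. Node health: flag any node whose Ready condition is not "True",
#    or that reports memory/disk/PID pressure.
for node in core.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type == "Ready" and cond.status != "True":
            print(f"NOT READY: {node.metadata.name} ({cond.reason}: {cond.message})")
        if cond.type in ("MemoryPressure", "DiskPressure", "PIDPressure") and cond.status == "True":
            print(f"PRESSURE:  {node.metadata.name} reports {cond.type}")
```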
2. Pod and workload health

What they measure: These metrics provide insights into the operational state and lifecycle of individual pods and their workloads (containers). This includes readiness and liveness check statuses, the difference between desired and actual replica counts, container exit codes, and restart counts.
Why they matter for troubleshooting: These are often the earliest and most direct signals of a problem at the application or workload level, indicating issues like misconfigurations, resource exhaustion, or application bugs.
How to use them for troubleshooting:
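For example, a short script can list containers with high restart counts along with their last termination reason and exit code, which quickly separates OOM kills from application crashes. This is a minimal sketch assuming the official kubernetes Python client and a reachable kubeconfig; the restart threshold is an arbitrary example value.

```python
# Minimal sketch: surface unhealthy workloads by restart count and last exit code.
# Assumes the official `kubernetes` Python client and a reachable kubeconfig;
# the restart threshold below is an arbitrary example value.
from kubernetes import client, config

RESTART_THRESHOLD = 5  # hypothetical threshold; tune for your environment

config.load_kube_config()
core = client.CoreV1Api()

for pod in core.list_pod_for_all_namespaces().items:
    for cs in pod.status.container_statuses or []:
        if cs.restart_count < RESTART_THRESHOLD:
            continue
        # The last terminated state carries the exit code and reason
        # (e.g. 137/OOMKilled points at memory limits, not application bugs).
        last = cs.last_state.terminated if cs.last_state else None
        reason = f"{last.reason} (exit {last.exit_code})" if last else "unknown"
        print(
            f"{pod.metadata.namespace}/{pod.metadata.name} "
            f"container={cs.name} restarts={cs.restart_count} last_termination={reason}"
        )
```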
3. Resource utilization

What they measure: These metrics track the consumption and allocation of computing resources (CPU, memory, network utilization, and disk) at both the node and pod levels.
Why they matter for troubleshooting: Resource management is crucial in Kubernetes. Overutilization can lead to performance degradation, throttling, and pod evictions, while underutilization wastes valuable resources and increases costs.
How to use them for troubleshooting:
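One simple starting point is comparing live node usage against allocatable capacity. The sketch below assumes the official kubernetes Python client, a reachable kubeconfig, and a metrics-server serving metrics.k8s.io; the quantity parsing is deliberately simplified.

```python
# Minimal sketch: compare live node CPU/memory usage against allocatable capacity.
# Assumes the official `kubernetes` Python client, a reachable kubeconfig, and the
# metrics-server (metrics.k8s.io); quantity parsing below is deliberately simplified.
from kubernetes import client, config

def to_millicores(q: str) -> float:
    if q.endswith("n"):
        return float(q[:-1]) / 1_000_000
    if q.endswith("m"):
        return float(q[:-1])
    return float(q) * 1000

def to_mebibytes(q: str) -> float:
    units = {"Ki": 1 / 1024, "Mi": 1, "Gi": 1024}
    for suffix, factor in units.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    return float(q) / (1024 * 1024)  # assume plain bytes

config.load_kube_config()
core = client.CoreV1Api()
custom = client.CustomObjectsApi()

# Allocatable capacity per node, from the node spec.
allocatable = {
    n.metadata.name: (
        to_millicores(n.status.allocatable["cpu"]),
        to_mebibytes(n.status.allocatable["memory"]),
    )
    for n in core.list_node().items
}

# Live usage per node, from metrics.k8s.io.
for item in custom.list_cluster_custom_object("metrics.k8s.io", "v1beta1", "nodes")["items"]:
    name = item["metadata"]["name"]
    cpu = to_millicores(item["usage"]["cpu"])
    mem = to_mebibytes(item["usage"]["memory"])
    alloc_cpu, alloc_mem = allocatable[name]
    print(
        f"{name}: CPU {cpu:.0f}m/{alloc_cpu:.0f}m ({cpu / alloc_cpu:.0%}), "
        f"memory {mem:.0f}Mi/{alloc_mem:.0f}Mi ({mem / alloc_mem:.0%})"
    )
```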
4. Network and service performance

What they measure: These metrics focus on the communication flow within and across your Kubernetes services. They include service request rates, latency, error rates, DNS resolution times, and cross-cluster connectivity.
Why they matter for troubleshooting: Network bottlenecks and communication failures often masquerade as application issues, making them challenging to diagnose without specific network-level visibility.
How to use them for troubleshooting:
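A lightweight first check is to flag Services whose Endpoints have no ready addresses and to time a DNS lookup. The sketch below assumes the official kubernetes Python client and a reachable kubeconfig; the DNS probe is only meaningful when run from inside the cluster, and the probed service name is a hypothetical example.

```python
# Minimal sketch: flag Services whose Endpoints have no ready addresses and time
# an in-cluster DNS lookup. Assumes the official `kubernetes` Python client and a
# reachable kubeconfig; the DNS probe only works when run inside the cluster, and
# the probed service name is a hypothetical example.
import socket
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# 1. Service discovery: an Endpoints object with no ready addresses usually means
#    failing readiness probes, selector typos, or a missing backend deployment.
for ep in core.list_endpoints_for_all_namespaces().items:
    ready = sum(len(s.addresses or []) for s in ep.subsets or [])
    not_ready = sum(len(s.not_ready_addresses or []) for s in ep.subsets or [])
    if ready == 0:
        print(f"No ready endpoints: {ep.metadata.namespace}/{ep.metadata.name} "
              f"(not-ready addresses: {not_ready})")

# 2. Rough DNS latency probe (run from inside a pod).
target = "checkout.shop.svc.cluster.local"  # hypothetical service DNS name
start = time.monotonic()
try:
    socket.getaddrinfo(target, 80)
    print(f"DNS resolved {target} in {(time.monotonic() - start) * 1000:.0f} ms")
except socket.gaierror as err:
    print(f"DNS resolution failed for {target}: {err}")
```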
The RED methodology is a gold standard for monitoring Kubernetes workloads:
- Rate: how many requests a service receives per second
- Errors: how many of those requests fail
- Duration: how long each request takes to complete
Tracking RED metrics in Kubernetes workloads can surface performance regressions before they trigger user-visible issues. For example, a sudden spike in latency combined with a rise in error rates may indicate downstream issues, like database slowness or resource contention on the node.
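To make the idea concrete, here is a minimal, self-contained sketch that computes RED values from raw request records and applies a simple combined-regression check. The hard-coded records and thresholds are illustrative stand-ins for data your instrumentation would actually provide.

```python
# Minimal sketch: compute RED (Rate, Errors, Duration) from raw request records.
# The records and thresholds below are illustrative stand-ins for whatever your
# instrumentation or tracing backend exports.
from statistics import quantiles

WINDOW_SECONDS = 60

# (http_status, duration_ms) observed during the window -- example data only.
requests = [(200, 42.0), (200, 55.3), (500, 310.0), (200, 38.9), (503, 295.5)]

rate = len(requests) / WINDOW_SECONDS                       # Rate: requests/second
errors = sum(1 for status, _ in requests if status >= 500)  # Errors: failed requests
error_rate = errors / len(requests)
durations = sorted(d for _, d in requests)
p95 = quantiles(durations, n=20)[-1]                        # Duration: p95 latency

print(f"rate={rate:.2f} req/s, error_rate={error_rate:.1%}, p95={p95:.0f} ms")

# A simple regression check: alert when errors and latency degrade together,
# which often points at a downstream dependency rather than the service itself.
if error_rate > 0.05 and p95 > 250:
    print("ALERT: elevated error rate and latency -- check downstream dependencies")
```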
Automatic scaling is a fundamental feature of Kubernetes, allowing your applications to adapt to changing loads. However, without proper monitoring, tuning autoscaling can become guesswork, leading to performance problems or resource waste. By tracking specific metrics, you can ensure your scaling mechanisms are working efficiently and diagnose issues quickly.
Here are the key metrics to monitor for effective Kubernetes autoscaling:
The HPA automatically scales the number of pods in a deployment or ReplicaSet based on observed CPU utilization or other select metrics.
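A quick way to check HPA behavior is to compare desired vs. current replicas and target vs. observed CPU utilization. This is a minimal sketch assuming the official kubernetes Python client, a reachable kubeconfig, and HPAs defined with the autoscaling/v1 CPU-utilization target.

```python
# Minimal sketch: spot HPAs that cannot reach their desired replica count or that
# are pinned at max replicas. Assumes the official `kubernetes` Python client, a
# reachable kubeconfig, and HPAs using the autoscaling/v1 CPU-utilization target.
from kubernetes import client, config

config.load_kube_config()
autoscaling = client.AutoscalingV1Api()

for hpa in autoscaling.list_horizontal_pod_autoscaler_for_all_namespaces().items:
    desired = hpa.status.desired_replicas
    current = hpa.status.current_replicas
    target = hpa.spec.target_cpu_utilization_percentage
    observed = hpa.status.current_cpu_utilization_percentage
    name = f"{hpa.metadata.namespace}/{hpa.metadata.name}"

    if desired != current:
        # A persistent gap usually means the cluster can't schedule new pods
        # (resource starvation) or the workload is failing to become ready.
        print(f"{name}: desired={desired} current={current} -- scaling is lagging")
    if desired == hpa.spec.max_replicas and observed and target and observed > target:
        # Pinned at max while still over target: raise maxReplicas or add capacity.
        print(f"{name}: at maxReplicas={hpa.spec.max_replicas} with CPU {observed}% > target {target}%")
```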
The VPA automatically adjusts the CPU and memory requests and limits for containers in a pod.
CPU and memory requests vs. limits: Mismatches between requested and limited resources can lead to problems.
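For example (a minimal sketch assuming the official kubernetes Python client and a reachable kubeconfig), a script can flag containers with missing requests or an unusually wide request-to-limit gap; the 4x ratio used below is an arbitrary illustrative threshold.

```python
# Minimal sketch: highlight containers with missing requests/limits or a large
# limit-to-request gap, which makes throttling and eviction behavior unpredictable.
# Assumes the official `kubernetes` Python client and a reachable kubeconfig.
from kubernetes import client, config

def cpu_millicores(q: str) -> float:
    return float(q[:-1]) if q.endswith("m") else float(q) * 1000

config.load_kube_config()
core = client.CoreV1Api()

for pod in core.list_pod_for_all_namespaces().items:
    for container in pod.spec.containers:
        res = container.resources
        requests = res.requests or {}
        limits = res.limits or {}
        where = f"{pod.metadata.namespace}/{pod.metadata.name}/{container.name}"

        if "cpu" not in requests or "memory" not in requests:
            # Without requests, the scheduler packs pods blindly and the pod
            # falls into a lower QoS class, making it a likelier eviction target.
            print(f"{where}: missing CPU or memory request")
        elif "cpu" in limits and cpu_millicores(limits["cpu"]) > 4 * cpu_millicores(requests["cpu"]):
            # A very wide request-to-limit gap invites noisy-neighbor throttling.
            print(f"{where}: CPU limit is more than 4x the request")
```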
The Cluster autoscaler automatically adjusts the number of nodes in your cluster.
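Pending pods with an Unschedulable condition are the signal the cluster autoscaler reacts to, so listing them is a quick health check. The sketch below assumes the official kubernetes Python client and a reachable kubeconfig.

```python
# Minimal sketch: list pods stuck in Pending with an "Unschedulable" condition --
# the signal that should be driving the cluster autoscaler to add nodes (or that
# node provisioning is failing). Assumes the official `kubernetes` Python client.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pending = core.list_pod_for_all_namespaces(field_selector="status.phase=Pending").items
for pod in pending:
    for cond in pod.status.conditions or []:
        if cond.type == "PodScheduled" and cond.reason == "Unschedulable":
            print(f"{pod.metadata.namespace}/{pod.metadata.name}: {cond.message}")

print(f"{len(pending)} pending pod(s) total")
```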
Resource contention can cause cascading failures across Kubernetes workloads. Key metrics to watch here include CPU throttling, node memory and disk pressure, and pod eviction counts.
Pair these resource metrics with throughput and latency metrics to understand the full impact of resource contention on user-facing services.
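Before you can pair anything, you need to see the contention itself. Here is a minimal sketch that correlates node pressure conditions with recent eviction events; it assumes the official kubernetes Python client and a reachable kubeconfig, and note that eviction events typically age out of the API server after about an hour.

```python
# Minimal sketch: correlate node pressure conditions with recent eviction events.
# Assumes the official `kubernetes` Python client and a reachable kubeconfig; note
# that events age out of the API server (roughly an hour by default).
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

# Nodes currently reporting pressure are prime suspects for evictions and throttling.
for node in core.list_node().items:
    for cond in node.status.conditions or []:
        if cond.type in ("MemoryPressure", "DiskPressure", "PIDPressure") and cond.status == "True":
            print(f"{node.metadata.name}: {cond.type} ({cond.message})")

# Recent evictions show which workloads actually paid the price.
for event in core.list_event_for_all_namespaces(field_selector="reason=Evicted").items:
    obj = event.involved_object
    print(f"Evicted: {obj.namespace}/{obj.name} at {event.last_timestamp} -- {event.message}")
```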
If you run workloads across multiple clouds, provider differences can complicate troubleshooting. Essential metrics include cross-cluster network latency, cross-cluster service discovery errors, and per-cluster resource utilization.
Consistency is key: Standardize metric names and tagging to make cross-environment comparisons meaningful. Use the OpenTelemetry metrics semantic conventions as your guide.
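As an illustration of what consistent tagging looks like in code, here is a minimal sketch using the opentelemetry-sdk package: it attaches semantic-convention resource attributes (k8s.cluster.name, cloud.provider, and so on) to every metric it emits. The ConsoleMetricExporter stands in for whatever OTLP backend you actually ship metrics to, and all attribute values and the metric name are placeholders.

```python
# Minimal sketch: emit a metric tagged with OpenTelemetry semantic-convention
# resource attributes so the same signal can be compared across clusters and clouds.
# Assumes the `opentelemetry-sdk` package; the exporter and all attribute values
# are placeholders for your real pipeline.
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

# Standardized resource attributes make cross-environment filtering meaningful.
resource = Resource.create({
    "service.name": "checkout",
    "k8s.cluster.name": "prod-us-east",   # placeholder values
    "k8s.namespace.name": "shop",
    "cloud.provider": "aws",
    "cloud.region": "us-east-1",
})

reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))

meter = metrics.get_meter("k8s-troubleshooting-example")
restart_counter = meter.create_counter(
    "k8s.pod.restart.count", unit="{restart}", description="Observed pod restarts"
)

# Record a restart with per-pod attributes; the resource attributes above travel
# with every data point automatically.
restart_counter.add(1, {"k8s.pod.name": "checkout-6d5f7c9b8-x2x7z"})
```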
Metrics aren’t just numbers; they’re the starting point for root cause analysis across the entire environment. Integrate them into your broader troubleshooting workflows by correlating them with the logs, traces, and events from the same workloads, rather than treating any single signal in isolation.
Ultimately, a robust, metric-first observability strategy transforms Kubernetes troubleshooting from reactive firefighting into proactive stability management. By consistently tracking the right signals, from control plane health and resource utilization to RED metrics, engineers can detect problems sooner, diagnose them faster, and prevent recurrence.
This approach is crucial whether you’re managing a single on-prem cluster or a fleet of multicloud deployments. A smart metric strategy ensures your workloads remain healthy and your users stay happy.
Ready to harness the power of observability in your K8s environments?
API server latency, node status, pod restarts, resource utilization, and network latency are among the top metrics to monitor for diagnosing cluster health and workload issues.
Grouped and correlated metrics help link symptoms (like latency or evictions) to underlying causes such as node failures, resource starvation, or misconfigurations.
RED stands for Rate, Errors, and Duration: critical metrics for measuring and alerting on application performance and detecting regressions before they impact users.
Metrics like desired vs. actual replicas and target CPU utilization reveal whether the HPA, VPA, or cluster autoscaler is effectively responding to load and scaling demands.
OpenTelemetry provides a consistent, open standard for collecting and tagging metrics, making them easier to filter and correlate across clusters and environments.
Over-tracking, using static thresholds, skipping correlation with logs/traces, and ignoring baselines can all reduce the effectiveness of metric-driven troubleshooting.