The journey to becoming cloud-native comes with great benefits but also brings challenges. One of these challenges is the volume of operational data from cloud-native deployments — data comes from the cloud infrastructure, ephemeral application components, user activity, and more. The increased number of data sources does not only increase datapoint volume – it also requires that monitoring systems store and query against data with higher cardinality than ever before.
In this post I’ll explain why this is, review precisely what cardinality means in monitoring systems, and discuss how Splunk Observability Cloud addresses the challenges of handling high cardinality metrics. I’ll also discuss the benefits of high cardinality metrics and why you avoid cardinality at your peril.
So What is Cardinality in Monitoring Systems?
You may remember from math class that cardinality is the number of elements in a set. In monitoring, cardinality is the number of distinct values a metric can take. For example, monitoring an application that handles only two HTTP methods, GET and POST, results in a cardinality of 2. Adding support for another HTTP method (e.g. HEAD) would increase the cardinality for this application to 3.
Each of these values is stored in your monitoring system’s time-series database (TSDB) as a metric time series (MTS). Generally, an MTS is the unique combination of a metric name (in this example, HTTP Method) and any number of key-value pairs known as dimensions or labels (here, the GET and POST values). Adding more labels multiplies the number of MTSes, because each combination of metric and dimension values is treated as a unique time series; metrics with many such combinations are considered high cardinality. In return, high cardinality allows for better monitoring by providing finer-grained filtering of the captured metric time series.
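To make the counting concrete, here is a minimal Python sketch of the idea. It is not Splunk’s data model: the metric name `http.requests`, the `method` dimension key, and both helper functions are hypothetical names chosen for illustration. It treats each unique combination of metric name and dimension key-value pairs as one MTS.

```python
from typing import Iterable

def mts_key(metric: str, dimensions: dict[str, str]) -> tuple:
    """Identity of a metric time series: metric name plus sorted dimension pairs."""
    return (metric, tuple(sorted(dimensions.items())))

def count_mts(datapoints: Iterable[tuple[str, dict[str, str]]]) -> int:
    """Count the distinct time series among (metric, dimensions) datapoints."""
    return len({mts_key(metric, dims) for metric, dims in datapoints})

# Hypothetical datapoints for an application that handles GET, POST, and HEAD.
datapoints = [
    ("http.requests", {"method": "GET"}),
    ("http.requests", {"method": "POST"}),
    ("http.requests", {"method": "GET"}),   # same series as the first datapoint
    ("http.requests", {"method": "HEAD"}),
]

mts_total = count_mts(datapoints)  # three distinct series: GET, POST, HEAD
```

Sorting the dimension pairs means the order in which labels arrive does not create a new series; only a new key or value does.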
Here is an example of what metrics look like without the use of dimensions or labels. Using the Splunk Observability Cloud Metric Finder, we can see in the example below that we have an overwhelming number of pod memory usage metrics. With more than 1,000 pods deployed in our environment, this data is overwhelming and nearly useless without adding a label (i.e., cardinality) to these metrics.
Here are two examples of Kubernetes pod memory usage metrics. In the example above, you can see the data as-is with no labels. In the example below, adding dimensions (kubernetes_namespace and deployment) allowed us to quickly find high memory utilization in a particular time period of the example deployment.
What Use Cases Require Use of High Cardinality Metrics?
Labels that create cardinality let you slice and dice metrics by various groups. Here are some examples where high cardinality metrics come in handy. I’ve italicized the dimension that unlocks the value in each example:
- Understanding your customer experience: A great example here would be a set of APIs that you expose to users. Maintaining SLAs to customers for uptime and performance of those APIs would require monitoring request rate, errors, and duration for each of your services and whether performance varies by user.
- Planning for suitable capacity: Forecasting infrastructure capacity requires historical data on resource utilization for each service to see trends over long periods.
- Phased Deployments: Monitoring performance metrics during a canary or blue/green deployment.
- Immutable Infrastructure: On every code push, every container is replaced with a new container built from a common image that contains the new code changes. In this use case, each new container is identified by a unique time series, unless a label is added to unify the containers.
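The immutable infrastructure case can be sketched in a few lines of Python. This is an illustration only: the `container_id` and `service` dimension names, the sample values, and the `sum_by` helper are all assumptions, not part of any Splunk API. Container IDs change with every deployment, while the shared `service` label lets the short-lived series roll up into one view.

```python
from collections import defaultdict

# Hypothetical datapoints: (dimensions, memory usage in MB).
samples = [
    ({"container_id": "a1", "service": "checkout"}, 100),
    ({"container_id": "b2", "service": "checkout"}, 140),
    ({"container_id": "c3", "service": "cart"}, 80),
]

def sum_by(samples, label):
    """Sum values across all series that share the same value for `label`."""
    totals = defaultdict(int)
    for dims, value in samples:
        totals[dims[label]] += value
    return dict(totals)

per_container = sum_by(samples, "container_id")  # one entry per container
per_service = sum_by(samples, "service")         # containers unified by service
```

Grouping by `container_id` yields a separate entry for every replaced container; grouping by `service` stays stable across deployments.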
Why Is This Important?
As infrastructure becomes cloud-native and development practices evolve, it is possible to have hundreds or even thousands of different services sending millions of data points per second. These evolving practices change how we monitor. Rather than collecting only infrastructure metrics and looking at individual components (servers), organizations are also instrumenting their application workflows to provide visibility into critical microservices, and monitoring groups of similar components (Kubernetes nodes, pods).
Using these non-standard tags and dimensions to slice and dice data causes high cardinality, but what are the benefits? Having this granularity adds power to your data and enables you to answer questions like:
- What products do my customers purchase the most?
- Are my Kubernetes containers evenly distributed across deployed Kubernetes nodes?
- How many 500 errors occurred during high traffic times?
- Does each of my microservices have adequate resources?
Here is an example of the value of these detailed metrics: answering how many successful item sale transactions have occurred within a 12-hour period. We can see from the example below that each customer tier (Gold, Silver, and Platinum) shows more than 120 successful item sale transactions. If the success rate were different by customer tier, we could investigate further. Without adding this information at the time of ingestion, we’d be unable to determine whether such a difference existed.
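As a rough illustration of why the tier has to be attached at ingestion time, the Python sketch below computes a success rate per customer tier. The records, the `tier` and `status` field names, and the `success_rate_by_tier` function are made up for this example and are not taken from the chart described above.

```python
from collections import Counter

# Hypothetical transaction records, each tagged with a customer tier at ingestion.
transactions = [
    {"tier": "Gold", "status": "success"},
    {"tier": "Gold", "status": "success"},
    {"tier": "Silver", "status": "success"},
    {"tier": "Silver", "status": "failure"},
    {"tier": "Platinum", "status": "success"},
]

def success_rate_by_tier(transactions):
    """Return the fraction of successful transactions for each tier."""
    total = Counter()
    succeeded = Counter()
    for record in transactions:
        total[record["tier"]] += 1
        if record["status"] == "success":
            succeeded[record["tier"]] += 1
    return {tier: succeeded[tier] / total[tier] for tier in total}

rates = success_rate_by_tier(transactions)
```

If the `tier` field were missing from the records, the only number available would be the overall rate, and the difference between Silver and the other tiers in this sample would be invisible.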
Is High Cardinality Hard to Manage?
High cardinality metrics can be hard to manage because they increase the number of time series stored by your time-series database and, as a result, require more queries that are also more complex. With more stored time series, any significant event (e.g., a code push or a burst of user traffic) will also result in a flood of simultaneous writes to the database.
The documentation for many monitoring systems actually warns users not to send in dimensions with a high number of potential values, or to keep dimension values below a hard limit, to avoid performance penalties. In those systems, a larger time series footprint leads to slower-loading charts, delayed alerts, and less reliable monitoring.
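A quick back-of-the-envelope calculation shows how fast the footprint grows. The Python sketch below assumes, as a worst case, that every combination of dimension values actually occurs; the dimension names and counts are hypothetical.

```python
from math import prod

def max_series(distinct_values_per_dimension: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per dimension."""
    return prod(distinct_values_per_dimension.values())

# Hypothetical metric with three dimensions.
before = max_series({"method": 3, "status_code": 5})             # 3 * 5
after = max_series({"method": 3, "status_code": 5, "pod": 1000})  # 3 * 5 * 1000
```

Adding a single dimension with 1,000 possible values turns 15 potential series into 15,000, which is why systems with hard limits discourage such dimensions.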
Splunk Observability Cloud is Built for High Cardinality Metrics
Unlike competitors, Splunk Observability Cloud allows you to query over 50,000 metric time series in a single job without incurring performance penalties. It is designed around several features aimed at reducing computational overhead when fetching data across high cardinality metrics, both in how queries are performed against metric time series and how they are stored. You will also never miss any metrics. With its NoSample™ Full-Fidelity Trace Ingestion and Retention, Splunk Observability Cloud collects all traces instead of a sampled subset, leaving no anomaly undetected. This also can be done with practically infinite cardinality, letting you get the insights that you need, when you need them.
Since databases tend to be optimized for either storage or retrieval or handling a particular type of data, most TSDBs impose limitations on aggregation, cardinality, and querying to preserve stability at the expense of performance.
Splunk Observability Cloud was designed with two separate backends for storing metric time series – one specifically optimized to handle metric values and one for human-readable metadata. Each backend is tailored to a specific use case (handling metadata is a search problem, while datapoint storage requires optimizing for bulk reads and writes) and scales independently of the other.
Better Query Performance
To help with query performance, Splunk Observability Cloud does not give ‘source’ or ‘host’ dimensions special treatment. Other TSDBs require users to specify a primary filter condition on source or host dimensions; any further filtering by tags happens on the result of that primary filter, which makes queries highly inefficient. When there is no condition on source/host, searching just by a tag requires those systems to scan through all time series to find which ones match.
Splunk Observability Cloud treats all dimensions and tags the same, which means any search by any combination of dimensions is equally efficient and fast, resulting in better query response times and usability for environments with high cardinality metrics.
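One common way to make every dimension equally searchable is an inverted index over dimension key-value pairs. The Python sketch below illustrates that general technique under simplifying assumptions (in-memory sets, exact-match filters only); it is not a description of how Splunk Observability Cloud is implemented, and the `DimensionIndex` class and its methods are hypothetical.

```python
from collections import defaultdict

class DimensionIndex:
    """Toy inverted index: maps each (key, value) pair to the IDs of matching series."""

    def __init__(self):
        self._postings = defaultdict(set)
        self._next_id = 0

    def add(self, dimensions: dict[str, str]) -> int:
        """Register a time series and index every one of its dimensions."""
        series_id = self._next_id
        self._next_id += 1
        for pair in dimensions.items():
            self._postings[pair].add(series_id)
        return series_id

    def find(self, **filters: str) -> set[int]:
        """Return IDs of series matching all filters, without scanning every series."""
        if not filters:
            return set(range(self._next_id))
        matches = [self._postings.get(pair, set()) for pair in filters.items()]
        return set.intersection(*matches)

index = DimensionIndex()
index.add({"host": "h1", "service": "cart", "tier": "Gold"})      # ID 0
index.add({"host": "h2", "service": "cart", "tier": "Silver"})    # ID 1
index.add({"host": "h1", "service": "checkout", "tier": "Gold"})  # ID 2

gold = index.find(tier="Gold")                       # no host filter required
gold_cart = index.find(service="cart", tier="Gold")  # intersection of two lookups
```

Because `host` is indexed exactly like `tier` or `service`, a query on any single dimension is a direct lookup rather than a scan over all series.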
High cardinality metrics are becoming common as organizations move away from monitoring that mainly focuses on system metrics and more towards critical indicators like customer experience and application health. Using them allows you to take more action on your data and unlock valuable insights. The real value of observability comes from answering questions about your business, and cardinality is a requirement to do this. You owe it to yourself to make sure that your monitoring and observability platform can handle arbitrary cardinality and can grow with your business.
Why not try Splunk Observability Cloud yourself? You can sign up to start a free trial of the Splunk Observability suite of products – from Infrastructure Monitoring to APM, Real User Monitoring, and Log Observer. Get a real-time view of your infrastructure and start solving problems faster today.