By Mike Mackrory
Modern cloud-native and microservice architectures have revolutionized how we build, deploy, and support online applications. We have access to a seemingly endless supply of resources and scale, and we can achieve redundancy and fault-tolerance with relative ease. There are innovative solutions at our fingertips, like system monitoring, observability, and log aggregation and searchability. It’s an exciting ecosystem that is only getting better.
However, all of this growth and opportunity comes at a price, and the price tag will make your finance department nervous. In this article, we're going to take a deep dive into data resolution and the question of cost. How do you balance building observability into your systems and ensuring that you have access to enough data to identify and resolve problems without driving your organization into the red? Keep reading to find out.
Observability, Data Resolution, Cardinality and Sampling
Before we immerse ourselves in balancing data resolution with sampling and retention, we should make sure that we are on the same page. First, we will define what observability means in this context, and then we will discuss data resolution, cardinality and sampling from that perspective.
When we use the term observability, we’re talking about determining the state of a system based on measurements that the system has externalized. For our applications, this can mean looking at metrics such as:
Percentage of computing power utilization
Available memory vs. allocated memory
The volume of network traffic being sent and received
Number of response codes indicating successful and unsuccessful requests
To fully harness the power of observability, we need to enhance those metrics by relating them to the users of our services. Real-user monitoring (RUM) gives us insights into how the raw metrics affect our users. RUM might indicate that a user is waiting several seconds for a request to return with a response, even though each of the services involved in processing the request may be within Service Level Agreement (SLA) limits. We can also include tracing data to identify bottlenecks and opportunities for improvement as the request is processed.
The observability of our systems depends on how we configure and manage the data the system is reporting. There is an art to finding the right balance between collecting too few data points and not getting an accurate view of the system, and collecting so many metrics that gathering, storing and monitoring them affects your applications' performance and becomes overly burdensome. Cardinality, data resolution and sampling are critical aspects of finding this balance.
Cardinality is a measurement of the unique combinations of metric values and dimension values. Where metrics describe what we are measuring, dimensions describe who we are measuring, and allow us to make sense of the metrics. Adding an additional dimension increases the cardinality of our data. Increasing the cardinality increases the context and usefulness of the metrics we’re gathering, and factors into the balance between getting enough data and gathering too much data.
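To make that concrete, here is a minimal Python sketch of how cardinality multiplies. The metric and dimension names below are made up for illustration; the point is that every unique combination of dimension values becomes its own time series, so each added dimension multiplies the count.

```python
from itertools import product

# Hypothetical dimensions for a single request-count metric.
dimensions = {
    "service": ["checkout", "cart", "search"],    # 3 values
    "region": ["us-east-1", "eu-west-1"],         # 2 values
    "status_code": ["200", "404", "500"],         # 3 values
}

# Each unique combination of dimension values is a distinct time series.
series = list(product(*dimensions.values()))
cardinality = len(series)  # 3 * 2 * 3 = 18 series for one metric

print(cardinality)
```

Adding a fourth dimension with ten values (say, a customer ID) would push this single metric from 18 to 180 time series, which is why high-cardinality dimensions are where observability bills tend to grow fastest.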
When we talk about data resolution, we're referring to the density of the data collected, measured as the interval between data points. Gathering data at a higher resolution (with intervals measured in milliseconds, perhaps) produces very accurate and specific observability results. If we decrease the resolution (for example, by taking measurements every second, or even just a few times a minute), we lose some accuracy, but we can still observe trends over time.
We can gather metrics less frequently, or we can use sampling to reduce the density of our dataset. With sampling, we take a representative percentage of the data and reduce its quantity while maintaining sufficient distribution to identify trends.
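The difference between the two approaches can be sketched in a few lines of Python. The latency values below are synthetic; downsampling averages fixed windows into fewer points, while sampling keeps a random representative fraction of the raw points.

```python
import random

# Synthetic stream of latency measurements (one per millisecond).
random.seed(42)
raw = [random.uniform(5.0, 50.0) for _ in range(10_000)]

def downsample(points, bucket):
    """Lower the resolution: average each window of `bucket` points into one."""
    return [sum(points[i:i + bucket]) / bucket
            for i in range(0, len(points), bucket)]

def sample(points, rate):
    """Keep a representative random fraction of the raw points."""
    return [p for p in points if random.random() < rate]

low_res = downsample(raw, 1000)   # 10 points; trends preserved, detail lost
sampled = sample(raw, 0.10)       # roughly 1,000 points; distribution preserved
```

Both techniques shrink the dataset by about 90%, but they fail differently: downsampling hides short spikes inside an average, while sampling may simply miss a rare event altogether.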
The Importance of High Data Resolution
As we mentioned above, higher data resolution can yield more accurate results. More accurate results increase context and make it easier and faster to identify the root cause of incidents and resolve them. Suppose you’re comparing your observability data to other data sources, for example, and you’re looking for specific events to see how they affect the system. In this case, you would benefit from having access to the largest possible pool of data.
Given that more data can yield more accurate results and provide more specific insights into your system's health, it's understandable that you (as a practitioner) would want to ensure that you're collecting metrics as frequently and as comprehensively as possible. Collecting and storing all data at the highest resolution is a concept called full-fidelity. In the context of RUM, which we discussed above, full-fidelity allows you to identify and view each individual user's journey for review and analysis.
Making a Smart Investment
Monitoring and observability require an investment in infrastructure, whether your own or a third-party provider's. By itself, the cost of a single measurement is negligible. But when you multiply that cost by the number of measurements within a system, and then again across all of your systems, it can add up to a significant line item in your organization's IT budget.
The most important question to ask yourself and your organization is why you're collecting monitoring and observability data in the first place. Ultimately, it's to gain insight into how your systems are performing, and to understand that performance in the context of your users' experiences. With that context, collecting and storing metrics at full-fidelity and unlimited cardinality is essential.
Balancing Data Resolution, Retention, and Your Budget
The key to an effective observability solution that doesn’t break the bank is to bring your data resolution needs, your retention policy, and your budget into balance. Fortunately, many providers allow you to configure a data resolution policy that lets you scale back the volume of retained data over time. For long term analysis, you can opt in to policies that average and consolidate metrics data for long-term retention, which shrinks the size of the dataset while maintaining enough statistically significant data to monitor trends and perform some level of analysis with older data.
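As a sketch of what such a policy does under the hood, the Python below averages consecutive points into coarser tiers. The specific tiers (one second, one minute, one hour) are assumed for illustration and are not any particular vendor's defaults.

```python
def rollup(points, factor):
    """Average consecutive groups of `factor` points into a single point."""
    return [sum(points[i:i + factor]) / factor
            for i in range(0, len(points) - factor + 1, factor)]

# One hour of synthetic data at 1-second resolution.
second_data = [float(i % 60) for i in range(3600)]

# Tiered retention: recent data stays fine-grained, older data is consolidated.
minute_data = rollup(second_data, 60)   # 60 points, 60x less storage
hour_data = rollup(minute_data, 60)     # 1 point, 3600x less storage
```

Each tier keeps the average intact for trend analysis while shrinking storage by the rollup factor, which is exactly the trade the retention policies described above are making.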
Data retention policies also allow you to migrate older data from your observability platform to a more cost-effective location. These policies reduce storage costs while still allowing the system to move data back into the observability system if necessary.
Good communication and solid strategy are at the heart of an effective plan. As a technical leader within your organization, you can make a case for observability and recommend a strategy that reduces unnecessary expenditures while also meeting the needs of your engineers. By working with other leaders from finance, security, and risk management, you can find a solution that meets your organization’s needs.
If you’re anxious about going on this journey alone, you needn’t worry. Many third-party providers of observability software have extensive experience in helping organizations determine the right level of data resolution for their needs. The beauty of our industry, too, is that it encourages experimentation and feedback. Most providers won’t lock you into a long-term contract upfront. Instead, they’ll allow you to experiment and find the right plan for your organization while providing expert guidance and consulting based on industry best practices.
Recent developments in the observability space, including OpenTelemetry, ease adoption by decreasing the overhead required for engineers to implement observability solutions in their code. Many popular open-source projects now include OpenTelemetry instrumentation. OpenTelemetry also protects projects from vendor lock-in by providing an open standard supported by the dominant providers in the space, along with functionality to transform metrics into different formats.
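As an illustrative example of that vendor neutrality, an OpenTelemetry Collector pipeline can receive metrics once and export them to multiple backends at the same time. The endpoints below are placeholders, not real services:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheus:              # expose metrics in Prometheus format
    endpoint: "0.0.0.0:8889"
  otlphttp:                # forward the same metrics to a vendor backend
    endpoint: "https://example-vendor.invalid/v1/metrics"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus, otlphttp]
```

Because the instrumentation in your code only speaks OTLP to the collector, swapping or adding a provider is a configuration change rather than a rewrite.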