Data is much like wine (😉): having it doesn’t mean you have the quality kind.
Today it's easier than ever to get data on almost anything. But that doesn’t mean that data is inherently good data, let alone information or knowledge that you can use. In many cases, bad data can be worse than no data, and it can easily lead to false conclusions.
So, how do you know that your data is reliable and productive? This is what we call data quality.
In this article, we'll go over the core aspects of data quality and what to consider when working with it. We won't get too technical, but these concepts form a solid foundation for growing a truly data-driven approach.
What is data quality?
Before we go into detail, remember that data quality is ultimately about common sense. It can be easy to get lost in numbers and metrics, but the core idea is simple:
You want to know that your data is giving you a full, accurate picture of the real world in a usable form.
On a more technical level, data quality is a measure of how well data serves its intended purpose. While there are countless metrics for evaluating data quality quantitatively, we can group them into six vital elements: accuracy, completeness, consistency, integrity, uniqueness and validity.
Each of these elements needs to be checked and managed with planning, rules and metrics to ensure that data can be used properly without creating a false view of the situations it represents. This is why data quality is so important. Take these examples:
- Missing data can skew whole datasets toward the data that remains.
- Formatting issues can result in duplicated data points, or even parallel sets of data that lead to conflicting conclusions.
- Multiple rounds of inconsistent data handling can completely change the conclusions you might draw from the data.
Data quality management needs to address all these possibilities and more in a complete and careful manner.
Core elements of data quality
We can break data quality down into several core elements. One way to gain insight into these elements is data observability, which can power more efficient data pipelines and workflows.
We’ll use a simple example to illustrate these elements: You survey the ages of your customers to better understand who is using your product.
Accuracy
Does the data match the real world? Put simply, accuracy measures whether the data is correct. For example, if customers lie about their age or make a typo, that data is not accurate.
Completeness
Is all relevant data included? This is a vital aspect to address during data collection. For example, if parents fill out the survey, data on how many of your customers are children may be missing.
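To see why a missing group matters, here is a minimal Python sketch with made-up ages (not real survey data). It assumes the children's ages never reach the survey because only their parents answer, and it compares the average of what was collected with the average of the full customer base.

```python
from statistics import mean

# Hypothetical ages, invented purely for illustration.
adult_ages = [34, 41, 38, 45]  # parents who filled out the survey
child_ages = [8, 10, 12, 9]    # customers whose ages were never collected

survey_mean = mean(adult_ages)             # what the incomplete data says
true_mean = mean(adult_ages + child_ages)  # what the full picture says

print(f"Survey average age: {survey_mean}")
print(f"Actual average age: {true_mean}")
```

In this toy case the survey suggests customers are around 40, while the real average is closer to 25. Nothing in the collected data hints at the gap.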
Consistency
Is the data the same wherever we look at it? Suppose you have two shops collecting this data. One enters ages as words ("thirty-seven") and the other as numbers ("37"). If you then want to combine the data and run statistics, you'd have to re-enter one dataset or the other. Data consistency refers to a standard format and data collection methodology that avoids this kind of conflict by looking at:
- Data-entry rules
- Data normalization
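As a rough illustration of normalization, the sketch below converts both shops' entries into one numeric format. The function name `normalize_age` and the sample entries are hypothetical, and the word parser only understands English numbers from 0 to 99, so treat it as a starting point rather than a complete solution.

```python
# Lookup tables for spelled-out numbers from 0 to 99.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def normalize_age(raw: str) -> int:
    """Convert an age entered as digits or as English words to an int."""
    text = raw.strip().lower()
    if text.isdecimal():
        return int(text)
    words = text.replace("-", " ").split()
    if not words:
        raise ValueError("empty age entry")
    total = 0
    for word in words:
        if word in UNITS:
            total += UNITS[word]
        elif word in TENS:
            total += TENS[word]
        else:
            raise ValueError(f"unrecognized age entry: {raw!r}")
    return total


# Hypothetical entries from the two shops.
shop_a = ["thirty-seven", "twenty", "sixty-one"]
shop_b = ["37", "45", "19"]
combined = [normalize_age(entry) for entry in shop_a + shop_b]
print(combined)
```

The parser simply adds up the words it recognizes and does not check word order, which is one reason clear data-entry rules at the source are usually cheaper than cleanup afterward.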
Integrity
Does the data stay the same over time? Once you collect the survey data, you'll likely use it for a while to process it and gain value from it. If the data is re-entered at some stage and certain records are left out, its integrity is compromised.
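One simple way to notice this kind of loss is to compare a fingerprint of the data before and after a handling step. The sketch below is an assumption-laden example: the `(customer_id, age)` record layout and the `fingerprint` helper are hypothetical, and a SHA-256 digest is just one of several reasonable choices.

```python
import hashlib


def fingerprint(records):
    """Return a SHA-256 digest of the records, independent of their order."""
    canonical = repr(sorted(records)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical (customer_id, age) pairs before and after a re-entry step.
original = [(1, 37), (2, 45), (3, 19)]
reentered = [(2, 45), (1, 37)]  # record 3 was lost along the way

intact = fingerprint(original) == fingerprint(reentered)
missing_ids = {cid for cid, _ in original} - {cid for cid, _ in reentered}

print("Data intact:", intact)
print("Missing customer IDs:", missing_ids)
```

The digest only tells you that something changed. The set difference is what points to the records that went missing.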
Uniqueness
Is each data point collected only once? If the same customer fills out the survey over and over, they skew the data toward their age bracket.
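If each response carries some customer identifier, repeated submissions can be filtered out. The sketch below assumes such an identifier exists (the survey in our example does not say so) and keeps only the first response per customer. The IDs and ages are invented.

```python
def deduplicate(responses):
    """Keep only the first response from each customer."""
    seen = set()
    unique = []
    for customer_id, age in responses:
        if customer_id not in seen:
            seen.add(customer_id)
            unique.append((customer_id, age))
    return unique


# Hypothetical responses; customer "c1" answered three times.
responses = [("c1", 37), ("c2", 45), ("c1", 37), ("c3", 19), ("c1", 37)]
unique_responses = deduplicate(responses)
print(unique_responses)
```

Keeping the first response is an arbitrary choice here. Depending on your goals, keeping the most recent one may make more sense.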
Validity
Does the data make sense? This is similar to accuracy but refers to formatting and other aspects. For example, if an age is entered as "3#", we can discard that data point because "3#" is not a number.
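A validity check for the age example might look like the following sketch. It assumes ages are already entered as digits, and the upper bound of 120 is an assumed business rule, not something the survey defines.

```python
MAX_PLAUSIBLE_AGE = 120  # assumed upper bound; adjust to your own rules


def is_valid_age(raw: str) -> bool:
    """Return True if the entry is a whole number within a plausible range."""
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        return False
    return int(text) <= MAX_PLAUSIBLE_AGE


# Hypothetical survey entries, including the "3#" case from above.
entries = ["37", "3#", "", "245", "19"]
valid = [e for e in entries if is_valid_age(e)]
discarded = [e for e in entries if not is_valid_age(e)]

print("Valid:", valid)
print("Discarded:", discarded)
```

Note that a valid entry is not necessarily an accurate one: "19" passes this check even if the customer is really 91.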
Data quality vs. data integrity
Let's go into a little more detail about data integrity. When learning about data quality, it's common to see data integrity listed as an aspect of data quality. However, when you look up data integrity, the information about it often lists the same aspects as data quality. Certain sources may even define data integrity as larger in scope than data quality.
This is because data integrity is, loosely put, data quality over time.
After we collect and handle data, it rarely sits around collecting dust. We use, process and communicate data, often in several stages, across multiple entities, internal or external. Organizations may…
- Use varying parts of the data
- Add data, or remove data that isn't relevant to them
- Take other actions on some of the data
With these steps, data quality can vary. One team might change the data format to suit their system, or they might altogether discard certain data points. However, data quality remains just as relevant no matter what happens to the data.
Data integrity covers the "resilience" of data quality. It often covers aspects of data protection and data security (how well the data is handled and preserved) as well as the quality of the original data. That means we can see integrity as part of data quality and vice versa. Data integrity deals with the same areas as mentioned above because these same areas are what need to be maintained during data-handling events.
Get started with data quality management
Once you get into serious data management, each of the above elements will have multiple metrics and key performance indicators (KPIs). These can include the number of data points as well as the number and results of various verification and comparison methods. This overarching practice is called data quality management.
Data collection usually involves more than one variable (our example used only age), meaning that these six elements must apply to numerous factors and quickly get complicated. For this foundational article, however, the following rules of thumb are a great starting point for data management of any kind, and they help you know you can trust your metrics later:
- Data collection and sources. Consider what data you need for your goals. Decide what variables you're analyzing, then ask what sources you have available and how they cover this area. Then, consider the specifics of the sources, such as their formatting and what data they may be lacking. Employ checks or audits to ensure accuracy in data collection and address possible errors and issues.
- Data validation and monitoring. Check whether the data conforms to the expectations and standards set for it. Ensure that all entries share the same format and that no duplicates hide in different formats. Check that data points make sense for their format and data type.
- Timeliness. Spend some time considering when the data is needed, and what can be done to speed up and automate data quality management. This can be especially helpful for real-time data streams.
Wherever possible, pair these with quantifiable metrics to evaluate your data objectively. For example:
- To evaluate accuracy, compare data against reference datasets.
- To account for missing data, keep track of values discarded during data validation.
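Both metrics above can be reduced to simple ratios. The following is a minimal sketch, assuming ages are keyed by a customer ID and that some trusted reference dataset is available. The function names, IDs and numbers are all hypothetical.

```python
def accuracy_rate(collected, reference):
    """Share of collected values that match the reference for the same key."""
    shared = collected.keys() & reference.keys()
    if not shared:
        return 0.0
    matches = sum(1 for key in shared if collected[key] == reference[key])
    return matches / len(shared)


def discard_rate(total_entries, discarded_entries):
    """Share of entries thrown away during validation."""
    return discarded_entries / total_entries if total_entries else 0.0


# Hypothetical ages keyed by customer ID.
collected = {"c1": 37, "c2": 54, "c3": 19, "c4": 28}
reference = {"c1": 37, "c2": 45, "c3": 19, "c4": 28}  # assumed trusted source

print(f"Accuracy: {accuracy_rate(collected, reference):.0%}")
print(f"Discarded: {discard_rate(50, 4):.0%}")
```

Tracked over time, numbers like these make it easier to spot when a source or a handling step starts to degrade.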
My favorite tip? Learn to view data quality as a combination of conceptual thought and objective evaluation!
Data quality is critical, but not all data is
Data is used in all industries today. It can be incredibly useful for drawing conclusions and providing understandable pictures of complex systems. However, precisely because these systems can be so complicated and vast, it's vital to ensure that the data you're using reflects the real world and is genuinely usable. In many cases, the data may be inconclusive or not usable at all, for any number of reasons.
Proper attention to data quality helps ensure this by focusing on the key components that data needs in order to be helpful and correct.
This posting does not necessarily represent Splunk's position, strategies or opinion.