Today it's easier than ever to get data on almost anything. But that doesn’t mean that data is inherently good data, let alone information or knowledge that you can use. In many cases, bad data can be worse than no data. It can easily lead to false conclusions.
So, how do you know that your data is reliable and productive? This is what we call data quality.
In this article, we'll go over the core aspects of data quality and what to consider when working with it. We won't get too technical, but these concepts form a solid foundation for building a truly data-driven approach.
Before we go into detail, remember that data quality is ultimately about common sense. It can be easy to get lost in numbers and metrics, but the core idea is simple:
Data quality means that your data gives you a full, accurate picture of the real world in a usable form.
On a more technical level, data quality is a measure of how well data serves the purpose it's intended for. While there are countless metrics for evaluating data quality quantitatively, we can group them into six vital elements or "dimensions" of data quality: accuracy, completeness, consistency, integrity, uniqueness and validity.
Each of these elements needs to be checked and managed with planning, rules and metrics to ensure that data can be used properly without creating a false view of the situations it represents. This is why data quality is so important. Take these examples:
Data quality management needs to address all these possibilities and more in a complete and careful manner.
We can break data quality down into several core elements. One way to see into and better understand these elements is with data observability, which can power more efficient data pipelines and workflows.
We’ll use a simple example to illustrate these elements: You survey the ages of your customers to better understand who is using your product.
Does the data match the real world? Put simply, accuracy is the measure of whether the data is "correct" or not. For example, if customers lie about their age or make a typo, that data would not be accurate.
Is all relevant data included? This is a vital aspect to address during data collection. For example, if parents fill out the survey, data on how many of your customers are children may be missing.
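As a rough illustration, one part of completeness can be measured as the share of responses that actually contain an age. This is a minimal sketch with made-up survey data, and the function name `completeness` is hypothetical. Note that it only catches blank fields: it cannot tell you that a whole group (such as children) never appears in the survey at all, which still takes careful thought about how the data was collected.

```python
from typing import Optional


def completeness(ages: list[Optional[int]]) -> float:
    """Return the share of survey responses that contain an age (0.0 to 1.0)."""
    if not ages:
        return 0.0
    filled = sum(1 for age in ages if age is not None)
    return filled / len(ages)


# Hypothetical survey responses; None marks a customer who left the age blank.
responses = [34, None, 52, 41, None, 29, 38, 45]

ratio = completeness(responses)
print(f"{ratio:.0%} of responses include an age")
```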
Is the data the same wherever we look at it? Suppose you have two shops collecting this data. One enters ages as a word ("thirty-seven"), and the other as a number ("37"). If you then want to combine the data and run statistics on it, you'd have to reenter one data set or the other. Data consistency refers to a standard format and data collection methodology that avoids this kind of conflict by looking at:
Does the data stay the same over time? Once you collect the survey data, you'll likely be using it for a while to process and gain value from it. If it is reentered at some stage, and certain records aren't entered, its data integrity is compromised.
Is each data point collected only once? If the same customer fills out the survey over and over, they skew the data toward their age bracket.
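A simple way to check uniqueness might look like the sketch below. It assumes each response carries a customer identifier, which the survey example does not specify; the data and the function names `duplicate_ids` and `deduplicate` are hypothetical.

```python
from collections import Counter

# Hypothetical responses as (customer_id, age) pairs.
# Customer "c2" filled out the survey three times.
responses = [("c1", 34), ("c2", 52), ("c2", 52), ("c3", 41), ("c2", 52)]


def duplicate_ids(rows):
    """Return the customer IDs that appear more than once."""
    counts = Counter(customer_id for customer_id, _ in rows)
    return sorted(cid for cid, n in counts.items() if n > 1)


def deduplicate(rows):
    """Keep only the first response per customer, preserving order."""
    seen = set()
    unique = []
    for customer_id, age in rows:
        if customer_id not in seen:
            seen.add(customer_id)
            unique.append((customer_id, age))
    return unique


print(duplicate_ids(responses))
print(deduplicate(responses))
```

Keeping the first response is only one possible policy. Depending on the survey, keeping the most recent response may make more sense.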
Does the data make sense? This is similar to accuracy but refers to formatting and other aspects. For example, if an age is "3#", we can discard that data point because "#" is not a digit.
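The consistency and validity checks described above can be sketched together: convert every entry to one standard format (an integer) and reject anything that cannot be an age. This is an illustration only. The word lookup is deliberately tiny (it does not cover "ten" through "nineteen"), the 0 to 120 range is an assumed sanity limit, and `normalize_age` is a hypothetical helper name.

```python
from typing import Optional

# Small illustrative lookups; a real system would need a fuller number-word parser.
UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}


def words_to_int(text: str) -> Optional[int]:
    """Parse simple number words such as 'thirty-seven'; return None if unknown."""
    tens, _, unit = text.partition("-")
    if tens in TENS and (unit == "" or unit in UNITS):
        return TENS[tens] + UNITS.get(unit, 0)
    return UNITS.get(text)


def normalize_age(raw: str) -> Optional[int]:
    """Return the age as an integer, or None if the entry is not a valid age."""
    text = raw.strip().lower()
    if text.isascii() and text.isdigit():
        age = int(text)
    else:
        age = words_to_int(text)
    # Assumed plausibility range; adjust to your own data.
    if age is None or not 0 <= age <= 120:
        return None
    return age


for entry in ["37", "thirty-seven", "3#", "250"]:
    print(repr(entry), "->", normalize_age(entry))
```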
Let's go into a little more detail about data integrity. When learning about data quality, it's common to see data integrity listed as an aspect of data quality. However, when you look up data integrity, the information about it often lists the same aspects as data quality. Certain sources may even define data integrity as larger in scope than data quality.
This is because data integrity is, loosely put, data quality over time.
After we collect and handle data, it rarely sits around collecting dust. We use, process and communicate data, often in several stages, across multiple entities, internal or external. Organizations may…
With these steps, data quality can vary. One team might change the data format to suit their system, or they might altogether discard certain data points. However, data quality remains just as relevant no matter what happens to the data.
Data integrity covers the "resilience" of data quality. It often covers aspects of data protection and data security — how well the data is handled and preserved — as well as the quality of the original data. That means: we can see integrity as part of data quality and vice versa. Data integrity deals with the same areas as mentioned above because these same areas are what need to be maintained during data-handling events.
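One common way to check that data survived a handling step is to compare a fingerprint of the records before and after. The sketch below assumes records are (customer_id, age) pairs and uses a SHA-256 digest from Python's standard library; the data and the function names `fingerprint` and `missing_ids` are hypothetical.

```python
import hashlib


def fingerprint(records):
    """Return an order-independent SHA-256 digest of (customer_id, age) records."""
    digest = hashlib.sha256()
    for line in sorted(f"{cid}:{age}" for cid, age in records):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def missing_ids(before, after):
    """Return customer IDs present before the handling step but absent after it."""
    return sorted({cid for cid, _ in before} - {cid for cid, _ in after})


# Hypothetical data: record "c2" was lost when the survey was reentered.
original = [("c1", 34), ("c2", 52), ("c3", 41)]
reentered = [("c3", 41), ("c1", 34)]

if fingerprint(original) != fingerprint(reentered):
    print("Integrity check failed; missing records:", missing_ids(original, reentered))
```

The fingerprint only says that something changed. The follow-up comparison is what tells you which records were affected.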
Once you get into serious data management, each of the above elements will have multiple metrics and key performance indicators (KPIs). These can include both:
This overarching practice can be called data quality management.
Data collection usually involves more than one variable (not just age, as in our example), so these six dimensions must be applied to many fields at once, and things quickly get complicated. For this foundational article, however, the following are good rules of thumb when getting started with data management of any kind. They serve as ways to know you can trust your metrics later:
Wherever possible, pair these with quantifiable metrics to evaluate your data objectively. For example:
My favorite tip? Learn to view data quality as a combination of conceptual thought and objective evaluation!
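To show what pairing the dimensions with quantifiable metrics could look like, here is a minimal scorecard sketch for the age survey. Everything in it is an assumption for illustration: the data, the three ratios chosen, and especially the thresholds, which should come from your own requirements.

```python
# Hypothetical raw survey rows: (customer_id, age as entered). None = left blank.
rows = [("c1", "34"), ("c2", "52"), ("c2", "52"),
        ("c3", None), ("c4", "3#"), ("c5", "41")]

# Illustrative targets only; pick thresholds that fit your own use case.
THRESHOLDS = {"completeness": 0.95, "validity": 0.98, "uniqueness": 0.99}


def scorecard(rows):
    """Compute simple ratios (0.0 to 1.0) for three data quality dimensions."""
    if not rows:
        raise ValueError("no rows to evaluate")
    total = len(rows)
    answered = [raw for _, raw in rows if raw is not None]
    valid = [raw for raw in answered if raw.isascii() and raw.isdigit()]
    unique_ids = {cid for cid, _ in rows}
    return {
        "completeness": len(answered) / total,  # rows with any answer
        "validity": len(valid) / len(answered) if answered else 0.0,  # answers that are digits
        "uniqueness": len(unique_ids) / total,  # distinct customers per row
    }


def failing(scores, thresholds):
    """Return the names of dimensions that fall below their threshold."""
    return sorted(name for name, score in scores.items() if score < thresholds[name])


scores = scorecard(rows)
for name, score in scores.items():
    print(f"{name}: {score:.0%}")
print("Below threshold:", failing(scores, THRESHOLDS))
```

The numbers are the objective part. Deciding which thresholds matter, and why a dimension is failing, is where the conceptual thinking comes in.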
Data is used in all industries today. It can be incredibly useful for drawing conclusions and for providing understandable pictures of complex systems. However, precisely because these systems can be so complicated and vast, it's vital to ensure that the data you're using reflects the real world and is genuinely usable. In many cases, for any number of reasons, data may be inconclusive or not usable at all.
Paying attention to data quality helps ensure this by focusing on the key components that data needs in order to be helpful and correct.
See an error or have a suggestion? Please let us know by emailing ssg-blogs@splunk.com.
This posting does not necessarily represent Splunk's position, strategies or opinion.