The amount of data organizations handle has created the need for faster data access and processing. Data Denormalization is a widely used technique to improve database query performance.
This article discusses data denormalization, its importance, how it differs from data normalization, and common denormalization techniques. Importantly, I’ll also look at the pros and cons of this approach.
What is Data Denormalization?
Data denormalization is the process of introducing some redundancy into a previously normalized database with the aim of optimizing query performance. It adds pre-computed, redundant data using different techniques to work around the read-performance limits of normalized data. These techniques include:
- Table splitting
- Adding derived and redundant columns
- Mirrored tables
However, data denormalization introduces a trade-off: faster reads come at the cost of slower, more complex writes.
How data denormalization works
Let’s first look at a normalized database. A normalized database will have a single copy of each piece of information, and the related data will reside in separate tables. You have to use queries to combine data from multiple tables and use that data in real-world applications.
However, as the amount of data grows, or as complex queries join more and more tables, database performance can degrade significantly, and the database can even crash under load.
For example, suppose you run an online shop where customers can order products. Such an e-commerce platform typically stores customer data in one table and order data in another. The orders table contains a foreign key referencing the customers table to relate each order to its customer.
Suppose the website needs to display all orders placed by a customer. A normalized data store would require a join between the orders and customer tables to achieve that. Imagine the online shop got a high volume of orders and customers. In that case, this join operation could be computationally intensive and slow down the performance of the website.
In this instance, we can introduce data denormalization to improve performance. You can either:
- Add a customer name column to the orders table.
- Introduce a separate table and merge the data from two tables.
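To make the scenario concrete, here is a minimal sketch of the normalized schema using SQLite. The table and column names are illustrative, not from any real system:

```python
import sqlite3

# Hypothetical normalized e-commerce schema: one copy of each piece
# of data, related across tables by a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        total       REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Displaying orders with customer names requires a join -- the
# operation that becomes expensive at scale.
rows = conn.execute("""
    SELECT o.order_id, c.name, o.total
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
""").fetchall()
print(rows)
```

With a handful of rows this join is instant; the denormalization techniques below target the case where it is not.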
Comparing data denormalization vs data normalization
Data normalization is the process of removing data redundancy by keeping exactly one copy of each piece of data in tables. It maintains the relationships between data and enforces a consistent structure. There are four main normal forms: first, second, and third normal forms (1NF, 2NF, 3NF), and Boyce-Codd Normal Form (BCNF, sometimes called 3.5NF).
A normalized database helps standardize the data across the organization and ensures logical data storage. Normalization also offers organizations a clean data set for various processes, keeps insert, update, and delete operations efficient, and reduces data anomalies.
So, we can sum up the differences between data denormalization and normalization in two key ways:
- Data normalization removes redundancy from a database, leaving non-redundant, standardized data. Denormalization, on the other hand, combines data from multiple tables into a single table that can be queried faster.
- Data normalization is generally used when joins between tables are cheap and the workload is dominated by update, delete, and insert operations. Denormalization is useful when a database runs many costly join queries.
Data denormalization techniques: How to denormalize data
Database administrators use several data denormalization techniques depending on the scenario. However, remember that each technique has its own pros and cons. Here are some examples of data denormalization techniques used by database specialists:
Technique 1. Introducing a redundant column/Pre-joining tables
This technique can be used when join operations are expensive and data from multiple tables is frequently used together. Here, that frequently used data is added to a single table.
For example, let’s say there are two tables called customer and order. If you want to display customer orders along with their names, adding the customer name to the order table will reduce the expensive join operation. However, it will introduce massive redundancies. Here's an illustration:
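A minimal SQLite sketch of this technique follows; the schema is hypothetical. The `ALTER TABLE` adds the redundant column, and a correlated `UPDATE` copies the names across:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 2, 15.0);
""")

# Denormalize: copy the customer name into the orders table so the
# frequent "orders with names" query no longer needs a join.
conn.executescript("""
    ALTER TABLE orders ADD COLUMN customer_name TEXT;
    UPDATE orders SET customer_name =
        (SELECT name FROM customers c
         WHERE c.customer_id = orders.customer_id);
""")

rows = conn.execute(
    "SELECT order_id, customer_name, total FROM orders").fetchall()
print(rows)
```

The cost is the redundancy the article mentions: every future insert or customer rename must now touch the copy in `orders` as well.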
Technique 2. Table splitting
Table splitting is the process of decomposing a table into multiple smaller tables so they can be queried and managed easily. Table splitting can be done in two ways: horizontal table splitting and vertical table splitting.
Horizontal table splitting
Horizontal splitting divides a table’s rows into smaller tables that all share the same columns. This approach is useful when the data can be separated by region, physical location, task, and many other criteria.
For example, imagine a table containing student information for all departments in the science faculty of a university. As the diagram illustrates, this table can be split according to each department, such as computer science, chemistry, maths and biology.
Here, only a smaller data set will have to be queried compared with the original table. Thus, this technique enables faster query performance for department-based queries.
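Here is a small SQLite sketch of horizontal splitting, using illustrative department names; each per-department table keeps the original columns but only a subset of the rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE students (
    student_id INTEGER PRIMARY KEY, name TEXT, department TEXT)""")
conn.executemany("INSERT INTO students VALUES (?, ?, ?)", [
    (1, 'Ann', 'computer_science'),
    (2, 'Ben', 'chemistry'),
    (3, 'Cara', 'computer_science'),
])

# Horizontal split: one table per department, all with the same columns.
for dept in ('computer_science', 'chemistry'):
    conn.execute(f"""CREATE TABLE students_{dept} AS
                     SELECT * FROM students WHERE department = '{dept}'""")

# Department-scoped queries now scan a much smaller table.
rows = conn.execute(
    "SELECT name FROM students_computer_science ORDER BY name").fetchall()
print(rows)
```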
Vertical table splitting
Vertical splitting divides a table by columns, repeating the primary key in each partition.
For example, suppose a hospital maintains a ‘Patients’ table with patient ID, name, address and medical history columns. We can create two new tables from it using vertical partitioning: ‘Patient_details’ and ‘Patient_medical_history,’ as shown in the below figure.
This approach is best suited when some table columns are accessed much more frequently than others. It retrieves only the required attributes, eliminating unnecessary data.
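The hospital example above can be sketched in SQLite as follows; table and column names mirror the article’s hypothetical schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE patients (
    patient_id INTEGER PRIMARY KEY, name TEXT,
    address TEXT, medical_history TEXT)""")
conn.execute("INSERT INTO patients VALUES (1, 'Dana', '12 Elm St', 'none')")

# Vertical split: each new table keeps the primary key plus a
# subset of the columns.
conn.executescript("""
    CREATE TABLE patient_details AS
        SELECT patient_id, name, address FROM patients;
    CREATE TABLE patient_medical_history AS
        SELECT patient_id, medical_history FROM patients;
""")

# Queries that only need demographics never touch the bulky
# medical-history column.
row = conn.execute(
    "SELECT name, address FROM patient_details WHERE patient_id = 1"
).fetchone()
print(row)
```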
Technique 3. Adding derived columns
Consider the following example. Let’s say there are two tables, Student and Student_Grades:
- The Student table has only student information.
- The Student_Grades table has marks for each assignment, along with some other data.
If the application requires displaying the total marks for the students with their details, we can add a new derived column that contains the total marks for all the assignments for each student. Therefore, there is no need to calculate the total marks each time you query the database.
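A minimal SQLite sketch of the derived column, with illustrative names: the total is computed once at write time, so reads skip the `SUM()`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE student_grades (
        student_id INTEGER, assignment TEXT, marks INTEGER);
    INSERT INTO students VALUES (1, 'Ann');
    INSERT INTO student_grades VALUES (1, 'a1', 40), (1, 'a2', 35);
""")

# Derived column: store the pre-computed total alongside the student.
conn.executescript("""
    ALTER TABLE students ADD COLUMN total_marks INTEGER;
    UPDATE students SET total_marks =
        (SELECT SUM(marks) FROM student_grades g
         WHERE g.student_id = students.student_id);
""")

# Reads now fetch the total directly, with no aggregation or join.
total = conn.execute(
    "SELECT total_marks FROM students WHERE student_id = 1").fetchone()[0]
print(total)
```

The derived value must be recomputed whenever a grade changes, which is the write-side cost of this technique.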
Technique 4. Using mirrored tables
This technique creates a full or partial copy of an existing table, stored in a separate location and optimized for faster query performance. Generally, the mirrored table is tuned for read-heavy workloads using techniques like additional indexes and data partitioning, and it can serve read-heavy processes like analytics queries.
This approach involves creating replicas of databases and storing them on separate database instances or physical servers. However, it involves complexities like maintaining multiple copies of the data and keeping them in sync, which can be costly and resource-intensive.
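The idea can be sketched with two SQLite connections standing in for two database instances; a real deployment would use proper replication and a sync mechanism rather than this one-shot copy:

```python
import sqlite3

# Primary database: handles writes.
primary = sqlite3.connect(":memory:")
primary.execute(
    "CREATE TABLE orders (order_id INTEGER, region TEXT, total REAL)")
primary.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                    [(1, 'EU', 10.0), (2, 'US', 20.0), (3, 'EU', 5.0)])

# Mirror: a full copy in a separate database, optimized for reads
# with an extra index the primary does not carry.
rows = primary.execute("SELECT * FROM orders").fetchall()
mirror = sqlite3.connect(":memory:")
mirror.execute(
    "CREATE TABLE orders (order_id INTEGER, region TEXT, total REAL)")
mirror.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
mirror.execute("CREATE INDEX idx_orders_region ON orders(region)")

# Analytics queries hit the mirror, leaving the primary free for writes.
eu_total = mirror.execute(
    "SELECT SUM(total) FROM orders WHERE region = 'EU'").fetchone()[0]
print(eu_total)
```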
Technique 5. Materialized views
Materialized views are pre-computed query results stored in a separate table. They typically store the results of expensive join and aggregation queries over frequently accessed data. The next time the data is needed, the database can pull it from the view rather than execute the same query repeatedly.
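SQLite has no native materialized views, so the sketch below simulates one by materializing an aggregation into an ordinary table (names are illustrative); databases like PostgreSQL and Oracle provide this directly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL);
    INSERT INTO orders VALUES (1, 1, 25.0), (2, 1, 40.0), (3, 2, 15.0);
""")

# Materialize the expensive aggregation once; a real system would
# refresh it on a schedule or via triggers.
conn.execute("""
    CREATE TABLE order_totals AS
        SELECT customer_id, SUM(total) AS lifetime_total
        FROM orders GROUP BY customer_id
""")

# Later reads pull the pre-computed result instead of re-aggregating.
rows = conn.execute(
    "SELECT customer_id, lifetime_total FROM order_totals "
    "ORDER BY customer_id").fetchall()
print(rows)
```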
Pros of data denormalization
Data Denormalization brings several advantages for organizations.
Improve user experience through enhanced query performance
Querying data from a normalized data store may require multiple joins across different tables, depending on the requirement. As the data grows larger, join operations slow down, which can negatively impact the user experience, especially when those operations back frequently used functionality.
Data denormalization allows us to reduce the number of joins between tables by keeping frequently accessed data in redundant tables.
Reduce complexity, keep the data model simple
Data denormalization reduces query complexity by reducing the number of joins. It enables developers and other application users to write simple, maintainable code, and even novice developers can understand the queries and work with them easily.
Plus, this simplicity will help reduce bugs associated with database operations significantly.
Enhance application scalability
Denormalization reduces the number of database transactions when reading data. This approach is particularly helpful when a high user load results in a heavy load of database transactions. This reduced number of transactions accommodates varying user loads, improving the scalability of applications.
Generate data reports faster
Organizations use data to generate countless reports, such as usage statistics and sales reports. Generating such reports can involve aggregating and summarizing the whole data set. Data denormalization techniques like mirrored tables allow organizations to optimize databases specifically for daily report generation without affecting the performance of the master tables.
Cons of data denormalization
As discussed in the above section, data denormalization offers several advantages. However, this technique can also have some disadvantages that you may need to consider when using it.
- The most obvious disadvantage is increased data redundancy.
- There can be inconsistencies between data sets. For example, consider mirrored databases. It requires taking replications, which need to be synced to make them up-to-date. Inconsistencies can arise if there is a failure in the replica.
- Techniques like data splitting and mirrored tables will require additional storage space, which can be costly.
- Denormalization also increases the complexity of the data schema. It will be harder to maintain the data store as the number of tables increases.
- Inserts and updates become more costly, since every redundant copy must be kept consistent.
- Maintenance costs can be high due to the increased complexity and redundancy of the data.
Data: to normalize or not
Data normalization removes redundant data, while denormalization deliberately adds redundancy to normalized databases to improve read performance at the expense of update and insert performance. There are several denormalization techniques, such as table splitting, adding derived columns, and pre-joining tables. Advantages of this approach include improved user experience, query performance, scalability, faster report generation, and reduced query complexity.
Yet, denormalization also introduces several cons, like data redundancy, increased complexity, maintenance and storage costs and data inconsistencies.