Data Lakehouses: Everything You Need To Know

A data lakehouse is a modern data architecture that combines the features of data lakes and data warehouses. That combination makes it suitable for a wide range of data analytics use cases, which is why it has become popular among many organizations.

This article explains data lakehouses, including how they emerged, how they shape up versus data lakes and data warehouses, their architecture, and finally, the pros and cons of using a data lakehouse. 

What is a Data Lakehouse?

A data lakehouse is a data management solution that brings the best features of a data lake and a data warehouse together in a single, unified platform. It addresses the limitations that data lakes and data warehouses have when each is used separately.

Here are the highlights:

  • This all-in-one platform stores data in raw form, just like a data lake: unstructured, semi-structured and structured. That makes it a cost-effective and flexible storage solution, just as any data lake is.
  • Data lakehouses add what data lakes lack. In a data lakehouse, you also get data management, governance, ACID transactions and data quality controls: the primary offerings of data warehouses.
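How a lakehouse can layer ACID guarantees on top of plain file storage is worth a small sketch. The core trick used by lakehouse table formats is to publish each change through a single atomic operation, such as an atomic file rename, so readers see either the whole commit or none of it. This toy Python version (the `_log` directory layout and file names are illustrative, not any real table format) captures the idea:

```python
import json
import os
import tempfile

def commit_version(table_dir: str, version: int, added_files: list) -> None:
    """Atomically publish a new table version: write the commit log
    entry to a temp file, then rename it into place. os.replace() is
    atomic on POSIX and Windows, so concurrent readers never observe
    a half-written commit."""
    os.makedirs(os.path.join(table_dir, "_log"), exist_ok=True)
    entry = {"version": version, "added": added_files}
    fd, tmp_path = tempfile.mkstemp(dir=table_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(entry, f)
    final_path = os.path.join(table_dir, "_log", f"{version:08d}.json")
    os.replace(tmp_path, final_path)  # the atomic "commit point"

def latest_version(table_dir: str) -> int:
    """Readers discover the current snapshot from the commit log."""
    log = sorted(os.listdir(os.path.join(table_dir, "_log")))
    return int(log[-1].split(".")[0]) if log else -1
```

Production table formats add much more (checkpoints, conflict detection, schema enforcement), but the commit log plus an atomic publish step is the essence of how transactional guarantees land on cheap object storage.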

What to use data lakehouses for: Key features

Because it has the capabilities of both a data lake and a data warehouse, a data lakehouse can serve several kinds of projects, including business intelligence (BI), data science, machine learning (ML), AI and SQL analytics. The following features, inherited from data lakes and data warehouses, make that possible.

  • Support for every form of data, in any file format
  • Concurrent data reading and writing
  • Built on open-source technologies
  • Cost-effective storage
  • Flexible and scalable
  • Support for real-time data streaming
  • Data governance and auditing capabilities
  • Accommodates multiple workloads
  • Analytics-ready, supporting open data standards such as AVRO, ORC and Parquet
  • Optimized access to ML and data science tools

How did Data Lakehouses emerge?

Data warehouses emerged in the 1980s as solutions for storing and managing structured data from various sources. They were primarily designed to support data analytics and BI with efficient querying capabilities. Yet, data warehouses couldn't support rapidly evolving unstructured and semi-structured data like pictures, videos or audio recordings. Further, they required data cleaning and transformation to accommodate such data types — this was time-consuming and expensive. 

Data lakes emerged in the early 2010s as a solution to the limitations of data warehouses. They provided a low-cost, scalable option for analytics across various data types and formats. Nonetheless, data lakes had limitations of their own, such as:

  • Low data quality
  • Challenges in data governance

As a result, organizations often maintained both systems side by side, linking them together to work around the limitations of each. That setup commonly led to data duplication, high maintenance costs, and security challenges.

Data lakehouses emerged as a better solution to address those challenges. They combine the best features of data lakes and data warehouses. The best part is clear: you no longer have to maintain multiple systems for multiple workloads.  

A comparison: Data warehouses vs. data lakes vs. data lakehouses

The following table summarizes how data warehouses and data lakes compare with data lakehouses. It gives you an idea of how data lakehouses combine their features to form a unified solution.



|                      | Data Warehouse | Data Lake | Data Lakehouse |
|----------------------|----------------|-----------|----------------|
| Supported data types | Structured data | Structured, semi-structured, unstructured (raw) and textual data | Structured, semi-structured, unstructured (raw) and textual data |
| Data formats | Closed, proprietary formats | Open formats such as Parquet, ORC and AVRO | Open and standardized formats such as Parquet, ORC and AVRO |
| Data governance | Simple and well-defined | Poor governance capabilities | Complex but well-defined |
| Data access | SQL only; direct file access is not supported | Direct file access via open APIs | Direct file access via open APIs, with no vendor lock-in |
| Cost | High, due to proprietary technologies | Low, due to open-source technologies | Low-cost data storage |
| Scalability | Limited, and scaling becomes expensive | Highly scalable and cost-effective | Highly scalable |
| Use cases | Business intelligence and reporting | Data analytics, data science, ML and AI | All use cases of data warehouses and data lakes |

Components of a Data Lakehouse: The 5 layers

The data lakehouse has a layered architecture with five layers.

  1. Ingestion layer
  2. Storage layer
  3. Metadata layer
  4. API layer
  5. Consumption layer

Data ingestion layer

The data ingestion layer is the bottom layer of a data lakehouse, and it’s responsible for:

  • Retrieving data from different external and internal data sources.
  • Ingesting that data into the storage layer above it.

Data sources include relational and NoSQL databases, social media platforms, websites and other organization-specific applications that generate data. The ingestion layer also provides data streaming capabilities for real-time processing of streaming sources like IoT sensors.
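As an illustration, a minimal ingestion routine might land each batch of raw records, untouched, under a per-source, per-date directory in the storage layer. Everything here (the source name, directory layout and JSON-lines file format) is a simplifying assumption for the sketch, not a prescribed convention:

```python
import json
import pathlib
from datetime import datetime, timezone

def ingest(records, source: str, landing_dir: str) -> pathlib.Path:
    """Land a batch of raw records, untransformed, in the storage layer.
    Files are grouped by source and ingestion date so downstream layers
    (and any tool with file access) can find them directly."""
    date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    out_dir = pathlib.Path(landing_dir) / source / f"ingest_date={date}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "batch.jsonl"
    with out_file.open("a") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")  # raw, schema-on-read
    return out_file
```

The point of the raw landing step is schema-on-read: no cleaning or modeling happens at ingestion time, so the same data can later serve BI, ML or ad-hoc analysis.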

Data storage layer

The second layer of a typical data lakehouse is the storage layer. It consists of low-cost storage such as Amazon S3 or HDFS. Data can be kept in its raw form, without any transformation, allowing client tools to access it directly. Components in the consumption layer and the various APIs work against this same data.
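Because the storage layer holds ordinary files, client tools can read them directly, with no query engine in between. A sketch, assuming raw records were landed as JSON-lines files under per-source directories (an illustrative layout, not a standard one):

```python
import json
import pathlib

def read_direct(storage_dir: str, source: str):
    """Read raw data straight from the storage layer's files.
    No query engine involved: the same files serve BI tools,
    ML jobs and ad-hoc scripts alike."""
    rows = []
    for path in sorted(pathlib.Path(storage_dir, source).rglob("*.jsonl")):
        with path.open() as f:
            rows.extend(json.loads(line) for line in f)
    return rows
```

This direct access is what distinguishes the lakehouse storage layer from a warehouse, where data is reachable only through the vendor's query interface.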

Metadata layer

The third layer is the metadata layer, which stores metadata: information about every data object in the storage layer. It also provides data management features like ACID transactions, caching, indexing, data versioning and cloning.

The metadata layer can be seen as a unified catalog of metadata. This layer enables data governance, auditing and schema management functionalities. 
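The catalog-of-snapshots idea behind the metadata layer can be sketched in a few lines. This toy catalog (an in-memory stand-in, not a real table format) tracks which data files make up each table at each version, which is also the mechanism behind data versioning and "time travel":

```python
class MetadataCatalog:
    """Toy unified catalog: for each table it keeps a list of
    versioned snapshots, where a snapshot is simply the set of
    data files that make up the table at that version. Real
    lakehouse formats persist this log durably; an in-memory
    dict stands in for it here."""

    def __init__(self):
        self._tables = {}  # table name -> list of snapshots

    def commit(self, table, files):
        """Record a new version of the table; returns the version number."""
        snapshots = self._tables.setdefault(table, [])
        snapshots.append(list(files))
        return len(snapshots) - 1

    def snapshot(self, table, version=None):
        """version=None returns the latest snapshot; an older
        version number gives 'time travel' to a past state."""
        snapshots = self._tables[table]
        return snapshots[-1] if version is None else snapshots[version]
```

Because old snapshots are never overwritten, auditing and reproducing past query results fall out of the same structure.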

API layer

The API layer hosts different types of APIs for data analytics and related data processing activities. It allows machine learning libraries like TensorFlow and MLlib to read the underlying data directly. DataFrame APIs enable query optimizations, and metadata APIs help identify the data a workload requires.
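One concrete way a metadata API enables optimization is file pruning: instead of scanning everything in storage, a DataFrame engine first asks the metadata which files could possibly match the query. A sketch, with the catalog modeled as a plain dict of per-file partition information (the field names are illustrative assumptions):

```python
def plan_scan(catalog: dict, table: str, partition_filter=None):
    """Metadata API sketch: given per-file partition info recorded
    in a catalog, return only the files a query actually needs.
    Skipping irrelevant files this way is one of the optimizations
    DataFrame APIs gain by consulting metadata rather than listing
    raw storage."""
    files = catalog[table]  # list of {"path": ..., "partition": {...}}
    if partition_filter is None:
        return [f["path"] for f in files]
    return [f["path"] for f in files
            if all(f["partition"].get(k) == v
                   for k, v in partition_filter.items())]
```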

Data consumption layer

At the top of the layered architecture sits the data consumption layer, which consumes data from the storage layer and accesses the metadata. This layer hosts the tools for data science, ML and BI, such as Power BI and Tableau, that organizations use to create and run analytics jobs.
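For instance, a consumption-layer job can aggregate raw records straight from the storage layer's files, much as a BI or analytics tool would. The field names and JSON-lines layout below are assumptions made for the sketch:

```python
import json
import pathlib
from collections import defaultdict

def revenue_by_region(storage_dir: str):
    """Toy consumption-layer job: aggregate raw order records
    directly from the storage layer's files (hypothetical
    'region' and 'amount' fields)."""
    totals = defaultdict(float)
    for path in pathlib.Path(storage_dir).rglob("*.jsonl"):
        with path.open() as f:
            for line in f:
                order = json.loads(line)
                totals[order["region"]] += order["amount"]
    return dict(totals)
```

A real deployment would push this aggregation down to a SQL or DataFrame engine, but the data it reads is the same set of files.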

The advantages of Data Lakehouses

Today, you can expect many benefits from using a data lakehouse. Chief among them: you can finally harness the power of all your various data sources.

Easy data governance & security

Data lakehouses offer a management interface for easily controlling access to data storage and managing compliance and data quality. They provide fine-grained access control that can be applied to rows, columns and views, as well as attribute-based access control. They also let users set constraints on data quality, data versioning and data monitoring through an interface, much as a database administrator would.
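A rough sketch of what row- and column-level, attribute-based access control looks like in practice (the `region`, `department` and `salary` fields are hypothetical, and a real policy engine would evaluate declarative rules rather than hard-coded logic):

```python
def apply_policy(rows, user):
    """Sketch of fine-grained, attribute-based access control:
    keep only rows in the user's region (row-level security) and
    drop the salary column unless the user is in HR
    (column-level security)."""
    visible = [dict(r) for r in rows if r["region"] == user["region"]]
    if user.get("department") != "hr":
        for r in visible:
            r.pop("salary", None)
    return visible
```

The key property is that the policy is enforced centrally, at the data layer, rather than re-implemented in every consuming tool.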

Reduced data redundancy

A data lakehouse consists of a single, unified data store that can accommodate various data types and cater to various use cases. This single storage solution helps reduce the data duplication that arises when the same data is kept in several separate systems.

Cost-effective solution

Data lakehouses use low-cost storage solutions and reduce the costs of handling multiple databases. They are built on inexpensive and flexible storage technologies such as:

  • Cloud storage, which scales on demand and charges per use.
  • Hadoop Distributed File System (HDFS), which can store data across multiple servers in a cluster.

In addition, data lakehouses reduce maintenance costs because they do not require complex ETL processes to prepare data for analytics and machine learning workloads.

Support multiple workloads using a single platform

The combination of data lakes and data warehouses enables organizations to run multiple workloads. Several users — data developers, business analysts, data scientists — can use the analytics tool of their choice.

Data lakehouses provide direct access to some of the most widely used business intelligence tools, such as Tableau and Power BI. They also support open data formats and machine learning libraries in Python and R, allowing machine learning engineers and data scientists to fully leverage the power of big data.

Supports innovation, customer interactions

Data lakehouses enable R&D teams to research and test innovative solutions to customer issues because they support the integration of multiple data sources and workloads. Teams can focus on innovation rather than on the time-consuming data transformation and pre-processing needed to make data analytics-ready.

Drawbacks of data lakehouses

One main challenge of adopting a data lakehouse is its steep learning curve. An organization needs to learn and become familiar with new technologies and tools to complete the move to a data lakehouse, which can mean extra time, effort and cost to train staff on using and operating it.

Another challenge is that a data lakehouse stores raw data, which can become unusable without proper security and cataloging; without these mechanisms, the data cannot be fully trusted.

Meet you at the (data) lakehouse

The data lakehouse emerged as a solution to the limitations of data warehouses and data lakes. It is a low-cost architecture that can store data in various formats and support a range of data analytics workloads. It offers centralized, unified data storage that is flexible and efficient, along with strong data governance and security capabilities.

There are five layers in a data lakehouse: data ingestion, storage, metadata, API and data consumption. As discussed in this article, data lakehouses can benefit organizations in many ways, but they also come with limitations, such as the learning curve, the complexity of migrating to this architecture and the challenges of storing raw data. Weigh these factors when considering a data lakehouse for your organization.


This posting does not necessarily represent Splunk's position, strategies or opinion.

Shanika Wickramasinghe is a software engineer by profession and a graduate in Information Technology. Her specialties are Web and Mobile Development. Shanika considers writing the best medium to learn and share her knowledge. She is passionate about everything she does, loves to travel and enjoys nature whenever she takes a break from her busy work schedule. She also writes for her Medium blog sometimes. You can connect with her on LinkedIn.