DATA INSIDER

What Is a Data Lake?

A data lake is a repository for large amounts of raw data stored in its original format. The term was coined by James Dixon, then chief technology officer at Pentaho.

With the rapid growth in the amount of big data generated, ingested and used by organizations every day, data lakes provide the ability to store data as rapidly as it’s received. Data scientists who use data lakes rely on data management tools to make the data sets usable on demand for data discovery, extraction, business intelligence, cleansing and integration, with that work done at the time of search.

In the following article, we’ll discuss the components of a data lake, as well as explain how data lakes are used, what their advantages and potential drawbacks are, and what the future of data lakes is in enterprise data storage and management.

James Dixon, former Pentaho CTO, first coined the term “data lake.”

How is data stored in a data lake?

A data lake is a repository of terabytes or petabytes of data in its raw format without being sorted or indexed. The data can originate from a variety of data sources: IoT and sensor data, a simple file, or a binary large object (BLOB) such as a video, audio, image or multimedia file. Any manipulation of the data to put it into a pipeline and make it usable is done when the data is extracted from the data lake.
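To make the idea of storing raw data and shaping it only on extraction more concrete, here is a minimal sketch. It assumes a plain Python dictionary as a stand-in for real object storage, and the names `lake`, `ingest` and `extract_json_lines` are hypothetical, chosen only for illustration.

```python
import json

# Hypothetical in-memory stand-in for an object store: key -> raw bytes.
lake = {}

def ingest(key, payload):
    """Store the payload exactly as received, with no parsing or validation."""
    lake[key] = payload

def extract_json_lines(key):
    """Apply structure only at read time: parse one JSON object per line."""
    text = lake[key].decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Sensor readings and a binary object are stored the same way, untouched.
ingest("sensors/2024-01-01.jsonl", b'{"id": 1, "temp": 21.5}\n{"id": 2, "temp": 19.0}\n')
ingest("media/clip.bin", bytes([0, 255, 16]))

readings = extract_json_lines("sensors/2024-01-01.jsonl")
```

Nothing is sorted, indexed or validated when `ingest` runs. The JSON structure only appears when `extract_json_lines` is called, which mirrors the way manipulation is deferred until data leaves the lake.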

What is data lake architecture? Is a data lake composed of structured or unstructured data?

Data lakes are built on simple object storage so that they can house many different formats and types of data. Organizations traditionally built data lakes on-premises, and many still do. However, many companies are also moving their data lakes to remote servers, using cloud storage solutions from major providers such as AWS and Microsoft. Others use a distributed file system such as the Hadoop Distributed File System (HDFS), which is part of Apache Hadoop.

Data stored in a data lake can be structured, semi-structured or unstructured. Even when the data is structured, any metadata or other information attached to it is generally not usable until the data has been processed. Data in a data lake needs to be cleansed, tagged and structured before it can be applied in use cases. These functions are performed when the data is extracted from the data lake to be made ready for use.
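As a rough illustration of cleansing, tagging and structuring at extraction time, consider the sketch below. The function name `prepare`, the field names and the `crm_export` source label are all assumptions made for the example, not part of any particular product.

```python
def prepare(raw_rows, source):
    """Cleanse, tag and structure raw rows at extraction time."""
    prepared = []
    for row in raw_rows:
        name = (row.get("name") or "").strip()
        if not name:
            continue  # cleanse: drop rows that have no usable name
        prepared.append({
            "name": name.title(),                    # structure: consistent casing
            "amount": float(row.get("amount", 0)),   # structure: numeric type
            "source": source,                        # tag: where the row came from
        })
    return prepared

# Hypothetical raw rows, as they might sit in the lake.
raw = [{"name": "  alice ", "amount": "10.5"}, {"name": None}, {"name": "BOB"}]
clean = prepare(raw, source="crm_export")
```

The raw rows stay in the lake unchanged. Only the extracted copy is cleaned up, so a different use case could apply different rules to the same source data later.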

How do you develop a data lake platform?

In and of itself, a data lake is a collection of data stored in its native format on a server, either on-premises or in the cloud. While there doesn’t seem to be a widely accepted definition of “data lake platform,” ancillary services are required to manage the servers, provide security and storage services and make the data available for extraction and use. In other words, a data lake could be the data itself, and the data lake platform the servers, other equipment, hardware and software used to operate and maintain it.

Most resources that describe best practices for developing a data lake are describing best practices for any major technology undertaking in a large organization:

1. Gather relevant stakeholders and decide on your goals.

2. Develop an action plan and assign ownership of the project.

3. Evaluate the methods available.

4. Select the best server architecture for your needs.

5. Pick a vendor.

6. Ensure your organization’s data governance, security and privacy standards are maintained.

What is the difference between a data warehouse and a data lake?

A data lake can contain a mix of structured, semi-structured and unstructured data, while a data warehouse typically contains only structured data. In most data warehouses, the data has been ingested through an extract, transform, load (ETL) process, in which it is organized (staged), cleansed, transformed, catalogued and made available for use.
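For contrast with a lake, a warehouse-style ETL step might look something like the following sketch. It uses Python's built-in `sqlite3` module as a stand-in for a real warehouse, and the `sales` table, the `etl` function and the sample lines are hypothetical.

```python
import sqlite3

def etl(raw_lines, conn):
    """Extract CSV-like lines, transform them, and load them into a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    rows = []
    for line in raw_lines:                                  # extract
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2:
            continue                                        # discard malformed input
        region, amount = parts
        try:
            rows.append((region.upper(), float(amount)))    # transform
        except ValueError:
            continue                                        # discard non-numeric amounts
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)   # load
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = etl(["emea, 100.0", "apac,250", "broken line", "emea, n/a"], conn)
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

Here the cleanup happens before anything is stored, so the two malformed lines never reach the table. A data lake would have kept all four lines as they arrived.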

Data lakes contain a mix of structured, semi-structured and unstructured data, stored without being cleansed, tagged or manipulated.

What is the difference between a database and a data lake?

A database (including a database management system) is used for storing, searching and reporting on data. Unlike data lakes, traditional relational databases require a schema to be defined before data is written, and they are poorly suited to semi-structured or unstructured data. A data lake, on the other hand, can store raw data from all sources, and structure is applied to the data only when it’s retrieved. On its own, a data lake doesn’t offer the same reporting capabilities you would have with a database.
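The schema difference can be sketched in a few lines. This example assumes SQLite (through Python's `sqlite3` module) as the database, and the `users` table, the `try_insert` helper and the sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "name TEXT NOT NULL, "
    "age INTEGER CHECK (typeof(age) = 'integer'))"
)

def try_insert(name, age):
    """Return True if the database accepts the row, False if the schema rejects it."""
    try:
        conn.execute("INSERT INTO users VALUES (?, ?)", (name, age))
        return True
    except sqlite3.IntegrityError:
        return False

accepted = try_insert("Ada", 36)
rejected = try_insert(None, 36)            # violates NOT NULL on name
mistyped = try_insert("Bob", "unknown")    # violates the CHECK constraint on age

# A lake-style store, by contrast, keeps whatever bytes arrive.
lake = {"users/raw.txt": b"Bob,unknown\n,36\n"}
```

The database refuses the two rows that do not fit its schema, while the dictionary standing in for a lake holds the same problematic records without complaint. Sorting them out is left to whoever extracts the data.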

What is the difference between a data lake and the cloud?

A data lake is a collection of data and can be hosted on a server based on an organization’s premises or in a cloud-based storage system. The cloud, or cloud services, refers to the method of storing data and applications on remote servers. A data lake can be (and often is) stored on a cloud-based server, in which case it is also known as a cloud data lake.

How are SQL and NoSQL related to a data lake?

Structured Query Language (SQL) is a language used for querying and managing relational databases. NoSQL is not a language but a category of non-relational databases (the name is read as “non-SQL” or “not only SQL”). Because data lakes store raw data, including unstructured data, neither SQL nor NoSQL is applied to the data as it sits in the lake. When the data is extracted, depending on the organization’s data network, SQL or a NoSQL system may be used to prepare the data for use in a database.

How is a data lake used in the enterprise?

Companies are constantly being told that data is their most valuable asset. Machine learning and other advanced analytics provide a self-service option for administrators to glean insights from an organization’s historical data and use it to predict future outcomes, whether that means protecting the company from external threats to their networks, finding ways to streamline and optimize workloads, or keeping networks up and running. Historical sales and marketing data can be used to predict future performance, and as more data becomes available — along with more sophisticated machine learning and big data analytics tools — those predictions become more accurate. In order to take advantage of machine learning and predictive analytics, organizations need to be able to store and access as much data as possible.

Data lakes, such as an Azure data lake, provide the ideal environment for a growing organization to store data that it knows may be useful, without the delay, effort and expense of cleansing and organizing data in advance. Because of their simplicity, data lakes are also much more easily scalable than structured data storage. Data lakes are one of the most important tools enterprise companies have to get the most value out of their data.

What are the benefits of using a data lake?

The primary benefits of a data lake are speed, scalability and efficiency. With the ever-growing volumes of traditional data created, ingested and stored by a modern organization, there is significant value in a low-cost means of storing data quickly and letting any authorized person access it rapidly and on demand.

Data lakes are data repositories or vast stores of information that may not contain metadata, but at the same time allow for on-demand search, including data discovery, data processing, ingestion and extraction, data integration and cleansing.

What's more, data lakes can help break down data silos that have typically impeded organizations from realizing the value of their data. Imagine if you were able to take any item you use as part of your life — from your insurance policies to your house keys to your passport to your gym bag — and drop it into a box. Now imagine that at the moment you needed a particular item, you could put your hand back into the box and immediately retrieve it. Data lakes work in much the same way, thanks to on-demand search capabilities made possible by machine learning.

What are the drawbacks of using a data lake?

A data lake has no inherent disadvantages, because a data lake is just an accumulation of data waiting to be used. That being said, data lakes require support, often from professionals with expertise in data science, to maintain them and make the data useful. In other words, if you compare a data lake to a structured, relational database, the data lake may seem disorganized, although that isn’t necessarily a fair or accurate comparison.

When a data lake is not managed properly, it is sometimes referred to as a “data swamp.” There are no drawbacks to a well-managed data lake, but if it is allowed to become a data swamp, data quality deteriorates along with the data’s usefulness and value to the organization, latency increases, and the lake becomes a liability for the company. At some point, a data swamp has the same drawbacks and challenges, as well as the same opportunity cost, as dark data (either stored or real-time data that a company possesses but cannot find, identify, optimize or use).
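One possible way to picture the kind of management that keeps a lake from turning into a swamp is a simple catalog of what has been stored. The sketch below is only an assumption about what such bookkeeping could look like. The `catalog` structure, the object keys and the `uncatalogued` helper are all hypothetical.

```python
from datetime import date

# Hypothetical catalog: object key -> descriptive metadata.
catalog = {
    "sales/2023.csv": {
        "owner": "finance",
        "description": "Closed deals",
        "ingested": date(2023, 12, 31),
    },
}

# Hypothetical listing of everything currently stored in the lake.
lake_keys = ["sales/2023.csv", "tmp/export_final_v2.bin", "logs/app.log"]

def uncatalogued(keys, catalog):
    """Return keys that nobody has described: candidates for dark data."""
    return sorted(k for k in keys if k not in catalog)

dark = uncatalogued(lake_keys, catalog)
```

Objects that show up in `dark` are the ones an organization possesses but may not be able to find or identify later, which is the condition the paragraph above describes.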

What is the future of data lakes?

Data lakes require support by analysts who help the organization realize the data’s potential value. 

The future of data lakes mirrors the future of data itself. As the amount of data generated, needed and used by organizations continues to grow at increasing rates, the need to store large amounts of data will continue to grow just as quickly. Unlike databases or data warehouses, data lakes allow organizations to quickly and efficiently store data that they know they need, either in the present or in the future.

The Bottom Line: Data lakes are key in the future of enterprise data storage

With the growth of machine learning, data has become more available and usable, while data extraction from data lakes has become significantly faster and easier. Machine learning and data science can make dark data a thing of the past; the more data an organization has, the more information its data analytics systems have to learn from. Data is one of an organization’s most valuable assets. And data lakes give organizations the ability to capture, store and use those assets in the most efficient way.