Many business organizations begin their data analytics journey with great expectations of discovering hidden insights from data. The concept of unified storage (data lake technologies in the cloud) has gained momentum in recent years, especially with the rapid growth in options for cost-effective cloud-based storage services.
Big data is readily available. In fact, 2.5 quintillion (2.5 x 10^18, or 2.5 billion billion) bytes are generated every day! The challenge facing these organizations centers on the nature of this data. Big data is generated in three forms: structured, unstructured and semi-structured. Data needs to be preprocessed to the specifications of the analytics tooling before it is ready for consumption.
In this article, we’ll look at what these data structures mean for business analytics.
The limits of structured data
Why not just focus on structured data that complies with the required tooling specifications? Or use a traditional data warehouse system that employs a schema-on-write method to preprocess all data prior to storage as required?
The answer is baked into data lake technology: data lakes are meant to accelerate the analytics process, turning away no data. Data lakes load all data from source systems directly at the leaf level. This gives analytics teams the freedom to access a growing pool of real-time data streams, processing only the portion of data that is required by the tooling. (In most cases, that portion is well under 10%.)
Unlike the rigid schema-based model of a data warehouse system, a data lake allows for scalable analytics operations such as integrating multiple new sources of heterogeneous and real-time data streams, and using tools subject to a variety of data structure specifications.
Structured and unstructured data assets scale differently, and there may be no consistent way to model heterogeneous data assets with a single schema framework.
(Read our data lake vs. data warehouse explainer.)
Three structures of big data
Let’s explore what this means for your data analytics journey:
How structured data works
Structured data follows a fixed, predefined format, usually in a quantitative and organized form. A good example is a database of customer names, addresses, phone numbers, email IDs and billing information.
The pros of structured data are clear: this format can be consumed directly by an analytics tool and may not require any additional reformatting. However, this data can only be used for its intended purpose with the tools that require its schema formatting.
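As a minimal sketch of what a fixed, predefined format means in practice, the example below uses Python's built-in sqlite3 module. The customers table, its columns and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        name            TEXT NOT NULL,
        email           TEXT NOT NULL,
        phone           TEXT,
        billing_balance REAL NOT NULL
    )
    """
)

rows = [
    ("Ada Example", "ada@example.com", "555-0100", 120.50),
    ("Bo Sample", "bo@example.com", None, 0.0),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# Every row follows the same schema, so a tool can query it directly.
owing = conn.execute(
    "SELECT name, billing_balance FROM customers WHERE billing_balance > 0"
).fetchall()
print(owing)

# A record that violates the schema (missing name) is rejected on write.
try:
    conn.execute("INSERT INTO customers VALUES (NULL, 'x@example.com', NULL, 1.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The same rigidity that makes the data directly queryable is what rejects anything that does not fit the schema.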
Unstructured data requires more work
Unstructured data is usually qualitative data that needs preprocessing before it can be made available to analytics tools for consumption. Examples include:
- Raw IoT data and network logs
- Audio and video data
- Social media posts
- Data generated at the machine level
According to recent research, 80% of all data will be unstructured by the year 2025. In its native format, unstructured data can be stored in a unified storage repository, a data lake. It accumulates and scales rapidly, since most real-time data streams are generated in unstructured format. To consume unstructured data, you have to use specialized tools and rely on expertise to give it the required structure schema.
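To illustrate the kind of preprocessing this involves, here is a small sketch that gives raw log lines a structure using a regular expression. The log format, field names and sample lines are hypothetical; real machine data usually needs more robust parsing.

```python
import re

# Hypothetical raw log lines; the format is an assumption for illustration.
raw_logs = [
    "2024-01-15 10:32:01 ERROR sensor-7 temperature reading out of range: 104.2",
    "2024-01-15 10:32:05 INFO sensor-2 heartbeat ok",
    "corrupted ### line",
]

LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<device>\S+) (?P<message>.+)$"
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


parsed = [parse_line(line) for line in raw_logs]
records = [p for p in parsed if p is not None]  # drop lines that could not be parsed

# Once the fields exist, ordinary filtering works.
error_devices = [r["device"] for r in records if r["level"] == "ERROR"]
print(error_devices)
```

Lines that do not match the assumed pattern are dropped here; a production pipeline would more likely route them somewhere for inspection.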
(Learn about normalizing data.)
Where semi-structured data fits
Don’t be confused: semi-structured data is not “in between” the first two data types. Instead, it is a form of structured data that does not conform to the fixed schema of a database.
Data entities that belong to the same class are instead described by metadata tags or other semantic markers. These give some structure to the data assets and differentiate them completely from unstructured data. As an example:
- Semi-structured data could be a tab-delimited file containing data on marketing leads.
- Structured data could be a CRM database containing all customer details.
- Unstructured data could be a social media post with comments of users expressing varied interest in the product.
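The tab-delimited example above can be sketched with Python's standard csv module. The file content and column names are assumptions; the point is that the header row labels each field, while nothing enforces types or completeness.

```python
import csv
import io

# Hypothetical tab-delimited export of marketing leads (assumed content).
leads_tsv = (
    "name\tcompany\tsource\tscore\n"
    "Ada Example\tAcme\twebinar\t72\n"
    "Bo Sample\t\tnewsletter\t\n"
)

reader = csv.DictReader(io.StringIO(leads_tsv), delimiter="\t")
leads = list(reader)


def score_of(lead):
    """Interpret the score field; empty values become None."""
    return int(lead["score"]) if lead["score"] else None


# The consumer, not the file, decides how to handle missing or textual values.
scores = {lead["name"]: score_of(lead) for lead in leads}
print(scores)
```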
Does your data platform need structured data?
If your data pipeline is built with a data lake, you can take advantage of the flat storage architecture to source data in all formats. A pre-built schema is not required, and the data can later be queried by giving it some structure as required (schema-on-read) or by relying on the fixed order in which the data was acquired. Metadata tags are commonly used during the querying process, which means that a solid metadata management strategy must be in place.
The process of extracting, loading and transforming data (ELT) should be automated and simplified to meet the scalability needs of the data platform. Since the transformation step only takes place when an analytics application queries the data, the data lake can handle both write-heavy and read-heavy workloads. This means that the data platform can be flexible, scalable and cost-effective, given the availability of low-cost cloud storage options.
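A rough sketch of schema-on-read follows, with a plain Python list standing in for the data lake. The tag names, payloads and the read_with_schema helper are all hypothetical; they only illustrate storing raw data untouched and applying a schema when a query runs.

```python
import json

# Raw objects stored as-is, each with metadata tags (all names are assumptions).
lake = [
    {"tags": {"source": "crm", "format": "json"},
     "raw": '{"customer": "Ada", "amount": "120.50"}'},
    {"tags": {"source": "web", "format": "text"},
     "raw": "user Bo clicked pricing page"},
    {"tags": {"source": "crm", "format": "json"},
     "raw": '{"customer": "Bo", "amount": "15"}'},
]


def read_with_schema(store, source, schema):
    """Apply a schema at query time to the JSON objects whose tags match."""
    for obj in store:
        tags = obj["tags"]
        if tags["source"] != source or tags["format"] != "json":
            continue  # untouched data stays in the lake for other consumers
        payload = json.loads(obj["raw"])
        yield {name: cast(payload[name]) for name, cast in schema.items()}


# The schema belongs to the query, not to the storage layer.
order_schema = {"customer": str, "amount": float}
orders = list(read_with_schema(lake, "crm", order_schema))
total = sum(order["amount"] for order in orders)
print(orders, total)
```

Note how the metadata tags do the selection work: without reliable tags, the query would have no way to tell which raw objects the schema applies to.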
This pipeline workflow incentivizes organizations to leverage data of all structures and formats, while avoiding the resource-intensive schema-on-write process for real-time unstructured data streams that can quickly grow in volume.
Drain the data swamp
Without an adequate data management strategy in place, your data lake can quickly turn into a data swamp.
An effective data management strategy is focused on security, auditability and transparency of structured, unstructured and semi-structured data assets. The data should be governed and classified to securely manage access between relevant data consumers and data producers, enabling self-service functionality and offering the flexibility to integrate multiple third-party analytics tools, each with its own set of schema and structure requirements.
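One way to picture classification, access control and auditability together is a small catalog sketch like the one below. The CatalogEntry fields, classification labels, roles and request_access helper are hypothetical and not taken from any particular governance product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    structure: str        # "structured", "semi-structured" or "unstructured"
    classification: str   # assumed labels: "public" or "restricted"
    allowed_roles: set = field(default_factory=set)


audit_log = []


def request_access(entry, user, role):
    """Grant access to public assets or to listed roles, and log every decision."""
    granted = entry.classification == "public" or role in entry.allowed_roles
    audit_log.append({"user": user, "asset": entry.name, "granted": granted})
    return granted


catalog = {
    "crm_customers": CatalogEntry("crm_customers", "structured", "restricted", {"finance"}),
    "social_posts": CatalogEntry("social_posts", "unstructured", "public"),
}

denied = request_access(catalog["crm_customers"], "ada", "marketing")
allowed = request_access(catalog["crm_customers"], "bo", "finance")
public = request_access(catalog["social_posts"], "ada", "marketing")
print(denied, allowed, public, len(audit_log))
```

Every request, granted or not, leaves an audit record, which is what makes access decisions transparent after the fact.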
This posting does not necessarily represent Splunk's position, strategies or opinion.