
Managing data effectively is a critical need for any organization, but how is it done properly to ensure a cost-effective, reliable solution? As your data ingestion grows, you should consider a few key approaches to help control costs and ensure data integrity. These approaches include:
- Understanding the overall data lifecycle.
- Knowing how to maximize performance when searching data while minimizing the overall cost.
What is the data lifecycle?
The data lifecycle is a continuous process made up of multiple phases that represent how data is created, fed into the system and pushed out of the system.
New data will fall into the first phase as it’s created but before it’s indexed. As data ages, it will traverse each phase until it is ultimately destroyed. Below are the different phases, along with a short description of each.
- Create. An action is executed on a remote server, and the server logs that activity. This creates a new time-series event within a log file.
- Collect. An agent may be sitting on that remote server, watching that path and collecting the data that will be indexed by a logging platform.
- Use. Data will be correlated and transformed so that it can be used cohesively to identify insights and help solve problems efficiently.
- Archive. Older time-series data that falls outside of searchable retention periods will be archived to slower, cheaper storage for possible use later.
- Destroy. The oldest time-series data that falls outside of non-searchable retention periods will be destroyed and lost forever.
(See what time-series forecasting can do.)
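The phases above can be expressed as a simple ordered progression. Here's a minimal sketch (a hypothetical illustration, not tied to any particular logging platform):

```python
# The lifecycle phases, in the order data traverses them as it ages.
PHASES = ["create", "collect", "use", "archive", "destroy"]

def next_phase(current: str):
    """Return the phase data moves into next, or None once destroyed."""
    i = PHASES.index(current)
    return PHASES[i + 1] if i + 1 < len(PHASES) else None
```

For example, `next_phase("use")` returns `"archive"`, and `next_phase("destroy")` returns `None`, reflecting that destroyed data is lost forever.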
Challenges with data management
As data volume and velocity grow, it’s important to adopt a data management strategy to help manage costs and appropriately scale the environment. While there's no single strategy that will lead to success, there are certain things to consider before developing your strategy. These include:
- Storage costs
- Search speeds (performance)
- Data retention
- Issues with access to data collections
- Restricting access to data
- Searchable vs. non-searchable data
Some of these may not be problems when you first start out on your data journey, but as you begin to scale your data platform, the system will be less forgiving and problems will begin to increase exponentially. Some companies like to solve these problems by throwing money at expensive storage and classifying all data as having the same value. More creative companies take a lean approach to solving this issue by routing certain types of data to different storage tiers or putting low-value, infrequently accessed data into non-searchable storage systems to minimize overall costs.
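That tiered-routing approach can be sketched as a simple rule. The tier names and source types below are hypothetical; a real deployment would route based on its own data classifications:

```python
# Hypothetical routing rule: keep frequently searched data on fast searchable
# storage, and send low-value, rarely searched data to cheap non-searchable
# storage to minimize overall costs.
LOW_VALUE_TYPES = {"debug", "verbose"}  # illustrative classifications

def route(source_type: str, access_frequency: str) -> str:
    if access_frequency == "frequent":
        return "searchable-fast"         # most expensive tier
    if source_type in LOW_VALUE_TYPES:
        return "non-searchable-archive"  # cheapest tier
    return "searchable-slow"             # middle ground
```

The point isn't the specific rule; it's that routing decisions happen at ingest time, so a small amount of classification up front can drive large storage savings later.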
While cost reduction is very important when it comes to data management, data integrity is equally important, since you need to ensure that the data is available when you need it.
(Compare data lakes and data warehouses, common big data storage options.)
Comparing data management vs data governance
Let’s now turn to two overlapping practices: management and governance.
What is data management?
Data management is the process of collecting, routing, securing and storing data so that it can be used at a later time to derive insights or to reconstruct the actions that occurred across many systems at a particular time. All components of data management are equally important, because if any one piece in the process fails, the overall data management process could be unusable, and you would lose access to valuable insights.
What is data governance?
Data governance describes the process of managing certain data attributes, including:
- Access
- Format
- Usability
- Availability
- Data standards
When building a data management platform, you should consider data governance during the design phase and create a strategy for enforcing it.
Data management vs governance
Data governance and data management are similar in that they both establish guidelines for building, storing and using data efficiently so that you can ensure a continuous pipeline of time-series data that is reliable, trustworthy and cost effective.
The data management lifecycle describes how data comes into the system, how it is stored and used, and how it moves out of the system over time. Data governance, on the other hand, sets the rules for how that data may be accessed and used once it’s in the system, to ensure maximum performance and usability.
Best practices for managing data
You may have heard the phrase “data is the new oil,” but what does that really mean? As with oil, data is rather useless in its purest form, and it provides little value without some form of refinement.
Use data pipelines
Data must be routed through pipelines and refined into a more usable form, then stored in an efficient way to minimize costs and ensure usability when needed. Once used, data starts losing value faster than a new car that’s driven off the lot. It will eventually age out and fall out of the queue. After that, it will no longer be used.
(Read our data pipeline explainer.)
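As a toy example of that refinement step, here's a sketch that parses a raw log line into a structured event (the log format and field names are assumptions for illustration):

```python
def refine(raw_line: str) -> dict:
    """Parse a raw 'timestamp level message' log line into a structured event.

    Assumes a simple space-delimited format; real pipelines handle many
    formats and apply far richer transformations.
    """
    timestamp, level, message = raw_line.split(" ", 2)
    return {"time": timestamp, "level": level, "message": message}

event = refine("2023-01-01T00:00:00Z INFO user logged in")
```

Once events are structured like this, they can be correlated and searched cohesively, which is what makes the raw "oil" usable.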
Use fast storage for frequently accessed data
Newer data is usually more valuable than older data of the same type. You want the fastest storage on your hot and warm buckets, since it’s likely that this new data will be searched before it ages into the next bucket.
Depending on storage, velocity, and retention, data can sit in the pipeline of buckets for months or even years before it rolls into the frozen bucket (which represents non-searchable archived data). The frozen bucket should have the cheapest, slowest storage, since the data will likely be around for a very long time and may or may not be searched again. Lastly, depending on retention settings, data will age out and get purged from the frozen bucket. Then, it will be gone forever.
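The aging process above can be sketched as a function of event age. The thresholds here are made-up placeholders; real values depend entirely on your platform's retention settings:

```python
from datetime import datetime, timedelta, timezone

# Illustrative retention thresholds -- actual values are set by policy.
WARM_AFTER = timedelta(days=7)
FROZEN_AFTER = timedelta(days=365)
PURGE_AFTER = timedelta(days=365 * 3)

def bucket_for(event_time: datetime, now: datetime):
    """Return the bucket an event sits in at `now`, or None once purged."""
    age = now - event_time
    if age >= PURGE_AFTER:
        return None        # aged out of the frozen bucket: gone forever
    if age >= FROZEN_AFTER:
        return "frozen"    # archived, non-searchable, cheapest storage
    if age >= WARM_AFTER:
        return "warm"
    return "hot"           # fastest storage; most likely to be searched
```

This is why storage speed should track bucket age: nearly all searches hit the hot and warm tiers, while frozen data mostly just needs to exist cheaply.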
Integrate governance with data management
Data management and governance should not be thought of as separate processes that are developed independently of each other; rather, you should integrate your governance strategy into your data management strategy when scaling. A governance strategy provides a rule book that describes how to solve common problems at scale, such as:
- Properly securing data
- Ensuring retention times
- Maximizing performance while minimizing costs
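One way to make that rule book enforceable is to express it as data and check it at access time. The datasets, roles and retention periods below are hypothetical examples:

```python
# Hypothetical governance policy expressed as data, so the same rule book
# drives both access control and retention enforcement.
POLICY = {
    "payroll":    {"allowed_roles": {"hr", "finance"}, "retention_days": 2555},
    "web_access": {"allowed_roles": {"ops", "analyst"}, "retention_days": 90},
}

def can_access(dataset: str, role: str) -> bool:
    """Allow access only if the dataset is governed and the role is listed."""
    rule = POLICY.get(dataset)
    return rule is not None and role in rule["allowed_roles"]
```

Because the policy lives in one place, securing data and enforcing retention stop being ad-hoc decisions and become part of the management pipeline itself.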
Don’t let BIG data problems leave you unprepared for future growth!
This article was written by Steve Koelpin. Steve is a seasoned IT professional with deep experience on the Splunk platform who specializes in machine learning, IT Service Intelligence and development. He's traveled the country as a Splunk professional services consultant solving big data problems for dozens of companies while enabling users to become Splunk rockstars. When Steve is not busy on the keyboard, he's spending time with his new baby and happy family.
This posting does not necessarily represent Splunk's position, strategies or opinion.