Data is increasingly becoming the most valuable asset for any organization (besides customers and employees) and is a true source of sustainable competitive advantage. To realize this potential, organizations are grappling with two fundamental challenges: first, how to separate the signal from the noise in a timely manner, and second, how to cost-effectively manage and retain this data.
The very genesis of Splunk lies in providing the tools for organizations to gain insights into the chaotic world of machine data. However, as the volume of data grows, organizations run into the second challenge: retaining this voluminous data. The distributed, horizontal scale-out model that emerged over a decade ago with Hadoop and Big Data technologies is running into challenges unforeseen at its inception.
The distributed scale-out model was a good fit for processing large volumes of data by co-locating compute and storage. However, it was not designed for the scale at which data is exploding, where the demand for storage is outpacing the demand for compute. At large scale, the current approach of adding more compute and storage in response to an increase in purely storage demand is sub-optimal and highly cost prohibitive. Hence, it's imperative to decouple storage and compute to provide a more efficient and cost-effective solution. There is an industry-wide shift to decouple compute and storage for large-scale data deployments.
"Decoupling compute and storage is proving to be useful in Big Data deployments. It provides increased resource utilization, increased flexibility, and lower costs." – Ritu Jyoti, research director, IDC's Enterprise Storage, Server, and Infrastructure Software team
Adding more nodes in response to increased storage demand points to another flaw in the distributed, horizontal scale-out model. Looking closely at data processing requirements, most searches run over a small subset of the data and very rarely over the entire dataset. Research has indicated that more than 95% of Splunk searches are over data less than 7 days old. Since the active dataset in a majority of cases is a much smaller subset of the entire dataset, the compute added in the distributed scale-out model at scale is in most cases heavily under-utilized.
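A rough back-of-the-envelope calculation makes the under-utilization concrete. The retention period below is a hypothetical assumption for illustration; only the 7-day search window comes from the figure above:

```python
# Sizing sketch: what fraction of retained data is "active"?
# Assumptions (hypothetical, for illustration only):
#   - 365 days of data retained, ingested at a constant daily rate
#   - searches overwhelmingly target the last 7 days (per the ~95% figure)

retention_days = 365        # assumed retention period
active_window_days = 7      # window covering >95% of searches

active_fraction = active_window_days / retention_days
print(f"Active data: {active_fraction:.1%} of the retained dataset")
# -> Active data: 1.9% of the retained dataset
# If compute is scaled in lockstep with total storage, roughly 98%
# of it is serving data that is rarely searched.
```

Under these assumptions, a coupled architecture provisions compute for the full 365 days of data even though searches touch about 2% of it.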
Hence, to break this dichotomy between compute and storage requirements, a model that allows storage to be scaled independently of compute is much needed. Alternative solutions such as NFS/SAN for cold volumes have often been leveraged by organizations as a means to scale older datasets independently. The challenge with this deployment model is that NFS/SAN storage is not suited for managing large-scale volumes and is hosted on a more expensive tier than direct-attached storage. The innovation in the cloud and object storage space, pioneered by hyper-scale cloud/storage providers, allows massive data volumes to be stored more efficiently and more cost-effectively.
Delivering performance at scale in this decoupled architecture requires another key capability—the ability to dynamically bring active datasets closer to the compute on demand and process the data without impacting the user search experience. This allows organizations to process data independent of data age and storage placement, while keeping all the data searchable at any point in time.
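The on-demand idea can be sketched minimally. The following is a simplified illustration—not Splunk's implementation; the class and names are hypothetical—of a compute node that fetches data buckets from a remote object store only when a search needs them, keeping a bounded, least-recently-used local cache:

```python
from collections import OrderedDict

class BucketCache:
    """Minimal LRU cache sketch (hypothetical): pull data buckets
    from a remote object store on demand, evicting the least
    recently used bucket when local capacity is exhausted."""

    def __init__(self, object_store, capacity):
        self.object_store = object_store   # maps bucket_id -> data
        self.capacity = capacity           # max buckets held locally
        self.local = OrderedDict()         # LRU order: oldest first

    def get(self, bucket_id):
        if bucket_id in self.local:
            self.local.move_to_end(bucket_id)   # mark recently used
            return self.local[bucket_id]
        data = self.object_store[bucket_id]     # simulated remote fetch
        self.local[bucket_id] = data
        if len(self.local) > self.capacity:
            self.local.popitem(last=False)      # evict LRU bucket
        return data

# Usage: a 365-bucket remote store, a node caching 7 buckets locally.
store = {f"bucket-{d}": f"events-day-{d}" for d in range(365)}
cache = BucketCache(store, capacity=7)
for day in range(358, 365):       # searches over the last 7 days
    cache.get(f"bucket-{day}")
print(len(cache.local))           # local cache holds only 7 buckets
```

All 365 buckets remain searchable—an occasional search over old data (say, `cache.get("bucket-0")`) simply fetches that bucket on demand and evicts the least recently used one—while the node's local storage stays sized for the active window rather than the full retention period.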
In the next and concluding blog in this Splunk SmartStore series, "Splunk SmartStore: Cut the Cord by Decoupling Compute and Storage," I'll share how Splunk's new SmartStore offering is disrupting the existing data management paradigm, allowing customers to harness the true potential of their data at lower cost while delivering performance at scale.