DATA INSIDER

What Is Big Data?

Big data is a concept that came to prominence in the mid-2000s in response to the massive increase in the size of datasets at the time, attributed to the growth of the internet and the rapidly declining price of data storage. While you might think of big data as terabytes of data, the term generally means more than simply "large in size." Big data differs from traditional data in that it’s almost always a combination of both structured and unstructured information, which requires new methods of processing and analysis in order to generate actionable insights that can inform strategic decision making.

Big data can be drawn from structured, unstructured or semi-structured datasets, but the real value is realized when these various data types are pulled together — in fact, its value is contingent upon both the amount and variety. Big data can come from just about anywhere, from a business’s sales and production records to public databases to social media feeds. Finding innovative ways to uncover patterns and correlations among these various data sources is the most essential function of a data scientist or big data analyst.

Big data analytics is a complex category that requires a substantial level of skill and training to master, along with comprehensive data management platforms. While tools such as Apache Hadoop, Storm and Spark are invaluable for processing massive amounts of data, finding personnel skilled in using these tools can be difficult (and costly) in a market that is hungry for the insights that big data can provide. And while many of these tools are democratizing big data efforts, they still have a long way to go before becoming wholly accessible. A key advancement for organizations with copious amounts of data is machine learning technology, which helps address this issue by enabling organizations to achieve value from their information in near real time.

 

In this article, we’ll look at the characteristics of big data, some of the most common use cases for big data, the tools essential for managing it, and best practices for starting a big data program in the enterprise.

Big Data: What It Is

What does big data mean?

Big data can mean a number of things, depending on the industry. Manufacturing businesses use big data generated by industrial internet of things (IoT) sensors, using various algorithms to predict equipment problems, determine optimal maintenance schedules, and improve performance over time. In healthcare, big data is used to track the spread of diseases, determine therapeutics for the sick, and even uncover instances of insurance fraud. Your bank may use big data to combat money laundering, while your investment advisor may use it to develop an optimal financial strategy.

Ultimately, without context the term “big data” doesn’t have any specific meaning and it rarely refers to any particular static dataset. Any analysis can draw from various datasets deemed relevant and used to comprise the big data store. In other words, it is only once a use case is identified that big data takes on any specificity.

Why is big data important?

Big data is important because many present-day questions are too complex and simply cannot be answered without it. Big data is used regularly for business intelligence in a wide range of industries to better understand customers, improve quality, develop innovative new products, uncover criminal activity, discover disruptions in a supply chain and solve long-standing scientific conundrums.

 

Big data also puts to use information that previously went unnoticed, allowing organizations to surface once-hidden insights and connections, usually through intuitive dashboards and visualizations. For example, big data helps businesses find opportunities to reduce costs and improve products by analyzing information about the way those products are manufactured; better understand customer experience through support calls and social media channels; and improve market outcomes by analyzing competitors’ sales data. Without a successful big data strategy, many of these insights would simply not be available.

Big Data Use Cases

What are the types of big data?

In broad terms, data can be categorized as one of three types:

  • Structured data: This is the building block of computing: databases full of customer information and spreadsheets documenting purchases and expenses, held in tools such as Excel, Google Sheets and SQL databases and, when they follow a fixed schema, in file formats such as JSON. Analysts can use structured data to find summations and averages, suss out trends and make quantifiable decisions. Structured data is the bread and butter of any type of analytics, but unfortunately, very little data is inherently structured in nature.
  • Semi-structured data: As a middle ground between unstructured and structured data types, this refers to unstructured data that has been tagged with some form of structured information. When you snap a photograph, for example, your camera may tag it with the time and date the picture was shot and even its GPS location. Because this metadata can be analyzed directly, semi-structured data can be easier to work with than unstructured data, although ultimately the insights are typically found in the unstructured portion of the documents.
  • Unstructured data: This type of data makes up the vast majority of data, existing in the form of YouTube videos, social media posts, podcast files and rolls of photographs, to name just a few examples. While unstructured data is filled with valuable information, it requires big data technologies to glean those insights. Machine learning technologies can analyze archives of photographs to determine the specific contents of each picture, for example. Despite the difficulty of processing it, unstructured data is often available in such vast quantities that even an initial analysis can yield immense business value.
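To make the distinction concrete, here is a minimal Python sketch of how each type might be handled. The records, field names and the keyword check are all invented for illustration; real unstructured analysis would rely on far more capable techniques than a word lookup.

```python
from datetime import datetime

# Structured: fixed fields, ready for sums and averages.
purchases = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 2, "amount": 35.5},
    {"customer_id": 1, "amount": 4.5},
]
average_purchase = sum(row["amount"] for row in purchases) / len(purchases)

# Semi-structured: an unstructured payload (image bytes) tagged with metadata.
photo = {
    "metadata": {"taken_at": "2021-03-14T09:30:00", "gps": (37.78, -122.39)},
    "pixels": b"placeholder image bytes",
}
# The metadata can be queried directly; the pixels cannot.
photo_year = datetime.fromisoformat(photo["metadata"]["taken_at"]).year

# Unstructured: free text with no predefined fields.
post = "Loving the new headphones, but the battery dies too fast."
words = post.lower().replace(",", "").replace(".", "").split()
mentions_battery = "battery" in words  # crude stand-in for real text analysis
```

The structured and semi-structured parts answer questions with a single expression, while the unstructured post needs preprocessing before even a trivial question can be asked of it.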

How is big data used?

Big data becomes most valuable when organizations use a variety of data that includes structured, unstructured and semi-structured datasets in unison to unearth interconnections and patterns that would otherwise be invisible to the user. When applied properly, these techniques allow the development of a vast array of big data use cases.

For example, big data analytics can ingest a company’s sales history, social media posts with keywords related to its products, and various online product reviews to determine whether a certain product should be discontinued, revamped or put on sale. Big data solutions can also ingest genomic data from thousands of patients, along with their medical history, to help determine the specific genes responsible for certain medical conditions and point the way to treatments. It’s also used regularly for oil extraction and other natural resource exploration, drawing on data generated by geological surveys, machinery at nearby drilling sites, and even seismic records to locate new, promising drilling locations.

Big data is used to process seismic information that can help detect and predict earthquakes or locate promising drilling locations.

Benefits and Challenges of Big Data

What are the benefits of big data?

Put simply, big data allows access to insights that would otherwise be unavailable. When used properly in data science, for example, big data can reduce costs, boost sales, optimize pricing, create better targeted marketing and advertising campaigns, and improve customer satisfaction levels. On the product side, big data can be used to improve product performance, reduce waste and overhead, streamline production costs, and improve the uptime of manufacturing equipment. Big data can locate instances of financial fraud and criminal activity, and it can be used to discover previously unknown medical therapies. Depending on the specifics of the industry or company, there is really no limit to the benefits that big data technologies can provide.

What are the challenges of big data?

Generating value from big data is not easy. It requires advanced software, significant expertise and — of course — a lot of data. Here are some of the specific challenges you might encounter getting a big data project underway.

  • Data quality issues: The old adage “garbage in, garbage out” is especially true with big data: If you have a lot of garbage in, you’ll get a lot of garbage out. Big data professionals have to ensure that the underlying datasets with which they are working are high in quality; otherwise they risk generating incorrect, inaccurate or misleading insights.
  • Privacy and compliance concerns: Certain datasets come with risk attached: Financial data may be subject to regulation. Customer information and healthcare information may be subject to compliance regulations such as GDPR or HIPAA. Navigating the regulations around large datasets can quickly get complicated, requiring increased oversight to ensure the organization doesn’t run afoul of relevant laws.
  • Availability and cost of computing power: Processing big data requires big computing resources, in the form of both storage and compute capabilities. This kind of power does not come cheaply, although organizations have the option to “pay as you go” with readily available cloud computing capabilities. Even so, expenses can quickly add up, particularly for organizations new to big data, which may be more likely to lack experienced personnel and expertise, in turn requiring substantial amounts of rework.
  • Lack of available big data talent: Big data remains a scarce skill, making it difficult to find qualified data scientists able to effectively design and execute a big data strategy. Many businesses are choosing to upskill internal staff with the necessary big data expertise rather than attempt to compete for a limited pool of talent in the market.

How It Works

How is big data collected?

Big data can be collected from a wide variety of sources. While the sources of data are theoretically endless, they can include the following:

  • Users: Users can provide data directly by filling out a form or survey, creating a social media post, making a purchase from the company or creating a personal profile, to name a few examples. Some user data can be generated passively, such as through interactions with a website or when logging in and out of a network.
  • Applications: Applications running within the enterprise generate a vast amount of data. Data from security vulnerability scanners, application performance management systems, mail servers, and anything else that generates logs of information can be invaluable when analyzing infrastructure performance.
  • Middleware: The systems that run the core of the enterprise, such as application servers and web servers, can be a trove of big data information.
  • Networks: Network logs are filled with useful information that can help pinpoint network infrastructure problems, including information logged by routers and switches, FTP servers and DHCP servers.
  • Operating systems: Operating systems log performance and error information, which is useful for optimization-oriented big data analyses.
  • Cloud and virtual infrastructure: As data has migrated off-premises and to the cloud, platforms like Google Cloud Platform, Microsoft Azure and Amazon Web Services (AWS) have emerged as key big data generators. The extensive logging capabilities of these services (and the infrastructure running on them) provide significant analytical opportunities.
  • Physical infrastructure: Server hardware, point of sale devices and storage arrays can all offer a deep level of insight to a big data analytics platform. Sensor data sourced from devices embedded in production machinery is among the most valuable forms of big data today.
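Many of the sources above emit plain-text logs, so a common first step is turning each line into a structured record. The sketch below assumes web server lines in a Common Log Format-like layout; the sample lines, the regular expression and the `parse_line` helper are illustrative only and do not reflect any particular product's format.

```python
import re
from collections import Counter

# Assumed layout: ip, two ignored fields, [timestamp], "METHOD path protocol", status, size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for a matching line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record

sample_lines = [
    '203.0.113.7 - - [10/Oct/2021:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2021:13:55:40 +0000] "GET /missing HTTP/1.1" 404 153',
    'not a log line',
]

# Keep only the lines that parsed, then count responses per status code.
records = [r for r in map(parse_line, sample_lines) if r is not None]
status_counts = Counter(r["status"] for r in records)
```

Lines that do not match are skipped here; in practice you would likely want to count or store them, since a rising share of unparseable lines is itself a data quality signal.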

What is big data analytics?

Big data analytics is simply the process of using tools and technologies, such as artificial intelligence, to analyze big data stores, which can sometimes incorporate terabytes or petabytes of data, and to generate actionable insights from them. In other words, big data refers to the data itself, while big data analytics refers to the processing of that data. In practical terms, the term “big data” is often used as a shorthand to mean big data analytics; after all, “big data” without analytics applied to it is functionally useless.

What are big data tools and technologies?

Since the field of big data became popularized in the mid-2000s, it has exploded with a variety of tools and technologies to support big data analytics. Here’s a rundown of some of the major big data tools and technologies available today for processing a high volume of data. While some were developed by private providers, many of these technologies are now open source, and several are managed by the Apache Software Foundation.

  • Hadoop: One of the original and most essential big data analytics frameworks, Hadoop remains a fundamental technology in your data ecosystem, specifically designed to both store and process large volumes of data of almost any type.
  • Apache Spark: Spark touts processing speeds faster than Hadoop’s, thanks largely to in-memory processing, though it lacks a distributed storage mechanism of its own. It’s currently one of the most widely used big data engines, integrating with dozens of additional computing platforms.
  • Storm: Another take on big data processing, Storm is designed to process real-time data rather than batches of historical data (the batch approach is how Hadoop and Spark primarily operate) and is considered one of the fastest big data systems on the market today.
  • Hive: A SQL-based Hadoop add-on used primarily for processing large amounts of structured data.
  • Kafka: A widely used distributed event streaming platform designed for moving and processing real-time data feeds.
  • HPCC: Short for High Performance Computing Cluster, HPCC is a competing platform to many of the above tools, working with both batch and real-time data.
  • Tableau: A popular big data tool (not open source) that is more readily accessible to the masses, Tableau allows non-big data professionals to glean insights from large datasets, though it lacks the enterprise-level power of more sophisticated tools.

While these are some of the foundational technologies in the big data field today, many additional tools are available in what is now a surprisingly crowded market.
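The difference between batch and real-time processing mentioned above can be sketched in plain Python. This is a conceptual illustration only: it uses none of the listed frameworks' APIs, and the readings and the `RunningAverage` class are made up for the example.

```python
def batch_average(readings):
    """Batch style: wait until the full dataset is stored, then compute once."""
    return sum(readings) / len(readings)

class RunningAverage:
    """Streaming style: update the answer as each event arrives."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count

readings = [10.0, 20.0, 30.0]

# Batch: one result, available only after all readings are collected.
batch_result = batch_average(readings)

# Streaming: an up-to-date result after every reading.
stream = RunningAverage()
stream_results = [stream.update(value) for value in readings]
```

Both approaches end at the same average, but the streaming version has a usable answer after the first event and only ever keeps a count and a total in memory, which is the trade-off that matters when data never stops arriving.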

Getting Started

What are some big data best practices?

Big data analytics is complex and can be costly if not undertaken with considerable attention to best practices. Here are some of the key big data principles.

  • Develop goals for your big data strategy before diving in: What are the overarching objectives you are trying to reach? (Better understand customers? Redesign a product? Detect fraudulent behavior?) Before installing software and ingesting data sources, determine what you’re really trying to achieve.
  • Develop a schema and information architecture: Developing an information architecture is critical so organizations can properly handle the ingestion, processing and analysis of data that is too large or complex for traditional data systems. There are many available tutorials to help you get started.
  • Understand what data you have: Taking an inventory of your data can be complex and difficult. Much of this material may be held in databases no longer active, retired backup archives, or in formats that are no longer compatible. You will likely need to do a lot of work to determine exactly what data you have — and what additional data you might need to get started.
  • Determine how clean your data is: Is data corrupt? Does it need to be reformatted into a more useful structure? Does the data actually contain the information you expected it to have?
  • Develop your big data strategy with security in mind: Big data can be a minefield of confidential information, financial data and other sensitive materials. Big datasets can be hacked and exploited just like any other type of data, so you must undertake measures to protect them through encryption, a robust backup strategy, and other data security defenses.
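The questions under "Determine how clean your data is" can be turned into a simple profiling pass. The following is a minimal sketch under assumed inputs: the `profile_records` helper, the sample customer records and the choice of checks (missing values and exact duplicates) are hypothetical, and a real project would add checks specific to its own data.

```python
def profile_records(records, required_fields):
    """Count missing required values and exact duplicates across records."""
    report = {
        "total": len(records),
        "missing": {field: 0 for field in required_fields},
        "duplicates": 0,
    }
    seen = set()
    for record in records:
        for field in required_fields:
            # Treat absent keys, None and empty strings as missing.
            if record.get(field) in (None, ""):
                report["missing"][field] += 1
        # Two records are duplicates if all required fields are identical.
        key = tuple(record.get(field) for field in required_fields)
        if key in seen:
            report["duplicates"] += 1
        seen.add(key)
    return report

customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "a@example.com"},
    {"id": 3},
]
report = profile_records(customers, ["id", "email"])
```

Running a pass like this before any analysis gives a rough measure of how much cleanup the dataset needs and which fields are least reliable.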

What is the future of big data?

In many ways, the future of big data is the future of data: Data volumes continue to increase exponentially. Indeed, IDC predicted in March 2021 that the data created over the next five years will amount to more than twice the data created since the advent of digital storage. And the pandemic-driven rush to remote work environments has only accelerated this trend. Data is created in more places and by more sources than ever, including mobile devices, IoT hardware, social media and more. Determining what is valuable, capturing it and understanding it will pose a significant challenge to the enterprise for the foreseeable future.

The Bottom Line: Big data is an essential tool for generating business insights

Today, no enterprise can thrive without a solid understanding of its data, and increasingly that means understanding data on a massive scale. As a discipline, big data analytics is becoming an essential part of doing business, and few decisions of any importance can now be made without it. Any business looking to maintain competitiveness in the next decade will need to ensure it has a solid understanding of the available big data sources, the tools it needs to analyze that data, and staff trained in related analysis.