Data Streaming: A Complete Introduction

Data streaming is the backbone of many technologies we rely on daily: countless sources generate continuous streams of data that power dashboards, logs and even the music we listen to. Data streaming has become critical for organizations seeking business insights: when you can collect more data from more sources, you have better information to run your business.

This article explains data streaming, including:

  • Streaming data sources
  • The importance of data streaming
  • Differences between traditional batch processing and stream processing
  • The advantages & limitations of some popular data streaming technologies

Let’s get started!

What is data streaming?

Data streaming is the continuous processing and analysis of data from various sources in real time. Streaming data is processed as soon as it is generated.

(This is in direct contrast to batch data processing, which processes data in accumulated batches rather than immediately as it is generated. More on that later.)

Streaming data from various sources can be aggregated to form a single source of truth, which you can then analyze to gain important insights. Organizations can use these insights to:

  • Make quick decisions.
  • Provide a better customer experience.
  • Make business activities more efficient. 
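As a minimal sketch of the contrast between the two processing styles (plain Python, with a list standing in for a live data source), the same records can be handled in batch or as a stream:

```python
# A minimal sketch contrasting batch and stream processing.
# A plain Python list stands in for a live data source.

def process_batch(records):
    """Batch: wait for the full collection, then compute once."""
    return sum(records) / len(records)

def process_stream(records):
    """Stream: update a running average as each record arrives."""
    count, total = 0, 0.0
    for value in records:          # imagine values arriving over time
        count += 1
        total += value
        yield total / count        # an insight is available immediately

readings = [10, 20, 30, 40]
print(process_batch(readings))           # one result at the end: 25.0
print(list(process_stream(readings)))    # a result per record: [10.0, 15.0, 20.0, 25.0]
```

The batch version produces a single answer only after all the data is in; the streaming version yields an up-to-date answer after every record.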

Examples of streaming data sources 

Today, various applications and systems generate streaming data in many formats and volumes. Here are common examples of these data sources and how they are used:

  • Sensors placed in industrial equipment, vehicles and other machinery generate streaming data for applications such as performance monitoring and defect detection.
  • Social media posts, comments, likes and shares generate real-time streaming data.
  • Sensors in IoT devices generate streaming data like weather (temperature, humidity, precipitation, and wind speed) and location data.
  • Multimedia channels like YouTube and Spotify generate streaming audio and video data.
  • Financial institutions use stock market data to update their stock price-related activities.
  • Gaming applications generate data streams from player actions and gaming scores.
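To make the idea of a streaming source concrete, here is a small hypothetical simulation in Python of an IoT weather sensor emitting timestamped readings. The field names and value ranges are illustrative assumptions, not taken from any specific device API:

```python
import random
import time

def sensor_stream(sensor_id, n_events):
    """Simulate an IoT sensor emitting timestamped temperature readings."""
    for _ in range(n_events):
        yield {
            "sensor_id": sensor_id,                              # hypothetical schema
            "timestamp": time.time(),
            "temperature_c": round(random.uniform(18.0, 25.0), 2),
        }

# Consume a few simulated events as they "arrive"
for event in sensor_stream("weather-01", 3):
    print(event)
```

A real deployment would emit events like these continuously over a network rather than from a local generator.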

The importance of data streaming

Traditionally, businesses processed data in batches, collecting it over time to conserve computing resources and processing power. However, with the introduction of IoT sensors and the growth of social media and other streaming data sources, stream processing has become critical for modern businesses.

These sources constantly generate large amounts of data every second, which is difficult to process with traditional batch techniques. Moreover, the volume of data we generate today far outpaces anything that came before, making it even harder to store all of it in a data warehouse as it is generated.

Data stream processing avoids these massive storage needs and enables faster data-driven decisions.

Batch processing vs. stream processing

Batch and stream processing are two ways of processing data. The following table compares the important characteristics of both processing types, including data volume, processing style, latency and cost.



|                           | Batch processing | Stream processing |
|---------------------------|------------------|-------------------|
| Data volume               | Processes large batches or volumes of data. | Processes individual records, micro-batches or small sets of records. |
| How data is processed     | Processes a large batch of data at once. | Processes data as it is generated, either over a sliding window or on the most recent records in real time. |
| Time latency              | High latency, as results wait until the entire batch is processed; latency can range from minutes to hours. | Low latency, as data is processed in real time or near-real time; latency can range from seconds to milliseconds. |
| Implementation complexity | Simpler to implement. | Requires more advanced data processing and storage technologies. |
| Analytics complexity      | Analytics are more complex because large volumes of data must be processed at once. | Analytics run as simple functions over incoming data, making them simpler than in batch processing. |
| Cost                      | More cost-effective, since less demanding processing capabilities suffice; however, data storage costs can be higher. | More expensive, as the processing engine needs fast, real-time capabilities; less expensive for data storage. |
| Use cases                 | Suited for applications like payroll, billing, data warehousing and report generation that run on a regular schedule. | Suited for applications like customer behavior analysis, fraud detection, log monitoring and alerting. |
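The sliding-window processing mentioned above can be sketched in a few lines of Python. This is a simplified illustration, not a production stream processor:

```python
from collections import deque

def sliding_window_average(stream, window_size):
    """Compute the average over the most recent `window_size` records."""
    window = deque(maxlen=window_size)   # old records fall out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)

# E.g. smoothing a stream of stock prices as they arrive
prices = [100, 102, 101, 105, 110]
print([round(v, 2) for v in sliding_window_average(prices, window_size=3)])
# → [100.0, 101.0, 101.0, 102.67, 105.33]
```

Each new record immediately updates the result, and the `deque` with a fixed `maxlen` keeps memory use constant no matter how long the stream runs.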


Key benefits of data streaming

There are several benefits that data streaming technologies bring to a business. Here are some examples:

Provide real-time business analytics and insights

Making quick, accurate and informed decisions brings many competitive advantages in today's fast-paced environment. Data streaming helps realize this by:

  • Enabling real-time data analysis.
  • Surfacing important business insights as soon as the data arrives.

This capability allows businesses to respond quickly, adapt to changes and make better-informed decisions. It is particularly helpful in fast-moving industries like e-commerce, finance and healthcare.

Improve customer satisfaction

Data streaming helps organizations identify possible issues and provide solutions before they affect customers. For example, streaming logs can be analyzed in real-time to find errors and alert responsible parties. This capability allows businesses to provide uninterrupted service and avoid delays, improving customer satisfaction and trust. 
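As a toy illustration of real-time log monitoring, the Python sketch below scans incoming log lines and fires an alert the moment an error appears; the log format and alert callback are hypothetical:

```python
def monitor_logs(log_lines, alert):
    """Scan a stream of log lines and alert on errors as they appear."""
    error_count = 0
    for line in log_lines:
        if "ERROR" in line:
            error_count += 1
            alert(line)            # notify as soon as the error is seen
    return error_count

logs = [
    "INFO  service started",
    "ERROR payment gateway timeout",
    "INFO  request served in 12 ms",
    "ERROR database connection lost",
]
alerts = []
count = monitor_logs(logs, alerts.append)
print(count)     # 2
print(alerts)
```

In production, `alert` would page an on-call engineer or open a ticket, so problems are handled before customers notice them.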

Reduce storage cost

Data streaming can reduce the need for expensive storage infrastructure: large volumes of data are processed and analyzed in flight, so only the results, rather than every raw record, need to land in an expensive data warehouse.

Additionally, data is processed a few records or one small batch at a time, giving businesses the flexibility to scale their data processing capabilities to match their needs.
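Processing a stream in small fixed-size groups is often called micro-batching; a minimal Python sketch of the idea might look like this:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an incoming stream into small fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:              # source exhausted
            break
        yield batch                # hand each small batch to the processor

events = range(7)                  # stand-in for an incoming event stream
print(list(micro_batches(events, batch_size=3)))
# → [[0, 1, 2], [3, 4, 5], [6]]
```

Micro-batching trades a little latency for throughput, which is the design choice engines like Spark Streaming make.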

(Know the difference between data lakes & data warehouses.)

Provide personalized recommendations

Data streaming helps businesses analyze customer behavior in real-time and provide personalized recommendations for customers. It can be useful in applications like e-commerce, online advertising and content streaming.

Challenges & limitations of data streaming

While data streaming brings many advantages to the business, there are also some challenges and limitations, such as:

Challenges for faster data processing and computations

Data streaming applications perform real-time processing by running the required computations over the data. There are two big risks here:

  • Results can be inaccurate if the application cannot compute fast enough to keep pace with the stream.
  • Important information computed over data streams can be lost.

The requirement to maintain data consistency and quality

Streaming data must meet quality standards and be consistent enough to be processed accurately without errors. This is challenging to manage in real time, and low-quality or inconsistent data leads to inaccurate analytics.
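A simple way to picture in-stream quality control is a validation filter placed in front of the analytics step. The schema and value ranges below are illustrative assumptions:

```python
def validate(record):
    """Basic quality checks: required fields present and values in range."""
    return (
        "sensor_id" in record
        and isinstance(record.get("temperature_c"), (int, float))
        and -50 <= record["temperature_c"] <= 60
    )

def clean_stream(stream):
    """Drop records that fail validation before they reach analytics."""
    for record in stream:
        if validate(record):
            yield record
        # a real pipeline might route rejects to a dead-letter queue instead

raw = [
    {"sensor_id": "a1", "temperature_c": 21.5},
    {"sensor_id": "a2"},                         # missing reading
    {"sensor_id": "a3", "temperature_c": 999},   # out of range
]
print(list(clean_stream(raw)))   # only the first record survives
```

Because the check runs per record, bad data is caught immediately instead of silently skewing downstream results.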

Data security requirements

Data streaming systems must be protected against cyberattacks, unauthorized access and data breaches. This can be challenging because data arrives in real time and, most of the time, has to be discarded after processing. Data streams require extra care, especially when the data is sensitive, such as PII or financial transactions, since these are common targets of cyber attackers.

Can become costly over time

While data streaming reduces storage costs, it can become expensive if you need to scale up to handle large volumes of data. In addition, certain computations are more expensive to perform over streaming data. This can make data streaming a challenge for smaller organizations with limited budgets and resources.

Complexity can grow

Implementing and maintaining data streaming systems can be complex and may require specialized skills and expertise. Finding such resources can be challenging for some companies. Furthermore, it may take a significant amount of time to master those skills. 

Efficiency and scalability requirements

Data streaming requires more system resources, such as processing power and memory, and systems must be scalable to handle large volumes of data. This can be a limitation for startups or smaller companies.

Platforms & frameworks used for data streaming

Many companies offer data stream processors that gather large volumes of streaming data in real time, process it, and deliver it to multiple destinations. Some cloud providers also offer managed platforms and frameworks for handling and processing streaming data. Here are some popular stream processors and platforms that help organizations collect, process and analyze data from multiple streaming sources:

  • Apache Kafka. A distributed streaming platform for building real-time data pipelines and streaming applications.
  • Amazon Kinesis. A fully managed service offered by AWS for analyzing streaming data such as application logs, video, audio, website clickstreams, etc.
  • Google Cloud Dataflow. A fully managed service offered by Google for batch and stream processing. It allows the implementation and execution of streaming data processing pipelines.
  • Apache Spark Streaming. An extension of the open-source Apache Spark platform that supports processing both historical and real-time streaming data and integrates with other popular streaming technologies like Kafka and Flume.
  • Azure Stream Analytics. A real-time data streaming and analytics service provided by Microsoft. It allows you to process and analyze large amounts of streaming data from various sources.
  • Apache Flink. An open-source framework that provides high-throughput, low-latency processing for batch processing, stream processing, and event-driven applications.
  • Apache Storm. A distributed real-time streaming platform widely used for use cases like continuous computation, machine learning, and real-time analytics.
(Our very own Splunk Data Stream Processor, a long time data streaming service, is no longer available for new sales, but there are other options available for bringing your data into Splunk.)
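All of these platforms build on the publish/subscribe pattern: producers write events to named topics, and consumers read them independently. The toy in-memory broker below sketches that pattern in plain Python; real systems like Kafka add partitioning, replication and durable storage, all omitted here:

```python
from collections import defaultdict, deque

class MiniBroker:
    """A toy in-memory broker illustrating publish/subscribe.
    Partitions, replication and persistence are all omitted."""

    def __init__(self):
        self.topics = defaultdict(deque)

    def produce(self, topic, message):
        """Append a message to the end of a topic's queue."""
        self.topics[topic].append(message)

    def consume(self, topic):
        """Yield messages from a topic in arrival order."""
        while self.topics[topic]:
            yield self.topics[topic].popleft()

broker = MiniBroker()
broker.produce("clicks", {"user": "u1", "page": "/home"})
broker.produce("clicks", {"user": "u2", "page": "/pricing"})
for msg in broker.consume("clicks"):
    print(msg)   # events come back in the order they were produced
```

Decoupling producers from consumers through a topic is what lets these platforms fan one stream out to many independent applications.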

From data streams to data rivers

Data streaming is the technology that processes continuously generated data in real time. Today, numerous sources generate streaming data, so it is critical to have an efficient stream processor in place for processing and analyzing that data and delivering the results to multiple destinations. Data streaming differs from batch processing in data volume, processing style, latency, complexity, cost and many other ways.

Data streaming offers several benefits, including real-time insights and improved customer satisfaction. However, it also has limitations, like the need to invest in processing power and security and the need to maintain data quality and consistency, which can be challenging for smaller organizations with limited budgets. Today, several data streaming technologies are available to choose from.

This posting does not necessarily represent Splunk's position, strategies or opinion.

Shanika Wickramasinghe is a software engineer by profession and a graduate in Information Technology. Her specialties are Web and Mobile Development. Shanika considers writing the best medium to learn and share her knowledge. She is passionate about everything she does, loves to travel and enjoys nature whenever she takes a break from her busy work schedule. She also writes for her Medium blog sometimes. You can connect with her on LinkedIn.