How does stream processing work?
Stream processing is a low latency way of capturing information about events while it’s in transit, processing data on the fly. A data stream, or event stream, can constitute nearly any type of information — social media or web traffic clickstream data, factory production data and other process data, stock market or financial transaction details, hospital patient data, machine learning system data, Internet of Things (IoT) device condition and other IoT data are all examples of information that is constantly being streamed back to the enterprise.
Architecturally, stream processing involves a framework of data processing software that sits in between the data source (the producers) and the users of that data (the consumers). This stream processing framework allows IT to decide which specific event streams and data feeds are important and how to use that data in the design of new streaming applications that serve a particular purpose.
Stream processing is most useful in cases that require data analytics — in which data is being produced continuously while changing over time. One of the key goals of stream processing is to monitor the data stream for anomalies and outliers and alert IT management if something is going awry. In an Internet of Things example, hundreds of industrial fan sensors may constantly feed their temperature and rotational speed to a logging database. The stream processing system can capture this streaming data as it is being transmitted, before it is stored in the database, giving management an immediate heads up if one of those fans begins to fail.
What is the difference between stream processing and batch processing?
Stream processing involves the real-time analysis of new data in motion, while batch processing involves a periodic analysis of static information.
In batch processing, data produced in the past and held in a file (such as a SQL database), is scoured, manipulated and reported upon when an analyst initiates the action. The processed information is typically known as data at rest, as it’s not actively changing or moving during analysis.
In contrast, stream processing involves real-time analytics of data in motion. Instead of being stored in a static file, streams of data are processed on the fly as they are generated. Because of its immediacy, stream processing generates a faster, up-to-the-moment picture of a device or a service.
For example, imagine a database that is regularly updated with log files produced by a matrix of sensors that monitor the production line and other systems on a factory floor. In a batch processing system, the database would be polled frequently, with regular reports about the aggregate condition of the sensors and any anomalies that might indicate a problem. For operations analysts and administrators, this information is only as current as the timestamp on the report that’s being run, which is only as current as the database’s most recent update. If the report is run hourly, that means the data could be old or outdated by the time it’s received.
In a stream processing system, stream data is captured and processed in flight, giving operations professionals a real-time look at the conditions on the factory floor. Instead of waiting an hour to discover that a machine has already failed, this information is delivered instantaneously, enabling them to respond proactively.
Also, because stream processing does not rely on the data stored in a database, it can be implemented without the storage requirements. As a low-latency solution, only the most relevant information in the data stream is picked out and stored permanently, while the data that isn’t immediately relevant is stored for future searches, security events or audits.
Stateful stream processing involves a type of information in which past data — comprising the data’s “state” — can influence new data and future data. When a web user’s session is logged, each opened link is dependent on the previous link, the one that brought that user to the current page. Conducting streaming analytics on the entirety of a user’s data flow requires keeping the state of the user’s full session in mind from beginning to end. A credit card transaction, for example, generally involves stateful streams of data, as detailed information about the purchase must be kept in memory until the transaction is complete and fraud detection has established the transaction as legitimate.
Stateful stream processing adds a significant extra layer of complexity because state information must be managed for multiple or distributed streams simultaneously. If a stream processor is tasked with monitoring users on a busy website, the data processing system may have to monitor the state for thousands of user sessions all at once. This increases the resource requirements for the network monitoring system considerably and introduces new and complex requirements to the algorithms used to create it. For example, if a stateful stream processing application crashes, a mechanism must be designed to ensure that state information is not lost.
While stateful stream processing is concerned with the overall state of the data (as discussed in the prior section), stateless stream processing is not.
In a stateful stream processing environment, information about past events is used as part of the analysis of current events. For example, temperature readings of an industrial machine are more useful if they are considered in the aggregate and over time, so trends can be spotted as they are developing.
In stateless stream processing, data is analyzed only as it arrives, without regard for state or previous information. If you simply want a real-time feed of the ambient temperature, without regard for how the temperature is changing, a stateless stream processing system will do the job. However, if you want to forecast future temperatures based on how the temperature has been changing over time, a stateful stream processing system will be required.
Stateful stream processing is considerably more complex to code, operate and scale. The more streams that are managed and the bigger the data volumes that each stream produces, the more complex and resource-intensive the stateful stream processing system becomes. However, because stateful stream processing generates far more useful insights than stateless stream processing, it is the dominant form of stream processing in the enterprise today.
A stream processing framework is an end-to-end processing system that provides a dataflow pipeline that accepts streaming inputs for processing while generating useful, real-time analytics. These frameworks are designed to simplify the development of stream processing and event stream processing software used for data streaming (discussed below). With a stream processing framework in place, a developer can quickly include functions from an existing library of tools and avoid having to develop an entire stream processing system from scratch.
Different frameworks can be used depending on the specific enterprise environment and use case. Several of these frameworks — both commercial and open source — are currently used in the enterprise. And while various special-purpose stream processing frameworks have been developed, general-purpose stream processing frameworks tend to be the most popular.
Regardless of the stream processing engines used, the stream processing framework’s job is to accept a pipeline of data as an input, process that data, and send results to an output queue typically known as a sink. The stream processing framework will also include its own programming model, a processing system that defines how to interact with and partition data, as well as how to manage data states, error management controls, and other things.
What is stream processing software?
While a stream processing framework provides the backbone for your analytics, stream processing software (or a stream processing application) built on top of that framework is responsible for the actual analysis. Using a stream processing framework saves time and effort in coding various streaming applications.
Some common stream processing software use cases include:
- Data collection, including multi-cloud integration and stream and message business data.
- Enterprise-wide delivery, including data distribution and data drift monitoring and detection
- In-stream anomaly detection
- In-stream aggregation
- In-stream enrichment and rule-based detection
- Data compliance
- Financial fraud detection; spotting criminal activity in real time
- System monitoring; enabling real-time analysis and predictive maintenance of server hardware, the network, applications, or industrial equipment
- Algorithmic, high-speed securities trading
- Supply chain monitoring and management
- Network intrusion detection
- Analysis of marketing and advertising campaigns; tracking customer activity in real time
- Vehicle traffic monitoring and abatement
While stream processing in a big data ecosystem works essentially the same way as it does in any other environment, stream processing in big data offers a few special advantages. Among these is stream processing’s ability to conduct data analytics without having to access the entire data store. Because big data by definition involves massive databases of unstructured data, batch processing of an enormous big data store is often tediously slow. Stream processing provides a convenient workaround that can generate real-time insights from big data, typically in a matter of milliseconds. Also, since this data is constantly changing, a big data store is never wholly complete in a batch process. Once the batch processing has started, the underlying data will continue to change, so any batch report will be out of date once it is complete. Stream processing offers a smart solution for this situation and other complex event processing.
Batch processing is still valuable in a big data scenario, specifically when long-term, detailed insights are required, which can only be obtained through a complete analysis of the entire data store. When faster and more timely analytics are required, stream processing is the better solution.
Pulsar is a distributed publish and subscribe messaging system that provides very low publish and end-to-end latency, guaranteed message delivery, zero data loss, and infinite data retention with tiered storage.
It was first developed at Yahoo in 2013 to serve as the messaging platform for critical applications such as Yahoo Finance, Yahoo Mail and Flickr. It was added to the Apache Foundation in 2016 and quickly became a top-level project with a vibrant and growing community.
Traditional messaging systems such as Kafka have taken the approach of co-locating their data processing and data storage on the same node. While this monolithic design does provide some performance benefits around serving data from local disk, it also negatively impacts scalability and resilience.
The programming model behind Pulsar Functions is very straightforward. Pulsar Functions receive messages from one or more input topics. Every time a message is published to the topic, the function code is executed. Upon being triggered, the function code executes its processing logic upon the incoming message and writes its output to an output topic. It is possible to have the output topic of one Pulsar Function be the input topic of another, allowing us to effectively create a directly acyclic graph (DAG) of Pulsar Functions.
Pulsar Functions are lightweight compute processes that consume messages from one or more Pulsar topics, apply a user-supplied function (processing logic) to each incoming message, and publish the results to one or more Pulsar topics.
The programming model behind Pulsar Functions is very straightforward. Pulsar Functions receive messages from one or more input topics. Every time a message is published to the topic, the function code is executed. Upon being triggered, the function code executes its processing logic upon the incoming message and writes its output to an output topic. It is possible to have the output topic of one Pulsar Function be the input topic of another, allowing us to effectively create a directly acyclic graph (DAG) of Pulsar Functions
How do I get started with stream processing?
Stream processing generally begins with the enterprise considering the available stream processing frameworks — with a particular eye toward how well they fit in a specific environment and how well they might serve various use cases. Both open source and commercial frameworks are available, which you may need to consider based on the level of support required by your organization.
Some of the questions to ask during your evaluation are likely to include:
- Do you want a data processing framework that includes both batch and stream processing capabilities?
- Does the framework support programming languages in which your staff already has competence?
- How scalable is the overall framework, and can it be clustered as the enterprise grows?
- Is stateful stream processing supported, and how scalable is its state management implementation?
- What are the framework’s approaches to fault tolerance, and how well does it recover from failures or crashes?
- How complex is the platform to develop upon? How hard will it be for developers to get up to speed and actively write code for the framework?
- Is the system agnostic in regards to data sources? Is any cloud service provider supported?
- What system and network monitoring tools are available?
- Is the platform being actively maintained and developed? How quickly are bugs remediated?
Stream processing offers IT managers a new and exciting way to analyze information without having to directly query a database or other data store. This concept creates powerful opportunities for data analysis, including real-time analytics for streaming applications rather than an outdated look at the past. Stream processing, by its nature, is nearly instantaneous to run, enabling you to avoid lengthy delays when running a report against a large database.
The larger and more complex your data streams, the more likely it is that stream processing can benefit your organization. That said, any enterprise with a need for real-time analysis of a stream of information — whether that data is a factory sensor feed, a flow of credit card transactions or something else altogether — can likely benefit from a stream processing system.