Here at Streamlio we're currently hard at work building the world's first enterprise-grade, end-to-end real-time platform. With the release of the Streamlio Sandbox preview we provided an initial glimpse of what we’ll be rolling out, but there’s much more on the way soon.
We've also begun to provide an ever-more detailed picture of the technologies that serve as the building blocks for our real-time platform, and why we chose them. You can see that discussion in a series of blog posts on Apache Pulsar (incubating), Heron, and Apache BookKeeper (blog posts coming soon), the three core systems that make up Streamlio and provide its durable messaging, compute, and stream storage capabilities, respectively.
What we haven't yet done is paint in broader strokes and lay out what we think a real-time platform is and what it should provide. In this post I'll continue to flesh out the durable messaging, compute, and stream storage pieces, as we've done in previous posts, but I'll also step back a bit and talk about these integral pieces with an eye to what they offer to the greater whole.
Messaging is the nervous system of a real-time platform. The messaging system connects components together and serves as the fundamental data pipeline not just within the platform, providing communication between system components, but into and out of the platform as well, enabling interaction from and between applications that target the platform.
But a nervous system is of limited value without memory. If a message can enter the platform and get dropped for any reason then the entire platform is deficient, plain and simple. The platform is also deficient if system failures—hardware failure, network partitions, and so on—lead to unprocessed messages simply vanishing into the ether. Why is this so crucial? Because the mission of a real-time platform is to perform some specified set of operations for every request that enters the system. For real-time computations (more on compute below), especially stateful computations, the outcome of each request may impact the outcome of future requests. Even one failure can have consequences that fan out onto an indefinite number of future computations. Durable messaging is the solution to this problem.
Imagine a use case like real-time auction pricing. When someone bids on an item, that single request affects how the pricing system responds to all future bids for the item. If someone bids $100 while no one else has bid over $80, then dropping that bid introduces an unnecessary distortion into the auction (not to mention that the seller loses money directly). Dropping bids also means losing information about user behavior that could inform entire future auctions.
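To make the cost concrete, here's a toy sketch in plain Python (no Streamlio APIs; the auction logic is invented purely for illustration) of how losing a single bid changes an auction's outcome:

```python
# Illustrative only: a toy auction that tracks the highest bid.
# Dropping one message in transit changes the final price.

def run_auction(bids):
    """Return the winning (highest) bid from a stream of bid amounts."""
    highest = 0
    for bid in bids:
        if bid > highest:
            highest = bid
    return highest

all_bids = [60, 80, 100, 90]   # every bid that was actually placed
delivered = [60, 80, 90]       # the $100 bid was dropped in transit

print(run_auction(all_bids))   # 100 -- the correct price
print(run_auction(delivered))  # 90  -- the seller loses $10
```

The distortion is silent: nothing in the delivered stream indicates that a bid went missing, which is why durability has to be a property of the messaging layer itself.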
For Streamlio, we've chosen Apache Pulsar (incubating) because we needed more than just a trustworthy set of “pipes” through which messages can be passed. We needed a system that prevents scenarios like the auction pricing scenario above. We knew we could rely on Pulsar because it uses Apache BookKeeper, a production-ready log storage system, for durable message storage, which enables it to guarantee zero data loss even under significant traffic in production. The messaging component of a real-time platform needs to provide both a multi-faceted data pipeline and guaranteed message durability, or else it quite simply fails to fulfill even the most lax of real-time requirements.
In one sense, the role of the compute layer is straightforward enough. A real-time platform needs to be more than just a set of channels letting data flow from one place to another. It has to do some kind of real computational work, be it in the form of applying complex algorithms, detecting anomalous patterns, aggregating data from different sources, sanitizing inputs, and so on. And so the compute layer clearly needs to be powerful and efficient so that it can produce near-real-time results (just a few milliseconds for each computation) based on real-time inputs.
From a more pragmatic workflow perspective, the compute layer also needs to be highly accessible to developers. It needs to provide readily comprehensible abstractions and a rock-solid deployment story that requires as few development resources as possible. Developers nowadays have gotten used to deploying applications on PaaSes like Heroku and on container-driven platforms like Kubernetes (and rightfully so!). Deploying to the compute layer of a real-time platform should involve simply packaging your processing job (for the Streamlio platform, your Heron topology) into a single artifact and running a single command, and absolutely no more than that.
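As an illustration, submitting a topology with the open-source Heron CLI follows this shape (the cluster name, jar path, class, and topology name below are all hypothetical placeholders, not Streamlio-specific values):

```shell
# Package the topology into a single jar, then submit it with one command.
# "local" is the target cluster; the jar, class, and name are placeholders.
heron submit local \
  ~/topologies/my-topology.jar \
  com.example.MyTopology \
  my-topology
```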
In other words, the compute layer of a real-time platform should feel like using a PaaS. It should easily accommodate large organizations with many teams and departments building, testing, tuning, and running hundreds or even thousands of compute jobs of varying complexity, while also controlling costs by allowing for granular, application-level resource management.
We chose Heron because we know for a fact that Heron can provide all of the above. Heron served Twitter's extremely demanding real-time compute needs for many years, backing services used by hundreds of millions of users (like the timeline service). There are quite simply precious few real-time needs that it hasn't already met in production.
The role of the stream storage layer is a bit trickier to grasp at first than messaging and compute because although we're all familiar with a wide variety of storage scenarios, from basic web app backends to HDFS to SQL-driven reporting systems, real-time has very specific storage requirements—requirements that aren't terribly well served by traditional databases, be they relational, "NoSQL," graph, or other. A storage system fit for a real-time platform has to do three things. It needs to:
1. Provide durable storage for the messaging system (lost messages mean lost computations, which mean lost revenue).
2. Provide durable storage that both the compute layer and external applications can access via an intuitive interface that developers can readily understand. In fact, using this storage layer shouldn't really feel like using a traditional database at all; state storage and retrieval should be baked into the processing layer and largely handled by background processes.
3. Enable versioning and distribution of real-time job artifacts (for use by applications or future jobs).
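As a rough sketch of the second requirement (everything here is hypothetical plain Python, not a Streamlio or BookKeeper API), state access from a processing job might feel like an ordinary key-value interface, with durability handled behind the scenes by an append-only log that can be replayed after a failure:

```python
# Illustrative sketch: a key-value state interface whose durability comes
# from an append-only log (standing in for a BookKeeper ledger), with
# recovery by replaying the log after a crash.

class DurableState:
    def __init__(self, log=None):
        self.log = log if log is not None else []
        self._state = {}
        for key, value in self.log:      # recover by replaying the log
            self._state[key] = value

    def put(self, key, value):
        self.log.append((key, value))    # persist to the log first...
        self._state[key] = value         # ...then update in-memory state

    def get(self, key, default=None):
        return self._state.get(key, default)

state = DurableState()
state.put("bids:item-42", 100)

# Simulate a crash: rebuild the state from the surviving log alone.
recovered = DurableState(log=state.log)
print(recovered.get("bids:item-42"))  # 100
```

The point of the sketch is the developer experience: the job only ever calls `put` and `get`, while persistence and recovery happen underneath rather than through an explicit database round-trip.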
We chose Apache BookKeeper for Streamlio's storage layer because we're convinced that it's uniquely well suited for real-time requirements. Apache BookKeeper performs three roles in the Streamlio platform:
- BookKeeper quietly but deftly ensures that Streamlio provides effectively-once semantics with zero data loss (satisfying requirement #1). BookKeeper does this by providing lossless persistent message storage for Streamlio’s Pulsar-based messaging system.
- BookKeeper provides a storage system that your Streamlio-deployed compute applications (Heron topologies) can use as a simple but capacious storage backend for stateful processing (satisfying requirement #2).
- Streamlio's BookKeeper-based storage can be utilized not just by your compute jobs but also by applications that are external to the platform (also satisfying requirement #2).
You also still have the freedom to use whatever databases you'd like with the Streamlio real-time platform. You can store the results of computations as rows in PostgreSQL or JSON objects in MongoDB, and we'll be providing a sources and sinks interface that enables you to do this easily. But you should use things like database sinks only once you have a desired result from the compute layer itself. The compute layer should rely on BookKeeper-based stateful storage whenever possible, as BookKeeper is “baked in” and requires no extra deployment or management.
If you were building a surge pricing system, for example, Streamlio’s BookKeeper-backed storage layer would allow you to keep up-to-date, strongly consistent tallies of purchases within a geographical area, aggregated user preference data, information about geographical “hot spots,” and much more. In a full-fledged real-time platform that truly lives up to enterprise real-time requirements, these kinds of stateful compute jobs should be blazing fast and building them should require very few engineering resources.
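A stateful surge-pricing job of this kind ultimately comes down to keeping running aggregates keyed by area. As a plain-Python sketch (no Streamlio or Heron APIs; the class, threshold, and multiplier rule are invented for illustration):

```python
from collections import defaultdict

# Illustrative sketch: per-area purchase tallies driving a naive
# surge-pricing rule. In a real job this state would live in the
# platform's stateful storage rather than a local dict.

class SurgeTracker:
    def __init__(self, surge_threshold=3):
        self.counts = defaultdict(int)   # purchases per geographical area
        self.surge_threshold = surge_threshold

    def record_purchase(self, area):
        self.counts[area] += 1

    def multiplier(self, area):
        # Naive rule: areas at or past the threshold get a 1.5x price.
        return 1.5 if self.counts[area] >= self.surge_threshold else 1.0

tracker = SurgeTracker()
for area in ["downtown", "downtown", "airport", "downtown"]:
    tracker.record_purchase(area)

print(tracker.multiplier("downtown"))  # 1.5 -- a "hot spot"
print(tracker.multiplier("airport"))   # 1.0
```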
You can see a visual representation of the complete Streamlio platform in Figure 1 below.
Figure 1. The unified Streamlio platform
Messaging, compute, and storage are inseparable from one another. Lose messaging and the compute layer has nothing to work with; lose compute and the platform passes data around but doesn’t perform any real work; lose storage and your compute jobs will have difficulty with stateful computations and your messaging system will drop data.
But Streamlio, uniting best-of-breed technologies into a seamless whole, is very much on the way: a full-fledged real-time platform that satisfies the most exacting requirements of the most complex organizations and use cases. Not only that, large organizations will be able to deploy hundreds or even thousands of real-time applications on the Streamlio platform, and they'll be able to build and deploy those applications with ease. This is what a real-time platform should provide, and no less.