Overview of In-Stream Processing solutions on the market
This post contains a brief survey of the better-known In-Stream Processing products available on the market at the time of this writing. In this survey, we focus on the critical architectural differences, rather than functional ones, that lead customers to choose one approach over the others.
Specifically, we focus on the following big questions asked from the point of view of the customer’s chief architect:
- Build or buy: Do I want to design the overall system from components in-house and have freedom to evolve it over time, or do I buy a vendor product and live with its constraints?
- Importance of open source: Do I want to rely on the open source community as a source of innovation and hire developers from the market who love to work on open source stacks? Or do I prefer to rely on a vendor to guarantee product quality, to provide a “single throat to choke”, and to bring their integrators and MSPs to implement and manage the product?
- Cloud lock-in: Am I willing to commit to a specific cloud platform vendor and have a solution that’s not portable to other infrastructures? Since In-Stream Processing APIs work in conjunction with messaging, lookup stores, operational stores, and other APIs, this choice dictates the selection of that cloud provider for other computational needs.
- Big Data platform lock-in: Am I choosing In-Stream Processing as a “feature” of a complete, enterprise-grade commercial Big Data platform (probably because my company already has such a platform), or would I like to treat In-Stream Processing as a self-contained, loosely-coupled service that integrates with any Big Data infrastructure?
- Blueprint or clean slate: Do I want to reuse an integrated design blueprint and recommended production configurations, developed by a professional services vendor based on multiple successful large-scale implementations of In-Stream Processing systems, or do I rely on an in-house team to design all aspects of the system from scratch?
The choices are captured in the following decision flow diagram:
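As a rough companion to the diagram, the five questions above could be chained into a small decision function. This is a minimal sketch, not the authors' diagram: the `Preferences` fields, the order of the checks, and the outcome labels are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preferences:
    """Hypothetical answers of a chief architect to the five questions."""
    build_in_house: bool                     # build (True) or buy (False)
    prefers_open_source: bool                # community-driven stack vs. vendor guarantees
    accepts_cloud_lock_in: bool              # already committed to one cloud and its APIs
    accepts_big_data_platform_lock_in: bool  # streaming as a feature of an existing platform
    reuse_blueprint: bool                    # reuse a reference architecture vs. clean slate


def recommend(p: Preferences) -> str:
    """Return an illustrative category of solution; the check order is assumed."""
    if p.accepts_cloud_lock_in:
        return "streaming API of the chosen cloud provider"
    if p.accepts_big_data_platform_lock_in:
        return "streaming feature of the existing Big Data platform"
    if not p.build_in_house or not p.prefers_open_source:
        return "portable vendor product"
    if p.reuse_blueprint:
        return "open source stack based on a blueprint"
    return "open source stack designed from scratch"


if __name__ == "__main__":
    architect = Preferences(
        build_in_house=True,
        prefers_open_source=True,
        accepts_cloud_lock_in=False,
        accepts_big_data_platform_lock_in=False,
        reuse_blueprint=True,
    )
    print(recommend(architect))
```

Placing the two lock-in checks first reflects the point made later in the post that an existing strategic commitment tends to settle the question; a team with different priorities could reorder the checks.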
Why Big Data vendor lock-in is a concern
Many vendors offer In-Stream Processing as a “feature” of a broader Big Data processing platform rather than as a separate service that is loosely coupled with their Big Data platform and, therefore, can be integrated with other Big Data platforms and services. For many customers, tight coupling of the In-Stream and Big Data processing platforms is not practical because technology decisions about Data Warehouses, Data Lakes and Batch Analytics are made at different times, by different organizations, based on different selection criteria than those used to choose an In-Stream Processing platform. Even if a comprehensive Big Data platform is already in place, the choice of a stream processing feature for that platform shouldn’t be predefined, since a standalone, self-sufficient In-Stream Processing product may fit actual and prospective business requirements much better.
Why cloud vendor lock-in is a concern
Big Data applications are big drivers of cloud infrastructure adoption, so it should not surprise anyone that all major cloud providers are investing heavily in Big Data APIs in general, and streaming APIs in particular. Choosing a specific cloud vendor for streaming APIs has several compelling advantages, including speed of implementation, SaaS consumption and delivery model, and integration with other APIs of the cloud platform. The major concern, of course, is the implication of that choice: in all likelihood, getting out of that cloud platform later will not be practical without massive costs.
The choice to pick a specific cloud API should not be made lightly. If your company has already made a strategic commitment to a specific cloud and its APIs, it might be a moot point. The APIs of that cloud provider should be considered the default choice because, presumably, that’s why you chose that provider. However, if your company has not yet made such a commitment or has adopted a more balanced multi-cloud strategy, cloud portability is an essential consideration. The preferred choice would most likely be open source technologies or vendor products that can be deployed and run on any cloud with minimal operational implications.
What’s in the blueprint?
Blueprints, sometimes called reference architectures, can be powerful accelerators and enablers for companies that have decided to build their own systems using open source technologies deployable on any cloud rather than to buy vendor products.
Such companies are Grid Dynamics’ traditional customers. They face a substantial challenge in figuring out:
- Which APIs to choose from a large (and growing) pool of alternatives
- How to integrate all the components into a working system
- How to configure each component separately, and the whole system together, for required levels of performance, availability, scalability, and durability
- Which tools to use for deployment, monitoring, logging, and visualization
- How to provision development and testing environments to support the release pipeline
- And many more...
Knowing how to make the right design choices and how to answer these and similar questions is our business. Grid Dynamics is an engineering services company specializing in Big Data in general, and In-Stream Processing in particular, using open source technologies and cloud environments.
Besides working on customer projects, we run a research lab where our architects identify recurring business use cases that can be addressed with repeatable design patterns, and then turn those patterns into reusable blueprints. These blueprints are our intellectual property, and we make them freely available to our user community.
When a blueprint matches the business use case closely, the time-to-market can be 30% to 50% faster than starting from scratch. That’s because a lot of design choices have already been made, tools pre-integrated, and environments pre-configured. Making modifications to a proven design is much faster than creating a brand-new design.
Grid Dynamics makes money by consulting on design modifications, providing implementation services, and managing the resulting systems according to SLAs. This works well for our customers, who can rely on us as a design, implementation, and managed services partner to supplement their in-house teams — using 100% open solutions developed in a fraction of the time and at a fraction of the cost of proprietary alternatives. Needless to say, this works well for us, too; we get to monetize our experience and research by providing value other vendors can’t. And if a customer chooses to use our blueprints without our help, we are still delighted, as that’s how we gain loyal friends in high places.
For all these reasons, we have created a blueprint called In-Stream Processing Service that will be described in detail.
- Apache Kafka Documentation
- Spark streaming documentation
- Apache Cassandra documentation
- Redis documentation
Other In-Stream Vendors
Sergey Tryuber, Anton Ovchinnikov, Victoria Livschitz