Data quality monitoring made easy
Jul 16, 2021 • 11 min read
Jul 16, 2021 • 11 min read
As we already know, In-Stream Processing is a service that takes events as input and produces results that are delivered to other systems.
The architecture consists of several components:
A message queue serves several purposes for In-Stream Processing: it smoothes out peak loads; provides persistent storage for all events; or it can allow several independent processing units to consume the same stream of events.
Since a message queue initially collects events from all raw data sources, it has to be performant, scalable, and fault tolerant. That’s why it commonly runs on several dedicated servers which form the message queue cluster. The main concepts essential for understanding how highly scalable message queues work are Topics, Partitions, Producers, and Consumers.
Events ordering is guaranteed only inside one partition. Therefore, correct partition design for the original event stream is vital for business applications where message ordering is important.
Another important aspect of queue capabilities is reliability. Message queues must remain available for producers and consumers despite server failures or network issues, with minimal risks of data loss. To achieve that, data in every partition is replicated to multiple nodes of a cluster and persists several times per minute; see the diagram below. The efficient architectural design of these features is extremely important to keep the message queues highly performant.
In case of a failure at the consumer side it might be necessary to re-process data that was already read from the queue. Therefore, the capability to replay the stream starting at some point in the past becomes an essential component of overall reliability in a stream processing service.
An In-Stream Processing application can be represented as a sequence of transformations, as shown in the next diagram. Every individual transformation must be simple and fast. As these transformations are chained together in a pipeline, the resulting algorithms are powerful, as well as rapid.
New processing steps can be added to existing pipelines over time to improve the algorithms rather easily, leading to a fast development cycle of stream applications and extendability of the stream processing service.
At the same time, the transformations must be efficiently parallelizable to run independently on different nodes in a cluster, leading to a massively scalable design.
To assure this efficient parallelization, stream developers operate with two logical instruments: partitions and containers.
Developers need to define the logical model of parallelization by breaking computations into steps that are known as embarrassingly parallel computations. The process is illustrated in the diagram above. Sometimes it is actually necessary to rearrange the stream data in different partitions for different containers. This can be done by re-partitioning. However, developers, beware: re-partitioning is an expensive operation that slows the pipeline speed considerably and should be avoided or at least minimized if at all possible.
Once the model is defined, the application is written using APIs of a particular In-Stream Processing framework, usually in a high-level programming language such as Java, Scala or Python. The stream processing engine will do the rest.
While there are many different In-Stream Processing engines on the market, they mostly follow a very similar design and architecture. Typically, the streaming cluster consists of one highly available container manager and many worker nodes.
Containers are allocated to nodes based on resource availability, so new containers may be launched on any available node. If a node fails, the Container Manager will start up more containers on available nodes and re-run any events that may have been lost.
It is very important that one stream processing cluster can run many streaming applications simultaneously. Basically, an applications is simply a set of containers for the Container Manager. More applications lead to a bigger set of containers being served by the Container Manager.
Machine Learning involves “training” the algorithms, called models, on representative datasets to “learn” the correct computations. The quality of the models depends on the quality of the training datasets and suitability of the chosen models for the use case.
The general approach to machine learning involves three steps:
In-Stream Processing can use the trained models to discover insights. It is rarely used for the training process itself, as the majority of training algorithms do not perform well in the streaming architecture. There are some exceptions; for example “k-means” clustering in Spark streaming.
Time Series Analysis is an area of machine learning for which In-Stream Processing is a natural fit, since it is based on sliding windows over data series. A complication that must be considered carefully is data ordering, since streaming frameworks usually don’t guarantee ordering between partitions, and time series processing is typically sensitive to it.
In-Stream Machine Learning is a young, but highly promising, domain of computer science that is getting a lot of attention from the research community. It is likely that new machine learning algorithms will emerge that can be run efficiently by the stream processing engines. This would allow In-Stream systems to train the models at the same time as running them, on the same machinery, and improve them over time.
Data Ingestion is the process of bringing data into the system for further processing. Data Enrichment adds simple quality checks and data transformations such as translating an IP address into a geographical location or a User Agent HTTP header into the operating system and browser type used by a visitor while browsing a web site. Historically, data was first loaded into batch processing systems, then transformations were made. Nowadays, more and more designs unite Data Ingestion and Data Enrichment into a single In-Stream process, because a) enriched data can be used by other In-Stream applications; and b) end users of batch analysis systems see ready-for-usage data much faster.
The results of all preceding phases are either individual actionable insights picked out of the original stream or a whole data stream, transformed and enriched by the processing. As we have already discussed, the In-Stream Processing service is one component of a wider Big Data landscape. In the end, it produces data used by other systems. Here are several generic use cases that require different interfaces to deliver results of the In-Stream Processing service:
Sergey Tryuber, Anton Ovchinnikov, Victoria Livschitz