
Unified Batch and Stream Processing: How Flink Implements Batch Processing – Flink

There are many technologies for implementing batch processing, from SQL engines for various relational databases to big-data frameworks such as MapReduce, Hive, and Spark. These are the classic ways of processing bounded data streams. Flink's focus is unbounded stream processing, so how does it handle batch processing?

Unbounded stream processing: the input data has no end; processing starts from some point in the present or past and continues without stopping.

The other form of processing is called bounded stream processing: processing starts at one point in time and ends at another. The input data may itself be bounded (i.e., the input data set does not grow over time), or it may be artificially bounded for the purpose of analysis (i.e., only events within a certain time period are analyzed).

Clearly, bounded stream processing is a special case of unbounded stream processing; it simply stops at some point. Furthermore, if results are not produced continuously during the computation but only once at the end, that is batch processing.

Batch processing is a very special case of stream processing. In stream processing, we define sliding or tumbling windows over the data and produce results each time a window slides or tumbles. Batch processing is different: we define a global window, and all records belong to that same window. For example, the following simple Flink program counts a website's visitors every hour, grouped by region.

val counts = visits
  .keyBy("region")
  .timeWindow(Time.hours(1))
  .sum("visits")

If you know the input data is bounded, you can implement the same computation with the following batch-style code.

val counts = visits
  .keyBy("region")
  .window(GlobalWindows.create)
  .trigger(EndOfTimeTrigger.create)
  .sum("visits")

What makes Flink unusual is that it can treat data as an unbounded stream and can also process it as a bounded stream. Flink's DataSet API was designed specifically for batch processing, as shown below.

val counts = visits
  .groupBy("region")
  .sum("visits")

If the input data is bounded, the code above produces the same result as the previous two examples, but it is friendlier to programmers who are used to batch processors.

Flink's batch processing model

Flink supports stream processing and batch processing through a single underlying engine.

On top of the stream processing engine, Flink relies on the following mechanisms (see the sketch after this list):

  • Checkpointing and state mechanisms: used for fault-tolerant, stateful processing;

  • Watermarks: used to implement the event-time clock;

  • Windows and triggers: used to bound the scope of the computation and to define when results are presented.
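To make these three mechanisms concrete, here is a minimal Scala sketch. The Visit event type, its timestamp field, and the example values are illustrative assumptions, not part of the original example.

import java.time.Duration
import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class Visit(region: String, visits: Long, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Checkpointing: snapshot operator state every 10 seconds for fault tolerance.
env.enableCheckpointing(10000)

val visits = env.fromElements(
  Visit("eu", 1, 1000L),
  Visit("us", 1, 2000L))

// Watermarks: drive the event-time clock, tolerating 5 seconds of out-of-orderness.
val withTimestamps = visits.assignTimestampsAndWatermarks(
  WatermarkStrategy
    .forBoundedOutOfOrderness[Visit](Duration.ofSeconds(5))
    .withTimestampAssigner(new SerializableTimestampAssigner[Visit] {
      override def extractTimestamp(v: Visit, recordTs: Long): Long = v.timestamp
    }))

// Windows and triggers: a one-hour tumbling event-time window bounds the
// computation; its default trigger fires when the watermark passes the window end.
val counts = withTimestamps
  .keyBy(_.region)
  .window(TumblingEventTimeWindows.of(Time.hours(1)))
  .sum("visits")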

On top of the same stream processing engine, Flink also has another set of mechanisms for efficient batch processing:

  • Backtracking for scheduling and recovery: introduced by Microsoft's Dryad and now used by almost all batch processors;

  • Special in-memory data structures for hashing and sorting: part of the data can be spilled from memory to disk when needed;

  • An optimizer: shortens the time needed to produce results as much as possible.

These two sets of mechanisms correspond to their respective APIs (the DataStream API and the DataSet API); when creating a Flink job, the two cannot be mixed together to take advantage of all of Flink's capabilities at once.

In the latest versions, Flink supports two relational APIs: the Table API and SQL. Both are unified APIs for batch and stream processing, which means that a relational query executes with the same semantics and produces the same results whether it runs on an unbounded real-time data stream or a bounded historical data stream. The Table API and SQL use Apache Calcite to parse, validate, and optimize queries. They integrate seamlessly with the DataStream and DataSet APIs and support user-defined scalar, aggregate, and table-valued functions.
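As a sketch of that unification: the query below is identical in streaming and batch mode; only the environment settings differ. The table name, its schema, and the assumption that it has already been registered via a connector are illustrative, chosen to match the earlier visits example.

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

// The same query runs in streaming or batch mode; only the settings differ.
val settings = EnvironmentSettings.newInstance().inStreamingMode().build()
// val settings = EnvironmentSettings.newInstance().inBatchMode().build()
val tableEnv = TableEnvironment.create(settings)

// Assumes a table `visits` with columns `region` and `visits` has been
// registered beforehand, e.g. via a connector DDL.
val counts = tableEnv.sqlQuery(
  "SELECT region, SUM(visits) AS total_visits FROM visits GROUP BY region")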

The Table API / SQL is becoming the primary API for analytical use cases, in a way that unifies batch and streaming.

The DataStream API is the primary API for data-driven applications and data pipelines.

In the long run, the DataStream API should completely subsume the DataSet API through bounded streams.
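A minimal sketch of that direction, assuming Flink 1.12 or later: a DataStream program whose sources are all bounded can be executed in batch mode. The example data is illustrative.

import org.apache.flink.api.common.RuntimeExecutionMode
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// With bounded sources, the same DataStream program can run as a batch job.
env.setRuntimeMode(RuntimeExecutionMode.BATCH)

// fromElements produces a bounded source.
val visits = env.fromElements(("eu", 1L), ("us", 2L), ("eu", 3L))

val counts = visits
  .keyBy(_._1)
  .sum(1)

counts.print()
env.execute("bounded-datastream-sketch")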

Flink's batch processing performance

Let us compare the performance of MapReduce, Tez, Spark, and Flink when executing pure batch jobs. The batch jobs tested are TeraSort and a distributed hash join.

The first task is TeraSort, which measures the time taken to sort 1TB of data.

Essentially, TeraSort is a distributed sorting problem, and it consists of the following phases (see the sketch after this list):

(1) Read phase: read partitioned data from files on HDFS;

(2) Local sort phase: partially sort the partitions from the previous phase;

(3) Shuffle phase: redistribute the data to the processing nodes by key;

(4) Final sort phase: produce the sorted output;

(5) Write phase: write the sorted partitions to files on HDFS.
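The following is a minimal sketch of how such a global sort maps onto Flink's DataSet API. This is not the benchmarked TeraSort implementation; the record type and HDFS paths are placeholders.

import org.apache.flink.api.common.operators.Order
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

// (1) Read phase: read (key, value) records from HDFS.
val records = env.readCsvFile[(String, String)]("hdfs:///input")

records
  .partitionByRange(0)                // (3) shuffle: range-partition by key
  .sortPartition(0, Order.ASCENDING)  // (2)/(4) sort each partition
  .writeAsCsv("hdfs:///output")       // (5) write the sorted partitions

env.execute("global-sort-sketch")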

The Hadoop distribution includes an implementation of TeraSort, and the same implementation can be used with Tez, since Tez can execute programs written against the MapReduce API. The Spark and Flink TeraSort implementations were provided by Dongwon Kim. The measurements were performed on a cluster of 42 machines, each with 12 CPU cores, 24GB of memory, and 6 hard disks.

The results show that Flink's sort time was lower than that of all the other systems: MapReduce took 2157 seconds, Tez took 1887 seconds, Spark took 2171 seconds, and Flink took only 1480 seconds.

The second task was a distributed hash join between a large data set (240GB) and a small data set (256MB). The results show that Flink was again the fastest system, taking 1/2 the time of Tez and 1/4 the time of Spark.
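A sketch of how such a join can be expressed in the DataSet API; the schemas and paths are assumptions. joinWithTiny hints that the second input is small enough to broadcast and build a hash table from.

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment

val large = env.readCsvFile[(Int, String)]("hdfs:///large")   // the ~240GB side
val small = env.readCsvFile[(Int, String)]("hdfs:///small")   // the ~256MB side

// Broadcast the small side and hash-join it against the large side.
val joined = large.joinWithTiny(small)
  .where(0)
  .equalTo(0)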

The overall reason for the results above is that Flink's execution is stream-based, which means that the processing stages overlap more and shuffle operations are pipelined, so there are fewer disk accesses. In contrast, MapReduce, Tez, and Spark are batch-based, which means data must be written to disk before being sent over the network. This test demonstrates that, when using Flink, the system has less idle time and performs fewer disk accesses.
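As an illustrative note, Flink's DataSet runtime exposes this behavior as a configurable execution mode: PIPELINED (the default) streams data between stages where possible, while BATCH materializes intermediate results instead.

import org.apache.flink.api.common.ExecutionMode
import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
// PIPELINED is the default: data is streamed between stages where possible.
env.getConfig.setExecutionMode(ExecutionMode.PIPELINED)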

It is worth mentioning that the raw performance numbers may vary with cluster setup, configuration, and software versions.

In conclusion, Flink can use the same data processing framework to handle both unbounded and bounded data streams without sacrificing performance.

