First, static data and streaming data
Static data: In order to support decision analysis and construction of data warehouse system, in which a large amount of historical data that is stored in the static data.
Data stream: data to large, rapid, continuous flow in the form of time-varying arrival. (Example: real-time generated logs the user real-time trading information)
Data stream has the following characteristics:
(1), the data continue to arrive quickly, the potential size may be endless. (2), a number of data sources, the complex format. (3) large amount of data, but not very concerned about the store, once processed, or discarded, either archive storage (stored in the data warehouse). (4), focusing on the overall value of the data, not overly concerned about the individual data. (5), the data in reverse order, or incomplete, the system can not control the new sequence data elements to be processed arrives.
In the conventional data processing flow, always first collect data and then places the data in the DB. Then the processed data DB.
Flow calculation: In order to achieve the timeliness of data, real-time consumption data acquisition.
Second, the bulk flow calculation and calculation
Bulk is calculated: static data sufficient time to process, such as Hadoop. Less demanding real-time.
Flow Calculation: mass data acquired in real time from various data sources, real-time analysis after treatment to obtain valuable information (real-time, multi-data structure, mass).
Stream Computing adhering to a basic idea that the value of data over time lowered, such as user clickstream. Therefore, when an event occurs it should be treated immediately instead of cached batch processing. Complex format data stream, a number of sources, a huge amount of data, is not suitable for computing quantities, must be calculated in real time, the response time is several seconds, real-time requirements. Batch throughput computing concern, interest is calculated real-time stream.
Flow calculation features:
1, real-time (realtime) and unbounded (unbounded) data stream. Flow calculation is calculated in real time and face streaming, the stream flow data is calculated subscription and consumption takes place in chronological order. And due to continuous, long and continuous data flow occurs into the integrated data flow computing system. For example, access to the site click on the log stream, as long as it does not close the site click on the log stream has been kept and the traffic generated computing system. Therefore, for the flow system, the data is not real-time termination (unbounded) of.
2, continuous (Continuos) and computationally efficient. Stream computing is a “trigger event” computing model, trigger source is above the unbounded streaming data. Once a new data stream into the stream computing, stream computing and immediately initiate a computing tasks, so that the entire stream computing is ongoing.
3, and real-time data stream (Streaming) integration. Triggering a data stream flow calculation results can be written directly to the data storage purposes, for example, report data is written directly to an RDS calculating reports show. Thus the calculation result as the streaming data can be continuously written stream data similar data storage purposes.
Third, the calculated flow frame
In order to timely processing of streaming data, it requires a low delay, scalable, highly reliable processing engine. For a flow computing system, it should meet the following requirements:
- High performance: the basic requirements of large data, such as processing hundreds of thousands of data per second.
- Massive formula: TB-level support and even the size of PB-level data.
- Real-time: to ensure low latency time to achieve the second level, or even milliseconds.
- Distributed: support large data base architecture must be able to smooth extension.
- Ease of use: to quickly develop and deploy.
- Reliability: stream data reliably.
There are three common framework and stream computing platforms: commercial-grade stream computing platform, open source stream computing framework, stream computing framework to support the company’s business development itself.
(1) Commercial grade: InfoSphere Streams (IBM) and StreamBase (IBM).
(2) Origin calculation opening frame, represent the following: Storm (Twitter), S4 (Yahoo).
(3) computing framework to support the company’s own business development stream: Puma (Facebook), Dstream (Baidu), Galaxy stream data processing platform (Taobao).
Fourth, the flow computing framework Storm
Twitter Storm is an open source distributed real-time big data processing framework, with the increasingly widespread application of stream computing, Storm visibility and increasing role. Next comes the core components of Storm and performance comparison.
Storm core components
- Nimbus: That’s Storm Master, responsible for resource allocation and task scheduling. A Storm cluster is only a Nimbus.
- Supervisor: namely Storm’s Slave, is responsible for receiving Nimbus task assigned to manage all Worker, a Worker Supervisor node contains multiple processes.
- Worker: work processes, each worker process has multiple Task.
- Task: tasks, each Spout and Bolt by a number of tasks (tasks) performed in Storm cluster. Each task corresponds to a thread of execution.
- Topology: computing topology, Storm topology is a package for real-time computing application logic, its role and MapReduce tasks (Job) is very similar, except that a MapReduce Job After getting the end result will always, and will always be a cluster topology run until you manually to terminate it. Topology may also be understood as a series of topology data flows (Stream Grouping) and Bolt interrelated consisting of Spout.
- Stream: data stream (Streams) is Storm in the core abstraction. It refers to a data stream created in parallel in a distributed environment, a set of tuples (tuple) processed unbounded sequence. Data stream may be capable of the expression pattern of one domain data stream of tuples (Fields) is defined.
- Spout: a data source (a Spout) is the source of the data stream topology. Spout tuple typically reads data from an external source and then sends them to the topology. According to different needs, it may be defined as a Spout reliable data sources, may be defined as the unreliable data sources. A reliable Spout able to re-send the tuple in the tuple it sends the process fails to ensure that all tuples can get the right treatment; corresponding to unreliable Spout will not be sent to after-tuple tuple any other processing. Spout can transmit a plurality of data streams.
- Bolt: topology data processing are all done by the Bolt. Data by filtering (filtering), the processing function (functions), the polymerization (aggregations), the coupling (joins), database interaction functions, Bolt can be virtually any kind of data needs to complete. Bolt can be a simple data stream conversion, and more complex data stream into a plurality of Bolt and typically require the use of multiple steps to complete.
- Stream grouping: To determine the topology of the input data stream each Bolt important part is to define a topology. Defines a packet stream partitioned manner in the data flow of Bolt different tasks (Tasks) in. There are eight built-in data stream are grouped in the Storm.
- Reliability: Reliability. Storm can be ensured by the topology of each tuple can send handled correctly. By tracking each tuple emitted from Spout whether the tuple consisting of the tree can be determined tuple has completed processing. Each topology has a “message delay” parameter, whether a tuple if Storm processing is completed, it will process the tuple is marked as failed, and re-transmitted at a later time in the tuple is not detected within the delay time .
(FIG. 1: Storm core components)
(FIG. 2: Storm programming model)
Contrast mainstream computing engine
The more popular real-time processing engines Storm, Spark Streaming, Flink. Each engine has its own characteristics and application scenarios. The following table is a simple comparison of these three engines.
(FIG. 3: Performance Comparison main engine)
Summary: The emergence of stream computing expands our ability to deal with complex real-time computing needs. Storm as a flow calculation tool, which greatly facilitated our application. Flow calculation engine still evolving, has greatly improved JStorm, Blink-peer computing engine based Storm Flink and development in all aspects of performance. Stream Computing worthy of our continued attention.
Author: Yao Yuan
Source: CreditEase Institute of Technology