Storm Series (Part 2) – Storm Core Concepts Explained

1. Storm Core Concepts

1.1 Topologies

A complete stream-processing application in Storm is called a topology. A topology is a directed acyclic graph of Spouts and Bolts connected by Streams. Once a topology is submitted to the cluster, Storm keeps it running continuously, processing an endless flow of data, until you explicitly kill it.

1.2 Streams (stream)

The Stream is the core concept in Storm. A Stream is an unbounded sequence of Tuples that is created and processed in parallel in a distributed fashion. A Tuple can contain basic data types as well as user-defined types. In short, a Tuple is the actual carrier of the stream's data, and a Stream is a sequence of Tuples.
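To make the idea of a Tuple as a named-value carrier concrete, here is a minimal, hypothetical plain-Java sketch (Storm's real Tuple interface is much richer; the class name and fields below are made up for illustration):

```java
import java.util.List;

// Hypothetical minimal sketch of a tuple: an ordered set of named values.
// A stream would then simply be an unbounded sequence of such tuples.
public class TupleSketch {
    private final List<String> fields;
    private final List<Object> values;

    TupleSketch(List<String> fields, List<Object> values) {
        this.fields = fields;
        this.values = values;
    }

    // Look up a value by its field name, as Storm tuples allow.
    Object getValueByField(String field) {
        return values.get(fields.indexOf(field));
    }

    public static void main(String[] args) {
        TupleSketch t = new TupleSketch(List.of("user-id", "action"),
                                        List.of(42, "click"));
        System.out.println(t.getValueByField("user-id")); // 42
    }
}
```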

1.3 Spouts

Spouts are the sources of data streams. A single Spout can emit data to more than one Stream. Spouts fall into two categories: a reliable Spout can resend a Tuple when processing fails, while an unreliable Spout forgets a Tuple as soon as it has been emitted.

1.4 Bolts

Bolts are the processing units of stream data. A Bolt can receive data from one or more Streams and, after processing, emit the results to new Streams. Bolts can perform operations such as filtering, aggregations, and joins, and can interact with file systems or databases.
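The filtering and aggregation operations mentioned above can be illustrated with a small, self-contained plain-Java sketch (this is not the Storm Bolt API; it only shows what those operations conceptually do to a batch of tuple values):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch of two typical Bolt operations applied to a
// stream of word tuples: filtering and aggregation.
public class BoltOperationsSketch {

    // Filtering: keep only words longer than 3 characters.
    static List<String> filter(List<String> words) {
        return words.stream()
                .filter(w -> w.length() > 3)
                .collect(Collectors.toList());
    }

    // Aggregation: count the occurrences of each word.
    static Map<String, Long> aggregate(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> stream = List.of("storm", "is", "storm", "fast");
        System.out.println(filter(stream));    // [storm, storm, fast]
        System.out.println(aggregate(stream)); // occurrence count per word
    }
}
```

In a real topology these operations would run inside a Bolt's `execute` method, one Tuple at a time, rather than over an in-memory list.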

1.5 Stream groupings (grouping strategies)

Spouts and Bolts run on the cluster as multiple Tasks executing in parallel (in the figure above, each circle represents a Task). When a Tuple needs to be sent from Bolt A to Bolt B, how does the program know which of Bolt B's Tasks should process it?

This is decided by the Stream grouping strategy. Storm comes with eight built-in Stream groupings. You can also define a custom strategy by implementing the CustomStreamGrouping interface.

  1. Shuffle grouping

    Tuples are randomly distributed across the Bolt's Tasks, so that each Task receives roughly the same number of Tuples.

  2. Fields grouping

    The stream is partitioned by the specified fields. For example, if Tuples are partitioned by a user-id field, all Tuples with the same user-id are sent to the same Task.

  3. Partial Key grouping

    Similar to Fields grouping, the stream is partitioned by the specified fields, but the load is balanced between two downstream Tasks, which gives better resource utilization when the incoming data is skewed.

  4. All grouping

    The stream is replicated to all of the Bolt's Tasks. Because every Tuple is processed multiple times, this grouping should be used with caution.

  5. Global grouping

    The entire stream goes to a single one of the Bolt's Tasks, usually the Task with the smallest id.

  6. None grouping

    Currently equivalent to Shuffle grouping: Tuples are distributed randomly.

  7. Direct grouping

    Direct grouping can only be used with direct streams. In this mode, the producer of a Tuple decides exactly which Task of the consumer will process it.

  8. Local or shuffle grouping

    If the target Bolt has one or more Tasks in the same Worker process as the current Task, the Tuple is preferentially shuffled among those in-process Tasks, which minimizes network traffic. Otherwise, it behaves like an ordinary Shuffle grouping.
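The key property of Fields grouping, that equal field values always reach the same Task, can be sketched as hash-based partitioning. The following is a hypothetical illustration, not Storm's actual implementation (which is more involved), and the class and method names are made up:

```java
import java.util.Arrays;

// Hypothetical sketch of how a fields grouping can pick a target Task:
// hash the selected field values and take the result modulo the number
// of Tasks, so identical field values always map to the same Task.
public class FieldsGroupingSketch {

    static int chooseTask(Object[] groupingFieldValues, int numTasks) {
        int hash = Arrays.hashCode(groupingFieldValues);
        // Math.floorMod keeps the index non-negative even for negative hashes.
        return Math.floorMod(hash, numTasks);
    }

    public static void main(String[] args) {
        // Tuples with the same user-id land on the same Task...
        System.out.println(chooseTask(new Object[]{"user-42"}, 4));
        System.out.println(chooseTask(new Object[]{"user-42"}, 4)); // same as above
        // ...while a different user-id may go elsewhere.
        System.out.println(chooseTask(new Object[]{"user-7"}, 4));
    }
}
```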

2. Storm Architecture in Detail

2.1 Nimbus process

Also known as the Master Node, Nimbus is the overall coordinator of a Storm cluster. Its main functions are as follows:

    Receives and monitors Topologies submitted by clients via its Thrift interface;

    Assigns submitted Topologies to Workers according to the cluster's available resources, and writes the assignment results to Zookeeper;

    Listens via the Thrift interface for Topology-code download requests from Supervisors, and serves the code for download;

    Listens via the Thrift interface for statistics requests from the UI, reads the statistics from Zookeeper, and returns them to the UI;

    If the process exits and is immediately restarted on the same machine, cluster operation is not affected.

2.2 Supervisor process

Also known as the Worker Node, the Supervisor is the resource manager of the Storm cluster and starts Worker processes on demand. Its main functions are as follows:

    Periodically checks Zookeeper for new Topology code that has not yet been downloaded to the local machine, and periodically removes the code of old Topologies;

    According to Nimbus's task assignment plan, starts one or more Worker processes on the local machine as needed, and monitors the state of all its Worker processes;

    If the process exits and is immediately restarted on the same machine, cluster operation is not affected.

2.3 The role of Zookeeper

The Nimbus and Supervisor processes are deliberately designed to be fail-fast (they self-destruct on any unexpected condition) and stateless (all state is kept on disk or in Zookeeper). The benefit of this design is that if a process dies unexpectedly, after a restart it simply recovers its previous state from Zookeeper, with no data loss.

2.4 Worker Process

The Worker process is the Task builder of the Storm cluster: it constructs Spout or Bolt Task instances and starts Executor threads. Its main functions are as follows:

    Starts one or more Executor threads in its process according to the task assignment on Zookeeper, and hands the constructed Task instances to the Executors to run;

    Writes heartbeats to Zookeeper;

    Maintains transfer queues and sends Tuples to other Workers;

    If the process exits and is immediately restarted on the same machine, cluster operation is not affected.

2.5 Executor thread

The Executor thread is the task executor of the Storm cluster: it runs Task code in a loop. Its main functions are as follows:

    Executes one or more Tasks;

    Executes the Acker mechanism, sending Task processing status to the Worker where the corresponding Spout runs.

2.6 Parallelism

A Worker process executes a subset of exactly one Topology; a Worker never serves multiple Topologies. A running Topology is therefore made up of multiple Worker processes spread across multiple physical machines in the cluster. A Worker process starts one or more Executor threads to run the Topology's Components (that is, its Spouts or Bolts).

An Executor is a single thread started by a Worker process. Each Executor runs one or more Tasks of one Component.

A Task is the unit that executes a Component's code. After a Topology has started, the number of Tasks for a Component is fixed, but the number of Executor threads used by that Component can be adjusted dynamically (e.g., one Executor thread can run one or more Task instances of the Component). This means that for a given Component, #threads <= #tasks (the number of threads is no greater than the number of Tasks). By default the number of Tasks equals the number of Executors, i.e., each Executor runs exactly one Task.
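The worker/executor/task relationship above comes down to simple arithmetic. The following sketch works through a hypothetical configuration (the numbers and class name are made up for illustration):

```java
// Hypothetical worked example of Storm's parallelism arithmetic: a
// component configured with some number of executors (its parallelism
// hint) and some number of tasks, where #executors <= #tasks.
public class ParallelismSketch {

    // Tasks per executor when tasks divide evenly among executors.
    static int tasksPerExecutor(int numTasks, int numExecutors) {
        if (numExecutors <= 0 || numTasks < numExecutors) {
            throw new IllegalArgumentException("#threads must be <= #tasks");
        }
        return numTasks / numExecutors;
    }

    public static void main(String[] args) {
        int executors = 2; // parallelism hint for the component
        int tasks = 4;     // fixed task count for the component
        // Each executor runs 4 / 2 = 2 Task instances of the component.
        System.out.println(tasksPerExecutor(tasks, executors)); // 2
        // Default case: #tasks == #executors, one Task per Executor.
        System.out.println(tasksPerExecutor(3, 3)); // 1
    }
}
```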

To summarize:

    A running Topology consists of multiple Worker processes spread across the cluster;

    By default, each Worker process starts one Executor thread;

    By default, each Executor thread runs one Task;

    A Task is the unit of code that makes up a Component.


More articles in this big data series can be found in the GitHub open-source project: Big Data Getting Started.
