Understanding parallel execution in Storm: the relationship between workers, executors, and tasks, and the scheduling algorithm

The official Storm wiki explains workers, executors, and tasks very clearly (https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology), so it is reposted here on my personal blog. A picture is worth a thousand words:

Storm distinguishes between the following three main entities that are used to actually run a topology in a Storm cluster:

  1. Worker processes
  2. Executors (threads)
  3. Tasks

Here is a simple illustration of their relationships:

A worker process executes a subset of a topology. A worker process belongs to a specific topology and may run one or more executors for one or more components (spouts or bolts) of this topology. A running topology consists of many such processes running on many machines within a Storm cluster.

An executor is a thread that is spawned by a worker process. It may run one or more tasks for the same component (spout or bolt).

A task performs the actual data processing: each spout or bolt that you implement in your code executes as that many tasks across the cluster. The number of tasks for a component is always the same throughout the lifetime of a topology, but the number of executors (threads) for a component can change over time. This means that the following condition holds true: #threads ≤ #tasks. By default, the number of tasks is set to be the same as the number of executors, i.e. Storm will run one task per thread.

Configuring the parallelism of a topology

Note that in Storm’s terminology "parallelism" is specifically used to describe the so-called parallelism hint, which means the initial number of executors (threads) of a component. In this document though we use the term "parallelism" in a more general sense to describe how you can configure not only the number of executors but also the number of worker processes and the number of tasks of a Storm topology. We will specifically call out when "parallelism" is used in the normal, narrow definition of Storm.

The following sections give an overview of the various configuration options and how to set them in your code. There is more than one way of setting these options though, and the listing below covers only some of them. Storm currently has the following order of precedence for configuration settings: defaults.yaml < storm.yaml < topology-specific configuration < internal component-specific configuration < external component-specific configuration.
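As a rough sketch of where these layers live (assuming the standard Config/TopologyBuilder API; the particular settings used here are only illustrations): topology-specific settings go on the Config object you submit with, external component-specific settings are attached to the spout/bolt declarer, and internal component-specific settings are returned from a component's own getComponentConfiguration() method.

Config conf = new Config();   // topology-specific configuration
conf.setNumWorkers(2);
conf.setDebug(false);

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setDebug(true)  // external component-specific configuration, set on the declarer
               .shuffleGrouping("blue-spout");

// Internal component-specific configuration is returned by the component itself,
// by overriding getComponentConfiguration() in your spout or bolt class.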

Number of worker processes

  • Description: How many worker processes to create for the topology across machines in the cluster.
  • Configuration option: TOPOLOGY_WORKERS
  • How to set in your code (examples):
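For example (a minimal sketch; Config#setNumWorkers is the standard helper for TOPOLOGY_WORKERS):

Config conf = new Config();
conf.setNumWorkers(2); // ask Storm for two worker processes for this topology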

Number of executors (threads)

  • Description: How many executors to spawn per component.
  • Configuration option: None (pass the parallelism_hint parameter to setSpout() or setBolt())
  • How to set in your code (examples):
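For example (a minimal sketch; the third argument to setSpout()/setBolt() is the parallelism hint, i.e. the initial number of executors for that component):

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // run BlueSpout with two executors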

Number of tasks

  • Description: How many tasks to create per component.
  • Configuration option: TOPOLOGY_TASKS
  • How to set in your code (examples): ComponentConfigurationDeclarer#setNumTasks(), as in the snippet below.

Here is an example code snippet to show these settings in practice:

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout);

In the above code we configured Storm to run the bolt GreenBolt with an initial number of two executors and four associated tasks, so Storm will run two tasks per executor (thread). If you do not explicitly configure the number of tasks, Storm will by default run one task per executor.

Example of a running topology

The following illustration shows what a simple topology looks like in operation. The topology consists of three components: one spout called BlueSpout and two bolts called GreenBolt and YellowBolt. The components are linked such that BlueSpout sends its output to GreenBolt, which in turn sends its own output to YellowBolt.

The GreenBolt was configured as per the code snippet above whereas BlueSpout and YellowBolt only set the parallelism hint (number of executors). Here is the relevant code:

Config conf = new Config();
conf.setNumWorkers(2); // use two worker processes

TopologyBuilder topologyBuilder = new TopologyBuilder();

topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // set parallelism hint to 2

topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("blue-spout");

topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
               .shuffleGrouping("green-bolt");

StormSubmitter.submitTopology(
        "mytopology",
        conf,
        topologyBuilder.createTopology()
    );

With these settings the topology runs with 2 + 2 + 6 = 10 executors (and 2 + 4 + 6 = 12 tasks) spread across the two worker processes, i.e. five executors per worker under the default scheduler.

And of course Storm comes with additional configuration settings to control the parallelism of a topology, including:

  • TOPOLOGY_MAX_TASK_PARALLELISM: This setting puts a ceiling on the number of executors that can be spawned for a single component. It is typically used during testing to limit the number of threads spawned when running a topology in local mode. You can set this option via e.g. Config#setMaxTaskParallelism(); see the sketch below.
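For example, a minimal local-mode sketch (assuming the usual LocalCluster test harness; package names vary between Storm versions):

Config conf = new Config();
conf.setMaxTaskParallelism(3); // cap every component at 3 executors while testing

LocalCluster cluster = new LocalCluster();
cluster.submitTopology("mytopology", conf, topologyBuilder.createTopology());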

How to change the parallelism of a running topology

A nifty feature of Storm is that you can increase or decrease the number of worker processes and/or executors without being required to restart the cluster or the topology. The act of doing so is called rebalancing.

You have two options to rebalance a topology:

  1. Use the Storm web UI to rebalance the topology.
  2. Use the CLI tool storm rebalance as described below.

Here is an example of using the CLI tool:

# Reconfigure the topology "mytopology" to use 5 worker processes,
# the spout "blue-spout" to use 3 executors and
# the bolt "yellow-bolt" to use 10 executors.

$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
