[Storm] Understanding Parallelism

Relationship between tasks and executors

Q1. However, I'm a bit confused by the concept of "task". Is a task a running instance of the component (spout or bolt)? An executor having multiple tasks actually means that the same component is executed multiple times by the executor, am I correct?

A1: Yes, and yes

A task is simply an instance of a component (spout or bolt). While it runs, the executor thread calls the task's nextTuple method (for a spout) or execute method (for a bolt).

Q2. Moreover, in a general parallelism sense, Storm will spawn a dedicated thread (executor) for a spout or bolt, but what does an executor (thread) having multiple tasks contribute to the parallelism?

A2: Running more than one task per executor does not increase the level of parallelism -- an executor always has one thread that it uses for all of its tasks, which means that tasks run serially on an executor.

Running multiple tasks does not increase parallelism: an executor is just one thread, so it executes all of its tasks one after another.
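The serial behaviour described above can be modelled in a few lines of plain Java. This is only a sketch under stated assumptions: `SerialExecutorSketch`, `Task`, and `runOnSingleThread` are hypothetical names, not Storm APIs, and the round-robin routing of tuples to tasks is an illustrative choice. It shows that however many task instances an executor owns, every call lands on the same thread.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model of one executor; none of these names are Storm APIs.
class SerialExecutorSketch {

    // Stand-in for a bolt instance (a "task").
    interface Task {
        void execute(String tuple);
    }

    // Feeds the tuples to numTasks task instances from ONE thread and returns
    // the name of the thread that handled each call.
    static List<String> runOnSingleThread(int numTasks, List<String> tuples) {
        if (numTasks < 1) {
            throw new IllegalArgumentException("need at least one task");
        }
        List<String> threadNames = Collections.synchronizedList(new ArrayList<>());
        List<Task> tasks = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            // Each task only records which thread called it.
            tasks.add(tuple -> threadNames.add(Thread.currentThread().getName()));
        }
        ExecutorService executorThread = Executors.newSingleThreadExecutor();
        try {
            executorThread.submit(() -> {
                // Illustrative routing: tuple i goes to task i % numTasks.
                for (int i = 0; i < tuples.size(); i++) {
                    tasks.get(i % tasks.size()).execute(tuples.get(i));
                }
            }).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            executorThread.shutdown();
        }
        return threadNames;
    }

    public static void main(String[] args) {
        List<String> names = runOnSingleThread(3, Arrays.asList("a", "b", "c", "d", "e", "f"));
        System.out.println("calls handled: " + names.size());
        System.out.println("distinct threads: " + names.stream().distinct().count());
    }
}
```

Adding more tasks to this model changes how the tuples are spread over instances, but not how many of them are processed at the same time.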

  • The number of executor threads can be changed after the topology has been started (see storm rebalance command).
  • The number of tasks of a topology is static.

And by definition, there is the invariant of #executors <= #tasks.

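The number of tasks in a topology is fixed, but the number of executors (threads) can be changed at runtime. By definition, the number of executors never exceeds the number of tasks.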
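To make the invariant concrete, here is a minimal sketch of how a fixed set of tasks could be spread over a changing number of executors. The class `TaskAssignmentSketch` and its `assign` method are hypothetical, and the even, contiguous split is an assumption made for illustration; it is not claimed to be Storm's actual scheduling algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Storm.
class TaskAssignmentSketch {

    // Splits task ids 0..numTasks-1 into numExecutors contiguous groups
    // whose sizes differ by at most one.
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        if (numExecutors < 1 || numExecutors > numTasks) {
            throw new IllegalArgumentException("need 1 <= #executors <= #tasks");
        }
        List<List<Integer>> executors = new ArrayList<>();
        int base = numTasks / numExecutors;   // minimum tasks per executor
        int extra = numTasks % numExecutors;  // first 'extra' executors get one more
        int next = 0;
        for (int e = 0; e < numExecutors; e++) {
            int size = base + (e < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < size; i++) {
                tasks.add(next++);
            }
            executors.add(tasks);
        }
        return executors;
    }

    public static void main(String[] args) {
        // Same 4 tasks before and after a rebalance; only the executor count changes.
        System.out.println(assign(4, 2)); // [[0, 1], [2, 3]]
        System.out.println(assign(4, 4)); // [[0], [1], [2], [3]]
    }
}
```

Asking for more executors than tasks is rejected, since an executor with no task would have nothing to run.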

So one reason for having 2+ tasks per executor thread is to give you the flexibility to expand/scale up the topology through the storm rebalance command in the future without taking the topology offline. For instance, imagine you start out with a Storm cluster of 15 machines but already know that next week another 10 boxes will be added. Here you could opt for running the topology at the anticipated parallelism level of 25 machines already on the 15 initial boxes (which is, of course, slower than 25 boxes). Once the additional 10 boxes are integrated you can then storm rebalance the topology to make full use of all 25 boxes without any downtime.

Another reason to run 2+ tasks per executor is for (primarily functional) testing. For instance, if your dev machine or CI server is only powerful enough to run, say, 2 executors alongside all the other stuff running on the machine, you can still run 30 tasks (here: 15 per executor) to see whether code such as your custom Storm grouping is working as expected.

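An executor typically runs 2+ tasks in one of two situations:

  • To give the topology room to scale, so that its parallelism can be increased while it is running
  • For functional testing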
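The testing case works because a grouping decides which task receives a tuple, and that decision depends on the number of tasks, not on the number of executors. The sketch below uses a hypothetical hash-modulo rule (`chooseTask`) as a stand-in for custom grouping logic; it is not Storm's grouping interface, only an illustration of checking such a rule against 30 tasks on a small machine.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for custom grouping logic; not a Storm API.
class GroupingCheckSketch {

    // Maps a key to one of numTasks target task indices.
    static int chooseTask(String key, int numTasks) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Counts how many distinct tasks receive at least one of the keys
    // "key-0" .. "key-(numKeys-1)".
    static int tasksReached(int numKeys, int numTasks) {
        Set<Integer> reached = new HashSet<>();
        for (int i = 0; i < numKeys; i++) {
            reached.add(chooseTask("key-" + i, numTasks));
        }
        return reached.size();
    }

    public static void main(String[] args) {
        int numTasks = 30; // the task count from the example, whatever the executor count
        System.out.println("tasks reached: " + tasksReached(10_000, numTasks) + " of " + numTasks);
    }
}
```

The executor count never appears in the rule, which is why the same check is meaningful whether the 30 tasks run on 2 executors or on 30.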

In practice we normally run 1 task per executor.

PS: Note that Storm will actually spawn a few more threads behind the scenes. For instance, each executor has its own "send thread" that is responsible for handling outgoing tuples. There are also "system-level" background threads for e.g. acking tuples that run alongside "your" threads. IIRC the Storm UI counts those acking threads in addition to "your" threads.

In practice, we usually have #executors = #tasks.

Reference

http://stackoverflow.com/questions/17257448/what-is-the-task-in-storm-parallelism

http://www.cnblogs.com/yufengof/p/storm-worker-executor-task.html

http://storm.apache.org/releases/0.9.6/Understanding-the-parallelism-of-a-Storm-topology.html

Date: 2024-08-25 09:55:17
