Storm Terminology Explained

This article introduces the key concepts of Storm. (The translation is adapted from Xu Mingming's blog.)



This page lists the main concepts of Storm and links to resources where you can find more information. The concepts discussed are:

  1. Topologies
  2. Streams
  3. Spouts
  4. Bolts
  5. Stream groupings
  6. Reliability
  7. Tasks
  8. Workers

1. Topologies

The logic for a realtime application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job. One key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (or until you kill it, of course). A topology is a graph of spouts and bolts that are connected with stream groupings. These concepts are described below.

In Storm, the logic of a real-time application is packaged into a topology. A Storm topology is the counterpart of a MapReduce job in Hadoop; the key difference is that a MapReduce job eventually finishes, whereas a topology keeps running until you explicitly kill it. A topology is a graph whose nodes are spouts and bolts and whose edges are stream groupings; these concepts are described below.
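
As a rough illustration (not part of the original article), here is a minimal wiring sketch. It assumes the pre-1.0 `backtype.storm` package names used by the article's API links (newer releases use `org.apache.storm`), and `RandomSentenceSpout`, `SplitSentenceBolt` and `WordCountBolt` are hypothetical components:

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        // A topology is a graph: spouts feed streams into bolts,
        // and stream groupings describe how each edge is partitioned.
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new RandomSentenceSpout());
        builder.setBolt("split", new SplitSentenceBolt())
               .shuffleGrouping("sentences");
        builder.setBolt("count", new WordCountBolt())
               .fieldsGrouping("split", new Fields("word"));

        // Runs in-process for testing; StormSubmitter.submitTopology would
        // submit the same graph to a cluster, where it runs until killed.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
    }
}
```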



2. Streams

The stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is processed and created in parallel in a distributed fashion. Streams are defined with a schema that names the fields in the stream’s tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used natively within tuples.

The stream is the most important abstraction in Storm. A stream is an unbounded sequence of tuples that is created and processed in parallel in a distributed fashion. A stream is defined mainly by the schema of its tuples: every field in a tuple is given a name, and the corresponding fields of different tuples in the same stream must have the same type. In other words, the first field of every tuple must share one type and the second field another, but the first and second fields may have different types from each other. By default, tuple fields can be integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays; you can also use custom types as long as you implement a serializer for them.

Every stream is given an id when declared. Since single-stream spouts and bolts are so common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an id. In this case, the stream is given the default id of “default”.

Every stream is given an id when it is declared. Because spouts and bolts that emit just a single stream are so common, OutputFieldsDeclarer provides convenience methods for declaring a stream without specifying an id; in that case the stream gets the default id "default".
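
A small, hypothetical sketch of how such a declaration might look (the field names are invented; only the declaration method of a spout is shown):

```java
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;

// Fragment of a hypothetical spout: only declareOutputFields is shown.
public abstract class UserTweetSpout extends BaseRichSpout {
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Declares the schema of the single, implicit output stream:
        // every tuple carries a "user-id" field and a "tweet" field.
        declarer.declare(new Fields("user-id", "tweet"));
        // Equivalent explicit form, naming the stream id "default":
        // declarer.declareStream("default", new Fields("user-id", "tweet"));
    }
}
```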



3. Spouts

A spout is a source of streams in a topology. Generally spouts will read tuples from an external source and emit them into the topology (e.g. a Kestrel queue or the Twitter API). Spouts can either be reliable or unreliable. A reliable spout is capable of replaying a tuple if it failed to be processed by Storm, whereas an unreliable spout forgets about the tuple as soon as it is emitted.

A spout is the source of messages in a topology. Generally a spout reads data from an external source (e.g. a Kestrel queue or the Twitter API) and emits it into the topology as tuples. Spouts can be reliable or unreliable: a reliable spout can re-emit a tuple if Storm fails to process it, whereas an unreliable spout forgets a tuple as soon as it has been emitted and can never replay it.

Spouts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on SpoutOutputCollector.

A spout can emit more than one stream. To do so, declare the streams with OutputFieldsDeclarer.declareStream, then pass the stream id to SpoutOutputCollector.emit when emitting.
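
A hedged sketch of a spout that declares and emits two streams (the stream and field names are invented, and the `backtype.storm` packages are assumed as above):

```java
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class TwoStreamSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Two named streams, each with its own schema.
        declarer.declareStream("events", new Fields("payload"));
        declarer.declareStream("heartbeats", new Fields("timestamp"));
    }

    public void nextTuple() {
        // The first argument of emit picks which declared stream receives the tuple.
        collector.emit("events", new Values("some-event"));
        collector.emit("heartbeats", new Values(System.currentTimeMillis()));
    }
}
```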

The main method on spouts is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there are no new tuples to emit. It is imperative that nextTuple does not block for any spout implementation, because Storm calls all the spout methods on the same thread.

The most important method in a spout is nextTuple: it either emits a new tuple into the topology or simply returns if there is nothing to emit. Note that no spout implementation may block inside nextTuple, because Storm calls all of a spout's methods on the same thread.

The other main methods on spouts are ack and fail. These are called when Storm detects that a tuple emitted from the spout either successfully completed through the topology or failed to be completed. ack and fail are only called for reliable spouts. See the Javadoc for more information.

The other two important spout methods are ack and fail. Storm calls ack when it detects that a tuple emitted by the spout has been successfully processed by the whole topology, and fail when processing fails. ack and fail are only called for reliable spouts.
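
To make these ideas concrete, here is a hedged sketch of a reliable spout (the external source and all names are hypothetical): it emits with a message id so Storm can track the tuple, forgets it on ack, and replays it on fail.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class ReliableSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    // Pending messages keyed by message id, so failed tuples can be replayed.
    private final Map<Object, String> pending = new ConcurrentHashMap<Object, String>();

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }

    public void nextTuple() {
        String sentence = fetchFromExternalSource();   // hypothetical helper
        if (sentence == null) {
            return;                                    // nothing to emit; never block here
        }
        Object msgId = UUID.randomUUID().toString();
        pending.put(msgId, sentence);
        // Emitting with a message id makes the tuple trackable (reliable).
        collector.emit(new Values(sentence), msgId);
    }

    public void ack(Object msgId) {
        pending.remove(msgId);                         // fully processed, forget it
    }

    public void fail(Object msgId) {
        String sentence = pending.get(msgId);
        if (sentence != null) {
            collector.emit(new Values(sentence), msgId); // replay the failed tuple
        }
    }

    private String fetchFromExternalSource() {
        return null;  // placeholder for e.g. a queue poll
    }
}
```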



4. Bolts

All processing in topologies is done in bolts. Bolts can do anything from filtering, functions, aggregations, joins, talking to databases, and more.

All message-processing logic is encapsulated in bolts. Bolts can do many things: filtering, aggregations, joins, querying databases, and so on.

Bolts can do simple stream transformations. Doing complex stream transformations often requires multiple steps and thus multiple bolts. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to do a rolling count of retweets for each image, and one or more bolts to stream out the top X images (you can do this particular stream transformation in a more scalable way with three bolts than with two).

A bolt can do a simple stream transformation on its own; complex transformations usually take several steps and therefore several bolts. For example, turning a stream of tweets into a stream of trending images takes at least two steps: one bolt keeps a rolling count of retweets per image, and one or more bolts stream out the top X images (doing this particular transformation with three bolts instead of two makes it more scalable).

Bolts can emit more than one stream. To do so, declare multiple streams using the declareStream method of OutputFieldsDeclarer and specify the stream to emit to when using the emit method on OutputCollector.

A bolt can also emit more than one stream: declare the streams with OutputFieldsDeclarer.declareStream and choose which stream to emit to with OutputCollector.emit.

When you declare a bolt’s input streams, you always subscribe to specific streams of another component. If you want to subscribe to all the streams of another component, you have to subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared on the default stream id. Saying declarer.shuffleGrouping("1") subscribes to the default stream on component “1” and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).

The main method in bolts is the execute method which takes in as input a new tuple. Bolts emit new tuples using the OutputCollector object. Bolts must call the ack method on the OutputCollector for every tuple they process so that Storm knows when tuples are completed (and can eventually determine that it's safe to ack the original spout tuples). For the common case of processing an input tuple, emitting 0 or more tuples based on that tuple, and then acking the input tuple, Storm provides an IBasicBolt interface which does the acking automatically.

It's perfectly fine to launch new threads in bolts that do processing asynchronously. OutputCollector is thread-safe and can be called at any time.

The main method on a bolt is execute, which takes a tuple as input. A bolt emits new tuples with the OutputCollector object and must call OutputCollector.ack for every tuple it processes, so that Storm knows when the tuple has been completed and can eventually notify the spout that emitted the original tuple. For the common pattern of reading an input tuple, emitting zero or more tuples based on it and then acking it, Storm provides the IBasicBolt interface, which does the acking automatically.
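
As a hedged illustration of the IBasicBolt pattern (the class and field names are invented), a bolt that extends BaseBasicBolt gets anchoring and acking for free:

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt that splits a sentence into words. Because it extends
// BaseBasicBolt (an IBasicBolt), emitted tuples are anchored to the input
// and the input tuple is acked automatically after execute returns.
public class SplitSentenceBolt extends BaseBasicBolt {
    public void execute(Tuple input, BasicOutputCollector collector) {
        String sentence = input.getStringByField("sentence");
        for (String word : sentence.split("\\s+")) {
            collector.emit(new Values(word));
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```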



5. Stream groupings

Part of defining a topology is specifying for each bolt which streams it should receive as input. A stream grouping defines how that stream should be partitioned among the bolt’s tasks.

Part of defining a topology is specifying, for each bolt, which streams it should receive as input. A stream grouping defines how a stream should be partitioned among the bolt's tasks.

There are seven built-in stream groupings in Storm, and you can implement a custom stream grouping by implementing the CustomStreamGrouping interface:

Storm has seven built-in types of stream grouping, and you can implement your own via the CustomStreamGrouping interface (a short wiring sketch follows the lists below):

  1. Shuffle grouping: Tuples are randomly distributed across the bolt’s tasks in a way such that each bolt is guaranteed to get an equal number of tuples.
  2. Fields grouping: The stream is partitioned by the fields specified in the grouping. For example, if the stream is grouped by the “user-id” field, tuples with the same “user-id” will always go to the same task, but tuples with different “user-id”’s may go to different tasks.
  3. All grouping: The stream is replicated across all the bolt’s tasks. Use this grouping with care.
  4. Global grouping: The entire stream goes to a single one of the bolt’s tasks. Specifically, it goes to the task with the lowest id.
  5. None grouping: This grouping specifies that you don’t care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings. Eventually though, Storm will push down bolts with none groupings to execute in the same thread as the bolt or spout they subscribe from (when possible).
  6. Direct grouping: This is a special kind of grouping. A stream grouped this way means that the producer of the tuple decides which task of the consumer will receive this tuple. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples emitted to a direct stream must be emitted using one of the [emitDirect](/apidocs/backtype/storm/task/OutputCollector.html#emitDirect(int, int, java.util.List)) methods. A bolt can get the task ids of its consumers by either using the provided TopologyContext or by keeping track of the output of the emit method in OutputCollector (which returns the task ids that the tuple was sent to).
  7. Local or shuffle grouping: If the target bolt has one or more tasks in the same worker process, tuples will be shuffled to just those in-process tasks. Otherwise, this acts like a normal shuffle grouping.
  1. Shuffle grouping: tuples are distributed randomly across the bolt's tasks, so that each task receives roughly the same number of tuples.
  2. Fields grouping: the stream is partitioned by the specified fields. For example, if grouped by "user-id", tuples with the same user-id always go to the same task, while tuples with different user-ids may go to different tasks.
  3. All grouping: the stream is broadcast; every task of the bolt receives every tuple. Use this grouping with care.
  4. Global grouping: the entire stream goes to a single one of the bolt's tasks, specifically the task with the lowest id.
  5. None grouping: you declare that you don't care how the stream is grouped. Currently this behaves exactly like a shuffle grouping; eventually Storm will push bolts with none groupings down to run in the same thread as the bolt or spout they subscribe to, when possible.
  6. Direct grouping: a special grouping in which the producer of a tuple decides which task of the consumer receives it. It can only be declared on streams that have been declared as direct streams, and tuples sent to such a stream must be emitted with one of the emitDirect methods. A bolt can get the task ids of its consumers either from the provided TopologyContext or from the return value of OutputCollector.emit, which is the list of task ids the tuple was sent to.
  7. Local or shuffle grouping: if the target bolt has one or more tasks in the same worker process, tuples are shuffled only to those in-process tasks; otherwise this behaves like a normal shuffle grouping.
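
The wiring sketch referenced above: a hypothetical topology that uses several of these groupings (all component and class names, and the "user-id" field, are invented):

```java
import backtype.storm.generated.StormTopology;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class GroupingExamples {
    public static StormTopology build() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("tweets", new TweetSpout());

        // Same "user-id" always goes to the same counter task.
        builder.setBolt("counter", new CountBolt())
               .fieldsGrouping("tweets", new Fields("user-id"));

        // Every metrics task sees every tuple (broadcast).
        builder.setBolt("metrics", new MetricsBolt())
               .allGrouping("tweets");

        // The entire counter stream goes to a single merger task.
        builder.setBolt("merger", new MergeBolt())
               .globalGrouping("counter");

        return builder.createTopology();
    }
}
```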

Resources:

  • TopologyBuilder: use this class to define topologies
  • InputDeclarer: this object is returned whenever setBolt is called on TopologyBuilder and is used for declaring a bolt’s input streams and how those streams should be grouped
  • CoordinatedBolt: this bolt is useful for distributed RPC topologies and makes heavy use of direct streams and direct groupings

6. Reliability

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by every spout tuple and determining when that tree of tuples has been successfully completed. Every topology has a “message timeout” associated with it. If Storm fails to detect that a spout tuple has been completed within that timeout, then it fails the tuple and replays it later.

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by each spout tuple (a bolt that processes a tuple may emit further tuples, so the tuples form a tree) and determining when that tree has been successfully processed. Every topology has a message timeout; if Storm cannot detect within that timeout that a spout tuple's tree has completed, it marks the tuple as failed and replays it later.

To take advantage of Storm’s reliability capabilities, you must tell Storm when new edges in a tuple tree are being created and tell Storm whenever you’ve finished processing an individual tuple. These are done using the OutputCollector object that bolts use to emit tuples. Anchoring is done in the emit method, and you declare that you’re finished with a tuple using the ack method.

To take advantage of Storm's reliability features, you must tell Storm when new edges are added to a tuple tree and when you have finished processing an individual tuple. Both are done through the OutputCollector object that bolts use to emit tuples: anchoring happens in the emit method, and you declare that you are finished with a tuple by calling ack.
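
A hedged sketch of explicit anchoring and acking in a bolt (the class name and the trivial processing are invented):

```java
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class AnchoringBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        try {
            // Passing the input tuple as the first argument "anchors" the new
            // tuple, adding an edge to the tree rooted at the spout tuple.
            collector.emit(input, new Values(process(input)));
            collector.ack(input);   // tell Storm this input is fully handled
        } catch (Exception e) {
            collector.fail(input);  // trigger a replay from a reliable spout
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("result"));
    }

    private Object process(Tuple input) {
        return input.getValue(0);   // placeholder transformation
    }
}
```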

This is all explained in much more detail in Guaranteeing message processing.


7. Tasks

Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how to send tuples from one set of tasks to another set of tasks. You set the parallelism for each spout or bolt in the setSpout and setBolt methods of TopologyBuilder.

Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how tuples are sent from one set of tasks to another. You set the parallelism (the number of tasks) of each spout or bolt with TopologyBuilder.setSpout() and TopologyBuilder.setBolt().
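
A short, hypothetical sketch of setting parallelism (MySpout and MyBolt are invented); the numeric argument is the parallelism hint for the component:

```java
import backtype.storm.generated.StormTopology;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class ParallelismExample {
    public static StormTopology build() {
        TopologyBuilder builder = new TopologyBuilder();
        // 2 tasks read from the external source in parallel...
        builder.setSpout("source", new MySpout(), 2);
        // ...and 8 tasks process the resulting stream, partitioned by "key".
        builder.setBolt("worker", new MyBolt(), 8)
               .fieldsGrouping("source", new Fields("key"));
        return builder.createTopology();
    }
}
```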


8. Workers

Topologies execute across one or more worker processes. Each worker process is a physical JVM and executes a subset of all the tasks for the topology. For example, if the combined parallelism of the topology is 300 and 50 workers are allocated, then each worker will execute 6 tasks (as threads within the worker). Storm tries to spread the tasks evenly across all the workers.

A topology executes across one or more worker processes; each worker process is a JVM that runs a subset of the topology's tasks. For example, for a topology with a combined parallelism of 300 executed by 50 worker processes, each worker runs 6 of the tasks (as 6 threads inside that worker). Storm tries to spread the tasks evenly across all the workers.
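
A hedged sketch of requesting worker processes at submission time (the topology name is invented and the topology reuses the hypothetical builder shown earlier):

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;

public class SubmitWithWorkers {
    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        // Ask the cluster for 50 worker JVMs; with 300 tasks in total,
        // Storm spreads roughly 6 task threads into each worker.
        conf.setNumWorkers(50);
        StormSubmitter.submitTopology("my-topology", conf,
                GroupingExamples.build());  // hypothetical topology from above
    }
}
```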



Note: Configuration

Storm has a large number of configuration parameters that adjust the behavior of nimbus, the supervisors, and running topologies. Some settings are system-level and some are topology-level. Every setting that has a default value gets that default from defaults.yaml; you can override these defaults by putting a storm.yaml on your classpath, and you can also set topology-specific configuration in code when submitting with StormSubmitter. The priority order is: defaults.yaml < storm.yaml < topology-specific configuration.
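
A small sketch of topology-specific configuration set in code (the setters shown are standard Config helpers; the topology is again the hypothetical one built earlier). Anything set here takes precedence over storm.yaml, which in turn overrides defaults.yaml:

```java
import backtype.storm.Config;
import backtype.storm.StormSubmitter;

public class SubmitWithConfig {
    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setDebug(true);                 // topology.debug
        conf.setNumWorkers(4);               // topology.workers
        conf.setMessageTimeoutSecs(60);      // topology.message.timeout.secs
        // Topology-specific settings passed here override storm.yaml on the
        // classpath, which itself overrides the built-in defaults.yaml.
        StormSubmitter.submitTopology("configured-topology", conf,
                GroupingExamples.build());
    }
}
```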
