Streaming Big Data: Storm, Spark and Samza--转载

原文地址:http://www.javacodegeeks.com/2015/02/streaming-big-data-storm-spark-samza.html

There are a number of distributed computation systems that can process Big Data in real time or near-real time. This article will start with a short description of three Apache frameworks, and attempt to provide a quick, high-level overview of some of their similarities and differences.

Apache Storm

In Storm, you design a graph of real-time computation called a topology, and feed it to the cluster where the master node will distribute the code among worker nodes to execute it. In a topology, data is passed around between spouts that emit data streams as immutable sets of key-value pairs called tuples, and bolts that transform those streams (count, filter etc.). Bolts themselves can optionally emit data to other bolts down the processing pipeline.

Apache Spark

Spark Streaming (an extension of the core Spark API) doesn’t process streams one at a time like Storm. Instead, it slices them in small batches of time intervals before processing them. The Spark abstraction for a continuous stream of data is called a DStream (for Discretized Stream). A DStream is a micro-batch of RDDs (Resilient Distributed Datasets). RDDs are distributed collections that can be operated in parallel by arbitrary functions and by transformations over a sliding window of data (windowed computations).

Apache Samza

Samza ’s approach to streaming is to process messages as they are received, one at a time. Samza’s stream primitive is not a tuple or a Dstream, but a message. Streams are divided into partitions and each partition is an ordered sequence of read-only messages with each message having a unique ID (offset). The system also supports batching, i.e. consuming several messages from the same stream partition in sequence. Samza`s Execution & Streaming modules are both pluggable, although Samza typically relies on Hadoop’s YARN (Yet Another Resource Negotiator) and Apache Kafka.

Common Ground

All three real-time computation systems are open-source, low-latencydistributed, scalable and fault-tolerant. They all allow you to run your stream processing code through parallel tasks distributed across a cluster of computing machines with fail-over capabilities. They also provide simple APIs to abstract the complexity of the underlying implementations.

The three frameworks use different vocabularies for similar concepts:

Comparison Matrix

A few of the differences are summarized in the table below:

There are three general categories of delivery patterns:

  1. At-most-once: messages may be lost. This is usually the least desirable outcome.
  2. At-least-once: messages may be redelivered (no loss, but duplicates). This is good enough for many use cases.
  3. Exactly-once: each message is delivered once and only once (no loss, no duplicates). This is a desirable feature although difficult to guarantee in all cases.

Another aspect is state management. There are different strategies to store state. Spark Streaming writes data into the distributed file system (e.g. HDFS). Samza uses an embedded key-value store. With Storm, you’ll have to either roll your own state management at your application layer, or use a higher-level abstraction called Trident.

Use Cases

All three frameworks are particularly well-suited to efficiently process continuous, massive amounts of real-time data. So which one to use? There are no hard rules, at most a few general guidelines.

If you want a high-speed event processing system that allows for incremental computations, Storm would be fine for that. If you further need to run distributed computations on demand, while the client is waiting synchronously for the results, you’ll have Distributed RPC (DRPC) out-of-the-box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery though, you should look at the higher-level Trident API, which also offers micro-batching.

A few companies using Storm: Twitter, Yahoo!, Spotify, The Weather Channel...

Speaking of micro-batching, if you must have stateful computations, exactly-once delivery and don’t mind a higher latency, you could consider Spark Streaming…specially if you also plan for graph operations, machine learning or SQL access. The Apache Spark stack lets you combine several libraries with streaming (Spark SQLMLlibGraphX) and provides a convenient unifying programming model. In particular, streaming algorithms (e.g. streaming k-means) allow Spark to facilitate decisions in real-time.

A few companies using Spark: Amazon, Yahoo!, NASA JPL, eBay Inc., Baidu…

If you have a large amount of state to work with (e.g. many gigabytes per partition), Samza co-locates storage and processing on the same machines, allowing to work efficiently with state that won’t fit in memory. The framework also offers flexibility with its pluggable API: its default execution, messaging and storage engines can each be replaced with your choice of alternatives. Moreover, if you have a number of data processing stages from different teams with different codebases, Samza ‘s fine-grained jobs would be particularly well-suited, since they can be added/removed with minimal ripple effects.

A few companies using Samza: LinkedIn, Intuit, Metamarkets, Quantiply, Fortscale…

Conclusion

We only scratched the surface of The Three Apaches. We didn’t cover a number of other features and more subtle differences between these frameworks. Also, it’s important to keep in mind the limits of the above comparisons, as these systems are constantly evolving.

时间: 2024-10-25 08:20:15

Streaming Big Data: Storm, Spark and Samza--转载的相关文章

7.Spark Streaming(上)--Spark Streaming原理介绍

[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送--Spark入门实战系列>获取 1.Spark Streaming简介 1.1 概述 Spark Streaming 是Spark核心API的一个扩展,可以实现高吞吐量的.具备容错机制的实时流数据的处理.支持从多种数据源获取数据,包括Kafk.Flume.Twitter.ZeroMQ.Kinesis 以及TCP sockets,从数据源获取数据之后,可以使用诸如map.reduce.join和window等高级函数进行复杂算法的处

Conquer Big Data through Spark

Course Background: Apache Spark™ is a fast and general engine for large-scale data processing. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. You can run programs up to 100x faster than Hadoop MapRe

马化腾漫谈“流式大数据处理的三种框架:Storm,Spark和Samza”

Apache Storm 在Storm中,先要设计一个用于实时计算的图状结构,我们称之为拓扑(topology).这个拓扑将会被提交给集群,由集群中的主控节点(master node)分发代码,将任务分配给工作节点(worker node)执行.一个拓扑中包括spout和bolt两种角色,其中spout发送消息,负责将数据流以tuple元组的形式发送出去:而bolt则负责转换这些数据流,在bolt中可以完成计算.过滤等操作,bolt自身也可以随机将数据发送给其他bolt.由spout发射出的tu

流式大数据处理的三种框架:Storm,Spark和Samza

许多分布式计算系统都可以实时或接近实时地处理大数据流.本文将对三种Apache框架分别进行简单介绍,然后尝试快速.高度概述其异同. Apache Storm 在Storm中,先要设计一个用于实时计算的图状结构,我们称之为拓扑(topology).这个拓扑将会被提交给集群,由集群中的主控节点(master node)分发代码,将任务分配给工作节点(worker node)执行.一个拓扑中包括spout和bolt两种角色,其中spout发送消息,负责将数据流以tuple元组的形式发送出去:而bolt

处理大数据流常用的三种Apache框架:Storm、Spark和Samza。(主要介绍Storm)

处理实时的大数据流最常用的就是分布式计算系统,下面分别介绍Apache中处理大数据流的三大框架: Apache Storm     这是一个分布式实时大数据处理系统.Storm设计用于在容错和水平可扩展方法中处理大量数据.他是一个流数据框架,具有最高的社区率.虽然Storm是无状态的,它通过ApacheZooKeeper管理分布式环境和鸡群状态.使用起来非常简单,并且还支持并行地对实时数据执行各种操作. Apache Storm继续成为实时数据分析的领导者是因为它的易于操作和设置,并且它保证每个

Storm,Spark和Samza

http://www.csdn.net/article/2015-03-09/2824135 Apache Storm 在Storm中,先要设计一个用于实时计算的图状结构,我们称之为拓扑(topology).这个拓扑将会被提交给集群,由集群中的主控节点(master node)分发代码,将任务分配给工作节点(worker node)执行.一个拓扑中包括spout和bolt两种角色,其中spout发送消息,负责将数据流以tuple元组的形式发送出去:而bolt则负责转换这些数据流,在bolt中可以

Spark入门实战系列--7.Spark Streaming(下)--Spark Streaming实战

[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送–Spark入门实战系列>获取 1 实例演示 1.1 流数据模拟器 1.1.1 流数据说明 在实例演示中模拟实际情况,需要源源不断地接入流数据,为了在演示过程中更接近真实环境将定义流数据模拟器.该模拟器主要功能:通过Socket方式监听指定的端口号,当外部程序通过该端口连接并请求数据时,模拟器将定时将指定的文件数据随机获取发送给外部程序. 1.1.2 模拟器代码 import java.io.{PrintWriter} import

Spark Streaming、Kafka结合Spark JDBC External DataSouces处理案例

场景:使用Spark Streaming接收Kafka发送过来的数据与关系型数据库中的表进行相关的查询操作: Kafka发送过来的数据格式为:id.name.cityId,分隔符为tab 1 zhangsan 1 2 lisi 1 3 wangwu 2 4 zhaoliu 3 MySQL的表city结构为:id int, name varchar 1 bj 2 sz 3 sh 本案例的结果为:select s.id, s.name, s.cityId, c.name from student s

Data guard概念篇一(转载)

本文转载至以下链接,感谢作者分享! http://tech.it168.com/db/2008-02-14/200802141545840_1.shtml 一.Data Guard配置(Data Guard Configurations) Data Guard是一个集合,由一个primary数据库(生产数据库)及一个或多个standby数据库(最多9个)组成.组成Data Guard的数据库通过Oracle Net连接,并且有可能分布于不同地域.只要各库之间可以相互通信,它们的物理位置并没有什么