spark streaming 4: InputDStream

InputDStream的继承关系。他们都是使用InputDStream这个抽象类的接口进行操作的。特别注意ReceiverInputDStream这个类，大部分时候我们使用的是它作为扩展的基类，因为它才能（更容易）使接收数据的工作分散到各个worker上执行，更符合分布式计算的理念。

所有的输入流都某个时间间隔将数据以block的形式保存到spark memory中，但以spark core不同的是，spark streaming默认是将对象序列化后保存到内存中。

/** * This is the abstract base class for all input streams. This class provides methods * start() and stop() which is called by Spark Streaming system to start and stop receiving data. * Input streams that can generate RDDs from new data by running a service/thread only on * the driver node (that is, without running a receiver on worker nodes), can be * implemented by directly inheriting this InputDStream. For example, * FileInputDStream, a subclass of InputDStream, monitors a HDFS directory from the driver for * new files and generates RDDs with the new files. For implementing input streams * that requires running a receiver on the worker nodes, use * [[org.apache.spark.streaming.dstream.ReceiverInputDStream]] as the parent class. * * @param ssc_ Streaming context that will execute this input stream */abstract class InputDStream[T: ClassTag] (@transient ssc_ : StreamingContext)  extends DStream[T](ssc_) {

private[streaming] var lastValidTime: Time = null

ssc.graph.addInputStream(this)

/** * Abstract class for defining any [[org.apache.spark.streaming.dstream.InputDStream]] * that has to start a receiver on worker nodes to receive external data. * Specific implementations of NetworkInputDStream must * define `the getReceiver()` function that gets the receiver object of type * [[org.apache.spark.streaming.receiver.Receiver]] that will be sent * to the workers to receive data. * @param ssc_ Streaming context that will execute this input stream * @tparam T Class type of the object of this stream */abstract class ReceiverInputDStream[T: ClassTag](@transient ssc_ : StreamingContext)  extends InputDStream[T](ssc_) {

/** Keeps all received blocks information */  private lazy val receivedBlockInfo = new HashMap[Time, Array[ReceivedBlockInfo]]

/** This is an unique identifier for the network input stream. */  val id = ssc.getNewReceiverStreamId()

/** * Gets the receiver object that will be sent to the worker nodes * to receive data. This method needs to defined by any specific implementation * of a NetworkInputDStream. */def getReceiver(): Receiver[T]

最终都是以BlockRDD返回的

/** Ask ReceiverInputTracker for received data blocks and generates RDDs with them. */override def compute(validTime: Time): Option[RDD[T]] = {  // If this is called for any time before the start time of the context,  // then this returns an empty RDD. This may happen when recovering from a  // master failure  if (validTime >= graph.startTime) {    val blockInfo = ssc.scheduler.receiverTracker.getReceivedBlockInfo(id)    receivedBlockInfo(validTime) = blockInfo    val blockIds = blockInfo.map(_.blockId.asInstanceOf[BlockId])    Some(new BlockRDD[T](ssc.sc, blockIds))  } else {    Some(new BlockRDD[T](ssc.sc, Array[BlockId]()))  }}

From WizNote

时间： 2024-10-11 05:14:52

spark streaming 4: InputDStream的相关文章

spark streaming kafka1.4.1中的低阶api createDirectStream使用总结

转载:http://blog.csdn.net/ligt0610/article/details/47311771 由于目前每天需要从kafka中消费20亿条左右的消息,集群压力有点大,会导致job不同程度的异常退出.原来使用spark1.1.0版本中的createStream函数,但是在数据处理速度跟不上数据消费速度且job异常退出的情况下,可能造成大量的数据丢失.幸好,Spark后续版本对这一情况有了很大的改进,1.2版本加入WAL特性,但是性能应该会受到一些影响(本人未测试),1.3版本可

spark streaming从指定offset处消费Kafka数据

spark streaming从指定offset处消费Kafka数据 2017-06-13 15:19 770人阅读评论(2) 收藏举报分类: spark(5) 原文地址:http://blog.csdn.net/high2011/article/details/53706446 首先很感谢原文作者,看到这篇文章我少走了很多弯路,转载此文章是为了保留一份供复习用,请大家支持原作者,移步到上面的连接去看,谢谢一.情景:当Spark streaming程序意外退出时,数据仍然再往Kafka中

spark streaming集成kafka

Kakfa起初是由LinkedIn公司开发的一个分布式的消息系统,后成为Apache的一部分,它使用Scala编写,以可水平扩展和高吞吐率而被广泛使用.目前越来越多的开源分布式处理系统如Cloudera.Apache Storm.Spark等都支持与Kafka集成. Spark streaming集成kafka是企业应用中最为常见的一种场景. 一.安装kafka 参考文档: http://kafka.apache.org/quickstart#quickstart_createtopic 1.安

Spark streaming + Kafka 流式数据处理，结果存储至MongoDB、Solr、Neo4j（自用）

KafkaStreaming.scala文件 import kafka.serializer.StringDecoder import org.apache.spark.SparkConf import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming.kafka.{KafkaManagerAdd, KafkaUtils} import org.json4s.Defau

spark streaming 接收kafka消息之一 -- 两种接收方式

源码分析的spark版本是1.6. 首先,先看一下 org.apache.spark.streaming.dstream.InputDStream 的类说明: This is the abstract base class for all input streams. This class provides methods start() and stop() which is called by Spark Streaming system to start and stop receivi

第12课：Spark Streaming源码解读之Executor容错安全性

一.Spark Streaming 数据安全性的考虑: Spark Streaming不断的接收数据,并且不断的产生Job,不断的提交Job给集群运行.所以这就涉及到一个非常重要的问题数据安全性. Spark Streaming是基于Spark Core之上的,如果能够确保数据安全可好的话,在Spark Streaming生成Job的时候里面是基于RDD,即使运行的时候出现问题,那么Spark Streaming也可以借助Spark Core的容错机制自动容错. 对Executor容错主要是对数

第4课：Spark Streaming的Exactly Once的事务处理

本期内容: Exactly once 输出不重复 Exactly once 1,事务一定会被处理,且只被处理一次: 2,输出能够输出且只会被输出. Receiver:数据通过BlockManager写入内存+磁盘或者通过WAL来保证数据的安全性. WAL机制:写数据时先通过WAL写入文件系统然后存储的Executor(存储在内存和磁盘中,由StorageLevel设定),假设前面没有写成功后面一定不会存储在Executor,如不存在Executor中的话,汇报Driver数据一定不被处理.WAL

Spark Streaming源代码学习总结(一)

1.Spark Streaming 代码分析: 1.1 演示样例代码DEMO: 实时计算的WorldCount: import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.storage.StorageLevel object NetworkWordCount { def mai

Spark Streaming源码解读之Receiver在Driver的精妙实现全生命周期彻底研究和思考

一:Receiver启动的方式设想 1. Spark Streaming通过Receiver持续不断的从外部数据源接收数据,并把数据汇报给Driver端,由此每个Batch Durations就可以根据汇报的数据生成不同的Job. 2. Receiver属于Spark Streaming应用程序启动阶段,那么我们找Receiver在哪里启动就应该去找Spark Streaming的启动. 3. Receivers和InputDStreams是一一对应的,默认情况下一般只有一个Receiver.