spark streaming 4: InputDStream

InputDStream的继承关系。他们都是使用InputDStream这个抽象类的接口进行操作的。特别注意ReceiverInputDStream这个类,大部分时候我们使用的是它作为扩展的基类,因为它才能(更容易)使接收数据的工作分散到各个worker上执行,更符合分布式计算的理念。

所有的输入流都某个时间间隔将数据以block的形式保存到spark memory中,但以spark core不同的是,spark streaming默认是将对象序列化后保存到内存中。

/** * This is the abstract base class for all input streams. This class provides methods * start() and stop() which is called by Spark Streaming system to start and stop receiving data. * Input streams that can generate RDDs from new data by running a service/thread only on * the driver node (that is, without running a receiver on worker nodes), can be * implemented by directly inheriting this InputDStream. For example, * FileInputDStream, a subclass of InputDStream, monitors a HDFS directory from the driver for * new files and generates RDDs with the new files. For implementing input streams * that requires running a receiver on the worker nodes, use * [[org.apache.spark.streaming.dstream.ReceiverInputDStream]] as the parent class. * * @param ssc_ Streaming context that will execute this input stream */abstract class InputDStream[T: ClassTag] (@transient ssc_ : StreamingContext)  extends DStream[T](ssc_) {

private[streaming] var lastValidTime: Time = null

ssc.graph.addInputStream(this)
/** * Abstract class for defining any [[org.apache.spark.streaming.dstream.InputDStream]] * that has to start a receiver on worker nodes to receive external data. * Specific implementations of NetworkInputDStream must * define `the getReceiver()` function that gets the receiver object of type * [[org.apache.spark.streaming.receiver.Receiver]] that will be sent * to the workers to receive data. * @param ssc_ Streaming context that will execute this input stream * @tparam T Class type of the object of this stream */abstract class ReceiverInputDStream[T: ClassTag](@transient ssc_ : StreamingContext)  extends InputDStream[T](ssc_) {

/** Keeps all received blocks information */  private lazy val receivedBlockInfo = new HashMap[Time, Array[ReceivedBlockInfo]]

/** This is an unique identifier for the network input stream. */  val id = ssc.getNewReceiverStreamId()
/** * Gets the receiver object that will be sent to the worker nodes * to receive data. This method needs to defined by any specific implementation * of a NetworkInputDStream. */def getReceiver(): Receiver[T]

最终都是以BlockRDD返回的

/** Ask ReceiverInputTracker for received data blocks and generates RDDs with them. */override def compute(validTime: Time): Option[RDD[T]] = {  // If this is called for any time before the start time of the context,  // then this returns an empty RDD. This may happen when recovering from a  // master failure  if (validTime >= graph.startTime) {    val blockInfo = ssc.scheduler.receiverTracker.getReceivedBlockInfo(id)    receivedBlockInfo(validTime) = blockInfo    val blockIds = blockInfo.map(_.blockId.asInstanceOf[BlockId])    Some(new BlockRDD[T](ssc.sc, blockIds))  } else {    Some(new BlockRDD[T](ssc.sc, Array[BlockId]()))  }}

From WizNote

时间: 2024-10-11 05:14:52

spark streaming 4: InputDStream的相关文章

spark streaming kafka1.4.1中的低阶api createDirectStream使用总结

转载:http://blog.csdn.net/ligt0610/article/details/47311771 由于目前每天需要从kafka中消费20亿条左右的消息,集群压力有点大,会导致job不同程度的异常退出.原来使用spark1.1.0版本中的createStream函数,但是在数据处理速度跟不上数据消费速度且job异常退出的情况下,可能造成大量的数据丢失.幸好,Spark后续版本对这一情况有了很大的改进,1.2版本加入WAL特性,但是性能应该会受到一些影响(本人未测试),1.3版本可

spark streaming从指定offset处消费Kafka数据

spark streaming从指定offset处消费Kafka数据 2017-06-13 15:19 770人阅读 评论(2) 收藏 举报 分类: spark(5) 原文地址:http://blog.csdn.net/high2011/article/details/53706446 首先很感谢原文作者,看到这篇文章我少走了很多弯路,转载此文章是为了保留一份供复习用,请大家支持原作者,移步到上面的连接去看,谢谢 一.情景:当Spark streaming程序意外退出时,数据仍然再往Kafka中

spark streaming集成kafka

Kakfa起初是由LinkedIn公司开发的一个分布式的消息系统,后成为Apache的一部分,它使用Scala编写,以可水平扩展和高吞吐率而被广泛使用.目前越来越多的开源分布式处理系统如Cloudera.Apache Storm.Spark等都支持与Kafka集成. Spark streaming集成kafka是企业应用中最为常见的一种场景. 一.安装kafka 参考文档: http://kafka.apache.org/quickstart#quickstart_createtopic 1.安

Spark streaming + Kafka 流式数据处理,结果存储至MongoDB、Solr、Neo4j(自用)

KafkaStreaming.scala文件 import kafka.serializer.StringDecoder import org.apache.spark.SparkConf import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming.kafka.{KafkaManagerAdd, KafkaUtils} import org.json4s.Defau

spark streaming 接收kafka消息之一 -- 两种接收方式

源码分析的spark版本是1.6. 首先,先看一下 org.apache.spark.streaming.dstream.InputDStream 的 类说明: This is the abstract base class for all input streams. This class provides methods start() and stop() which is called by Spark Streaming system to start and stop receivi

第12课:Spark Streaming源码解读之Executor容错安全性

一.Spark Streaming 数据安全性的考虑: Spark Streaming不断的接收数据,并且不断的产生Job,不断的提交Job给集群运行.所以这就涉及到一个非常重要的问题数据安全性. Spark Streaming是基于Spark Core之上的,如果能够确保数据安全可好的话,在Spark Streaming生成Job的时候里面是基于RDD,即使运行的时候出现问题,那么Spark Streaming也可以借助Spark Core的容错机制自动容错. 对Executor容错主要是对数

第4课:Spark Streaming的Exactly Once的事务处理

本期内容: Exactly once 输出不重复 Exactly once 1,事务一定会被处理,且只被处理一次: 2,输出能够输出且只会被输出. Receiver:数据通过BlockManager写入内存+磁盘或者通过WAL来保证数据的安全性. WAL机制:写数据时先通过WAL写入文件系统然后存储的Executor(存储在内存和磁盘中,由StorageLevel设定),假设前面没有写成功后面一定不会存储在Executor,如不存在Executor中的话,汇报Driver数据一定不被处理.WAL

Spark Streaming源代码学习总结(一)

1.Spark Streaming 代码分析: 1.1 演示样例代码DEMO: 实时计算的WorldCount: import org.apache.spark.streaming.{Seconds, StreamingContext} import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.storage.StorageLevel object NetworkWordCount { def mai

Spark Streaming源码解读之Receiver在Driver的精妙实现全生命周期彻底研究和思考

一:Receiver启动的方式设想 1. Spark Streaming通过Receiver持续不断的从外部数据源接收数据,并把数据汇报给Driver端,由此每个Batch Durations就可以根据汇报的数据生成不同的Job. 2. Receiver属于Spark Streaming应用程序启动阶段,那么我们找Receiver在哪里启动就应该去找Spark Streaming的启动. 3. Receivers和InputDStreams是一一对应的,默认情况下一般只有一个Receiver.