Spark Streaming Configuration Parameters


Property Name


Default


Meaning


spark.streaming.backpressure.enabled


false


Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This lets Spark Streaming control the receiving rate based on the current batch scheduling delays and processing times, so that the system receives data only as fast as it can process it. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).


spark.streaming.backpressure.initialRate


not set


This is the initial maximum receiving rate at which each receiver will receive data for the first batch when the backpressure mechanism is enabled.


spark.streaming.blockInterval


200ms


Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark. The minimum recommended value is 50 ms. See the performance tuning section in the Spark Streaming programming guide for more details.


spark.streaming.receiver.maxRate


not set


Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit on the rate. See the deployment guide in the Spark Streaming programming guide for more details.


spark.streaming.receiver.writeAheadLog.enable


false


Enable write ahead logs for receivers. All the input data received through receivers will be saved to write ahead logs that will allow it to be recovered after driver failures. See the deployment guide in the Spark Streaming programming guide for more details.


spark.streaming.unpersist


true


Force RDDs generated and persisted by Spark Streaming to be automatically unpersisted from Spark's memory. The raw input data received by Spark Streaming is also automatically cleared. Setting this to false will allow the raw data and persisted RDDs to be accessible outside the streaming application, as they will not be cleared automatically, but this comes at the cost of higher memory usage in Spark.


spark.streaming.stopGracefullyOnShutdown


false


If true, Spark shuts down the StreamingContext gracefully on JVM shutdown rather than immediately.


spark.streaming.kafka.maxRatePerPartition


not set


Maximum rate (number of records per second) at which data will be read from each Kafka partition when using the new Kafka direct stream API. See the Kafka Integration guide for more details.


spark.streaming.kafka.maxRetries


1


Maximum number of consecutive retries the driver will make in order to find the latest offsets on the leader of each partition (a default value of 1 means that the driver will make a maximum of 2 attempts). Only applies to the new Kafka direct stream API.


spark.streaming.ui.retainedBatches


1000


How many batches the Spark Streaming UI and status APIs remember before garbage collecting.


spark.streaming.driver.writeAheadLog.closeFileAfterWrite


false


Whether to close the file after writing a write ahead log record on the driver. Set this to 'true' when you want to use S3 (or any file system that does not support flushing) for the metadata WAL on the driver.


spark.streaming.receiver.writeAheadLog.closeFileAfterWrite


false


Whether to close the file after writing a write ahead log record on the receivers. Set this to 'true' when you want to use S3 (or any file system that does not support flushing) for the data WAL on the receivers.
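
As a rough illustration of how the properties above might be passed to a job, the following sketch builds `--conf` arguments for spark-submit and estimates the per-batch record cap implied by spark.streaming.kafka.maxRatePerPartition. The property values, the partition count, the batch interval, and the helper names `to_submit_args` and `max_records_per_batch` are assumptions made for this example; they are not recommendations or part of Spark.

```python
# Hypothetical values for illustration only; tune them for your own workload.
streaming_conf = {
    "spark.streaming.backpressure.enabled": "true",
    "spark.streaming.backpressure.initialRate": "1000",
    "spark.streaming.kafka.maxRatePerPartition": "2000",
    "spark.streaming.stopGracefullyOnShutdown": "true",
}


def to_submit_args(conf):
    """Turn a property dict into a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


def max_records_per_batch(max_rate_per_partition, num_partitions, batch_seconds):
    """Upper bound on the records in one batch of a Kafka direct stream.

    maxRatePerPartition is expressed in records per second per partition,
    so the cap for one batch is rate * partitions * batch interval (seconds).
    """
    return max_rate_per_partition * num_partitions * batch_seconds
```

For example, `to_submit_args(streaming_conf)` yields a list that can be appended to a spark-submit command line, and `max_records_per_batch(2000, 8, 5)` evaluates to 80000, meaning a topic with 8 partitions and a 5-second batch interval would be capped at 80,000 records per batch under these assumed settings. When backpressure is enabled, the actual rate may be lower than this cap.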

Date: 2024-08-17 05:09:02
