【原创】大叔经验分享(39)spark cache unpersist级联操作

问题:spark中如果有两个DataFrame(或者DataSet),DataFrameA依赖DataFrameB,并且两个DataFrame都进行了cache,将DataFrameB unpersist之后,DataFrameA的cache也会失效,官方解释如下:

When invalidating a cache, we invalid other caches dependent on this cache to ensure cached data is up to date. For example, when the underlying table has been modified or the table has been dropped itself, all caches that use this table should be invalidated or refreshed.

However, in other cases, like when user simply want to drop a cache to free up memory, we do not need to invalidate dependent caches since no underlying data has been changed. For this reason, we would like to introduce a new cache invalidation mode: the non-cascading cache invalidation.

之前默认的模式为regular mode,这种模式下为了保证被cache数据是最新的(没有过期),会对cache的unpersist进行级联操作,即清空所有依赖(包括间接依赖)该cache的其他cache;
从spark2.4开始引入了一个新的模式:non-cascading mode,这个模式下不会对cache的unpersist进行级联操作;

DataFrame/DataSet的cache操作默认用的level是MEMORY_AND_DISK,除非手工指定MEMORY,并且确认内存足够,否则unpersist之前的cache看起来没有必要;

参考:
https://issues.apache.org/jira/browse/SPARK-21478
https://issues.apache.org/jira/browse/SPARK-24596
https://issues.apache.org/jira/browse/SPARK-21579

原文地址:https://www.cnblogs.com/barneywill/p/10524805.html

时间: 2024-11-05 22:32:12

【原创】大叔经验分享(39)spark cache unpersist级联操作的相关文章

【原创】经验分享(15)spark sql limit实现原理

之前讨论过hive中limit的实现,详见 https://www.cnblogs.com/barneywill/p/10109217.html下面看spark sql中limit的实现,首先看执行计划: spark-sql> explain select * from test1 limit 10;== Physical Plan ==CollectLimit 10+- HiveTableScan [id#35], MetastoreRelation temp, test1Time taken

【原创】大叔经验分享(23)hive metastore的几种部署方式

hive及其他组件(比如spark.impala等)都会依赖hive metastore,依赖的配置文件位于hive-site.xml hive metastore重要配置 hive.metastore.warehouse.dirhive2及之前版本默认为/user/hive/warehouse/,创建数据库或表时会在该目录下创建对应的目录 javax.jdo.option.ConnectionURLjavax.jdo.option.ConnectionDriverNamejavax.jdo.o

【原创】大叔经验分享(35)lzo格式支持

建表语句 CREATE EXTERNAL TABLE `my_lzo_table`(`something` string)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputForma

【原创】大叔经验分享(58)kudu写入压力大时报错

kudu写入压力大时报错 19/05/18 16:53:12 INFO AsyncKuduClient: Invalidating location fd52e4f930bc45458a8f29ed118785e3(server002:7050) for tablet 4259921cdcca4776b37771659a8cafb3: Service unavailable: Soft memory limit exceeded (at 106.05% of capacity). See htt

【原创】大叔经验分享(52)ClouderaManager修改配置报错

Cloudera Manager中修改配置可能报错: Incorrect string value: '\xE7\xA8\x8B\xE5\xBA\x8F...' for column 'MESSAGE' at row 1 这是一个mysql的字符集问题,极有可能创建scm数据库时使用默认的latin1编码导致,涉及的表为: CREATE TABLE `REVISIONS` ( `REVISION_ID` bigint(20) NOT NULL, `OPTIMISTIC_LOCK_VERSION`

【原创】大叔经验分享(53)kudu报错unable to find SASL plugin: PLAIN

kudu安装后运行不正常,master中找不到任何tserver,查看tserver日志发现有很多报错: Failed to heartbeat to master:7051: Invalid argument: Failed to ping master at master:7051: Client connection negotiation failed: client connection to master:7051: unable to find SASL plugin: PLAIN

【原创】大叔经验分享(55)hue导出行数限制

/opt/cloudera/parcels/CDH/lib/hue/apps/beeswax/src/beeswax/conf.py # Deprecated DOWNLOAD_CELL_LIMIT = Config( key='download_cell_limit', default=10000000, type=int, help=_t('A limit to the number of cells (rows * columns) that can be downloaded from

【原创】大叔经验分享(57)hue启动coordinator时报错

hue启动coordinator时报错,页面返回undefinied错误框: 后台日志报错: runcpserver.log [13/May/2019 04:34:55 -0700] middleware INFO Processing exception: 'NoneType' object has no attribute 'is_superuser': Traceback (most recent call last): File "/opt/cloudera/parcels/CDH-5.

【原创】大叔经验分享(61)kudu rebalance报错

kudu rebalance命令报错 terminate called after throwing an instance of 'std::regex_error' what(): regex_error *** Aborted at 1558779043 (unix time) try "date -d @1558779043" if you are using GNU date *** PC: @ 0x7ff0d6cf9207 __GI_raise *** SIGABRT (@