Spark高级

Spark java.lang.OutOfMemoryError: Java heap space

My cluster: 1 master, 11 slaves, each node has 6 GB memory.

My settings:
spark.executor.memory=4g, Dspark.akka.frameSize=512
Here is the problem:

First, I read some data (2.19 GB) from HDFS to RDD:
val imageBundleRDD = sc.newAPIHadoopFile(...)
Second, do something on this RDD:

val res = imageBundleRDD.map(data => {
val desPoints = threeDReconstruction(data._2, bg)
(data._1, desPoints)
})
Last, output to HDFS:

res.saveAsNewAPIHadoopFile(...)
When I run my program it shows:

.....
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Starting task 1.0:24 as TID 33 on executor 9: Salve7.Hadoop (NODE_LOCAL)
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Serialized task 1.0:24 as 30618515 bytes in 210 ms
14/01/15 21:42:27 INFO cluster.ClusterTaskSetManager: Starting task 1.0:36 as TID 34 on executor 2: Salve11.Hadoop (NODE_LOCAL)
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Serialized task 1.0:36 as 30618515 bytes in 449 ms
14/01/15 21:42:28 INFO cluster.ClusterTaskSetManager: Starting task 1.0:32 as TID 35 on executor 7: Salve4.Hadoop (NODE_LOCAL)
Uncaught error from thread [spark-akka.actor.default-dispatcher-3] shutting down JVM since ‘akka.jvm-exit-on-fatal-error‘ is enabled for ActorSystem[spark]

I have a few suggestions:

  • If your nodes are configured to have 6g maximum for Spark (and are leaving a little for other processes), then use 6g rather than 4g, spark.executor.memory=6g. Make sure you‘re using as much memory as possible by checking the UI (it will say how much mem you‘re using)
  • Try using more partitions, you should have 2 - 4 per CPU. IME increasing the number of partitions is often the easiest way to make a program more stable (and often faster). For huge amounts of data you may need way more than 4 per CPU, I‘ve had to use 8000 partitions in some cases!
  • Decrease the fraction of memory reserved for caching, using spark.storage.memoryFraction. If you don‘t use cache() or persist in your code, this might as well be 0. It‘s default is 0.6, which means you only get 0.4 * 4g memory for your heap. IME reducing the mem frac often makes OOMs go away. UPDATE: From spark 1.6 apparently we will no longer need to play with these values, spark will determine them automatically.
  • Similar to above but shuffle memory fraction. If your job doesn‘t need much shuffle memory then set it to a lower value (this might cause your shuffles to spill to disk which can have catastrophic impact on speed). Sometimes when it‘s a shuffle operation that‘s OOMing you need to do the opposite i.e. set it to something large, like 0.8, or make sure you allow your shuffles to spill to disk (it‘s the default since 1.0.0).
  • Watch out for memory leaks, these are often caused by accidentally closing over objects you don‘t need in your lambdas. The way to diagnose is to look out for the "task serialized as XXX bytes" in the logs, if XXX is larger than a few k or more than an MB, you may have a memory leak. See http://stackoverflow.com/a/25270600/1586965
  • Related to above; use broadcast variables if you really do need large objects.
  • If you are caching large RDDs and can sacrifice some access time consider serialising the RDDhttp://spark.apache.org/docs/latest/tuning.html#serialized-rdd-storage. Or even caching them on disk (which sometimes isn‘t that bad if using SSDs).
  • (Advanced) Related to above, avoid String and heavily nested structures (like Map and nested case classes). If possible try to only use primitive types and index all non-primitives especially if you expect a lot of duplicates. Choose WrappedArray over nested structures whenever possible. Or even roll out your own serialisation - YOU will have the most information regarding how to efficiently back your data into bytes, USE IT!
  • (bit hacky) Again when caching, consider using a Dataset to cache your structure as it will use more efficient serialisation. This should be regarded as a hack when compared to the previous bullet point. Building your domain knowledge into your algo/serialisation can minimise memory/cache-space by 100x or 1000x, whereas all a Dataset will likely give is 2x - 5x in memory and 10x compressed (parquet) on disk.

http://spark.apache.org/docs/1.2.1/configuration.html

EDIT: (So I can google myself easier) The following is also indicative of this problem:

java.lang.OutOfMemoryError : GC overhead limit exceeded

Answer2:

Have a look at the start up scripts a Java heap size is set there, it looks like you‘re not setting this before running Spark worker.

# Set SPARK_MEM if it isn‘t already set since we also use it for this process
SPARK_MEM=${SPARK_MEM:-512m}
export SPARK_MEM

# Set JAVA_OPTS to be able to load native libraries and to set heap size
JAVA_OPTS="$OUR_JAVA_OPTS"
JAVA_OPTS="$JAVA_OPTS -Djava.library.path=$SPARK_LIBRARY_PATH"
JAVA_OPTS="$JAVA_OPTS -Xms$SPARK_MEM -Xmx$SPARK_MEM"

You can find the documentation to deploy scripts here.

 
时间: 2024-11-08 12:52:21

Spark高级的相关文章

《Spark高级数据分析》pdf格式下载免费电子书下载

<Spark高级数据分析>pdf格式下载免费电子书下载https://u253469.ctfile.com/fs/253469-300325651 更多电子书下载: http://hadoopall.com/book 内容简介 本书是使用Spark进行大规模数据分析的实战宝典,由著名大数据公司Cloudera的数据科学家撰写.四位作者首先结合数据科学和大数据分析的广阔背景讲解了Spark,然后介绍了用Spark和Scala进行数据处理的基础知识,接着讨论了如何将Spark用于机器学习,同时介绍

Spark高级数据分析-第2章 用Scala和Spark进行数据分析

2.4 小试牛刀:Spark shell和SparkContext 本章使用的资料来自加州大学欧文分校机器学习资料库(UC Irvine Machine Learning Repository),这个资料库为研究和教学提供了大量非常好的数据源, 这些数据源非常有意义,并且是免费的.由于网络原因,无法从原始地址下载数据集,这里可以从以下链接获取: https://pan.baidu.com/s/1dENp41V 或 http://pan.baidu.com/s/1c29fBVy 获取数据集以后,可

spark高级排序彻底解秘

排序,真的非常重要! RDD.scala(源码) 在其,没有罗列排序,不是说它不重要! 1.基础排序算法实战 2.二次排序算法实战 3.更高级别排序算法 4.更高级别排序算法 1.基础排序算法实战 启动hdfs集群 [email protected]:/usr/local/hadoop/hadoop-2.6.0$ sbin/start-dfs.sh 启动spark集群 [email protected]:/usr/local/spark/spark-1.5.2-bin-hadoop2.6$ sb

第19课:Spark高级排序彻底解密

本节课内容: 1.基础排序算法实战 2.二次排序算法实战 3.更高级别排序算法 4.排序算法内幕解密 排序在Spark运用程序中使用的比较多,且维度也不一样,如二次排序,三次排序等,在机器学习算法中经常碰到,所以非常重要,必须掌握! 所谓二次排序,就是根据两列值进行排序,如下测试数据: 2 3 4 1 3 2 4 3 8 7 2 1 经过二次排序后的结果(升序): 2 1 2 3 3 2 4 1 4 3 8 7 在编写二次排序代码前,先简单的写下单个key排序的代码: val conf = ne

Spark高级排序彻底解密(DT大数据梦工厂)

内容: 1.基础排序算法实战: 2.二次排序算法实战: 3.更高局级别排序算法: 4.排序算法内幕解密: 为啥讲排序?因为在应用的时候都有排序要求. 海量数据经常排序之后要我们想要的内容. ==========基础排序算法============ scala> sc.setLogLevel("WARN") scala> val x = sc.textFile("/historyserverforSpark/README.md", 3).flatMap(_

Spark高级排序与TopN问题揭密

[TOC] 引入 前面进行过wordcount的单词统计例子,关键是,如何对统计的单词按照单词个数来进行排序? 如下: scala> val retRDD = sc.textFile("hdfs://ns1/hello").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_) scala> val retSortRDD = retRDD.map(pair => (pair._2, pair._1)).

Spark大型项目实战:电商用户行为分析大数据平台

本项目主要讲解了一套应用于互联网电商企业中,使用Java.Spark等技术开发的大数据统计分析平台,对电商网站的各种用户行为(访问行为.页面跳转行为.购物行为.广告点击行为等)进行复杂的分析.用统计分析出来的数据,辅助公司中的PM(产品经理).数据分析师以及管理人员分析现有产品的情况,并根据用户行为分析结果持续改进产品的设计,以及调整公司的战略和业务.最终达到用大数据技术来帮助提升公司的业绩.营业额以及市场占有率的目标. 1.课程研发环境 开发工具: Eclipse Linux:CentOS 6

为什么spark中只有ALS

WRMF is like the classic rock of implicit matrix factorization. It may not be the trendiest, but it will never go out of style --Ethan Rosenthal 前言 spark平台推出至今已经地带到2.1的版本了,很多地方都有了重要的更新,加入了很多新的东西.但是在协同过滤这一块却一直以来都只有ALS一种算法.同样是大规模计算平台,Hadoop中的机器学习算法库Mah

王家林 大数据Spark超经典视频链接全集[转]

压缩过的大数据Spark蘑菇云行动前置课程视频百度云分享链接 链接:http://pan.baidu.com/s/1cFqjQu SCALA专辑 Scala深入浅出经典视频 链接:http://pan.baidu.com/s/1i4Gh3Xb 密码:25jc DT大数据梦工厂大数据spark蘑菇云Scala语言全集(持续更新中) http://www.tudou.com/plcover/rd3LTMjBpZA/ 1 Spark视频王家林第1课:大数据时代的“黄金”语言Scala 2 Spark视