Developing Spark in IDEA and Running It Locally

1. Create a project named sparkTest, and in it create a Scala object named Test.

2. The code for the Test object is as follows:

package sparkTest

/**
 * Created by jiahong on 15-8-2.
 */
import org.apache.spark.{SparkConf, SparkContext}

object Test {
  def main(args: Array[String]) {
    if (args.length < 1) {
      System.err.println("Usage: <file>")
      System.exit(1)
    }

    val conf = new SparkConf().setAppName("Test").setMaster("local")
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("/home/jiahong/sparkWorkSpace/input")

    // Count each word, then sort by count from high to low
    val result = rdd.flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .map(x => (x._2, x._1)) // swap to (count, word) so sortByKey sorts by count
      .sortByKey(false)       // false = descending
      .map(x => (x._2, x._1)) // swap back to (word, count)

    result.saveAsTextFile("/home/jiahong/sparkWorkSpace/output")

    // Note: this prints the RDD's toString (e.g. "MapPartitionsRDD[7] ..."),
    // not its contents; use result.collect().foreach(println) to see the data
    print(result)
  }
}
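One inconsistency worth noting: the Usage check above expects a command-line argument, yet the input and output paths are hardcoded. A minimal variant that actually takes both paths from Program arguments might look like this (a sketch, not the code the run below was produced with):

import org.apache.spark.{SparkConf, SparkContext}

object Test {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: <input> <output>")
      System.exit(1)
    }
    val sc = new SparkContext(new SparkConf().setAppName("Test").setMaster("local"))

    // same word count as above, but the paths come from the arguments
    val result = sc.textFile(args(0))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .map(x => (x._2, x._1))
      .sortByKey(false)
      .map(x => (x._2, x._1))

    result.saveAsTextFile(args(1))
    sc.stop()
  }
}

With this variant, Program arguments in step 4 would hold the two paths instead of "local".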

3. To configure the local run, open Edit Configurations from the drop-down in the upper-right corner of IDEA.

4. In the run configuration, enter -Dspark.master=local in VM options and local in Program arguments (with the hardcoded paths above, the argument's only purpose is to satisfy the args.length check).
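The VM option works because SparkConf loads spark.* Java system properties by default, so with -Dspark.master=local set, the setMaster call in the code is actually redundant; a sketch of the slimmer form:

// SparkConf picks up the spark.master system property set via VM options,
// so no explicit setMaster("local") is needed here
val conf = new SparkConf().setAppName("Test")
val sc = new SparkContext(conf)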

5. Click Run. Spark must be set up on the local machine first; as the launch command below shows, the project classpath references the local Spark assembly jar.

/usr/lib/jdk/jdk1.7.0_79/bin/java -Dspark.master=local -Didea.launcher.port=7532 -Didea.launcher.bin.path=/home/jiahong/idea-IC-141.1532.4/bin -Dfile.encoding=UTF-8 -classpath /usr/lib/jdk/jdk1.7.0_79/jre/lib/resources.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/jfxrt.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/charsets.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/jsse.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/rt.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/plugin.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/deploy.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/jfr.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/javaws.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/management-agent.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/jce.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/ext/zipfs.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/ext/dnsns.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/ext/sunec.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/ext/sunjce_provider.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/ext/sunpkcs11.jar:/usr/lib/jdk/jdk1.7.0_79/jre/lib/ext/localedata.jar:/home/jiahong/IdeaProjects/sparkTest/out/production/sparkTest:/home/jiahong/apache/spark-1.3.1-bin-hadoop2.6/lib/spark-assembly-1.3.1-hadoop2.6.0.jar:/home/jiahong/apache/scala-2.10.4/lib/scala-actors-migration.jar:/home/jiahong/apache/scala-2.10.4/lib/scala-reflect.jar:/home/jiahong/apache/scala-2.10.4/lib/scala-actors.jar:/home/jiahong/apache/scala-2.10.4/lib/scala-swing.jar:/home/jiahong/apache/scala-2.10.4/lib/scala-library.jar:/home/jiahong/idea-IC-141.1532.4/lib/idea_rt.jar com.intellij.rt.execution.application.AppMain sparkTest.Test local
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/02 10:58:14 INFO SparkContext: Running Spark version 1.3.1
15/08/02 10:58:14 WARN Utils: Your hostname, jiahong-OptiPlex-7010 resolves to a loopback address: 127.0.1.1; using 192.168.199.187 instead (on interface eth0)
15/08/02 10:58:14 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/08/02 10:58:14 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/08/02 10:58:15 INFO SecurityManager: Changing view acls to: jiahong
15/08/02 10:58:15 INFO SecurityManager: Changing modify acls to: jiahong
15/08/02 10:58:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jiahong); users with modify permissions: Set(jiahong)
15/08/02 10:58:15 INFO Slf4jLogger: Slf4jLogger started
15/08/02 10:58:15 INFO Remoting: Starting remoting
15/08/02 10:58:15 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.199.187:37917]
15/08/02 10:58:15 INFO Utils: Successfully started service 'sparkDriver' on port 37917.
15/08/02 10:58:15 INFO SparkEnv: Registering MapOutputTracker
15/08/02 10:58:15 INFO SparkEnv: Registering BlockManagerMaster
15/08/02 10:58:15 INFO DiskBlockManager: Created local directory at /tmp/spark-a2cbde0d-0951-4a95-80df-a99a14127efc/blockmgr-3cbdae80-810a-4ecf-b012-0979b3d714d0
15/08/02 10:58:15 INFO MemoryStore: MemoryStore started with capacity 469.5 MB
15/08/02 10:58:15 INFO HttpFileServer: HTTP File server directory is /tmp/spark-67629167-df98-4e7e-afa1-4dd36b655012/httpd-28cb8de9-caa4-4600-9704-347cea890b07
15/08/02 10:58:15 INFO HttpServer: Starting HTTP Server
15/08/02 10:58:15 INFO Server: jetty-8.y.z-SNAPSHOT
15/08/02 10:58:15 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:50336
15/08/02 10:58:15 INFO Utils: Successfully started service 'HTTP file server' on port 50336.
15/08/02 10:58:15 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/02 10:58:15 INFO Server: jetty-8.y.z-SNAPSHOT
15/08/02 10:58:15 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040
15/08/02 10:58:15 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/08/02 10:58:15 INFO SparkUI: Started SparkUI at http://jiahong-OptiPlex-7010.lan:4040
15/08/02 10:58:15 INFO Executor: Starting executor ID <driver> on host localhost
15/08/02 10:58:15 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver@192.168.199.187:37917/user/HeartbeatReceiver
15/08/02 10:58:16 INFO NettyBlockTransferService: Server created on 40115
15/08/02 10:58:16 INFO BlockManagerMaster: Trying to register BlockManager
15/08/02 10:58:16 INFO BlockManagerMasterActor: Registering block manager localhost:40115 with 469.5 MB RAM, BlockManagerId(<driver>, localhost, 40115)
15/08/02 10:58:16 INFO BlockManagerMaster: Registered BlockManager
15/08/02 10:58:16 INFO MemoryStore: ensureFreeSpace(182921) called with curMem=0, maxMem=492337889
15/08/02 10:58:16 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 178.6 KB, free 469.4 MB)
15/08/02 10:58:16 INFO MemoryStore: ensureFreeSpace(25432) called with curMem=182921, maxMem=492337889
15/08/02 10:58:16 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.8 KB, free 469.3 MB)
15/08/02 10:58:16 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40115 (size: 24.8 KB, free: 469.5 MB)
15/08/02 10:58:16 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/08/02 10:58:16 INFO SparkContext: Created broadcast 0 from textFile at Test.scala:17
15/08/02 10:58:16 INFO FileInputFormat: Total input paths to process : 1
MapPartitionsRDD[7] at map at Test.scala:20
15/08/02 10:58:16 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/08/02 10:58:16 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/08/02 10:58:16 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/08/02 10:58:16 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/08/02 10:58:16 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/08/02 10:58:16 INFO SparkContext: Starting job: saveAsTextFile at Test.scala:24
15/08/02 10:58:16 INFO DAGScheduler: Registering RDD 3 (map at Test.scala:20)
15/08/02 10:58:16 INFO DAGScheduler: Registering RDD 5 (map at Test.scala:20)
15/08/02 10:58:16 INFO DAGScheduler: Got job 0 (saveAsTextFile at Test.scala:24) with 1 output partitions (allowLocal=false)
15/08/02 10:58:16 INFO DAGScheduler: Final stage: Stage 2(saveAsTextFile at Test.scala:24)
15/08/02 10:58:16 INFO DAGScheduler: Parents of final stage: List(Stage 1)
15/08/02 10:58:16 INFO DAGScheduler: Missing parents: List(Stage 1)
15/08/02 10:58:16 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[3] at map at Test.scala:20), which has no missing parents
15/08/02 10:58:16 INFO MemoryStore: ensureFreeSpace(3640) called with curMem=208353, maxMem=492337889
15/08/02 10:58:16 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.6 KB, free 469.3 MB)
15/08/02 10:58:16 INFO MemoryStore: ensureFreeSpace(2614) called with curMem=211993, maxMem=492337889
15/08/02 10:58:16 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.6 KB, free 469.3 MB)
15/08/02 10:58:16 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:40115 (size: 2.6 KB, free: 469.5 MB)
15/08/02 10:58:16 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/08/02 10:58:16 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:839
15/08/02 10:58:16 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MapPartitionsRDD[3] at map at Test.scala:20)
15/08/02 10:58:16 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/08/02 10:58:16 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1301 bytes)
15/08/02 10:58:16 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/08/02 10:58:16 INFO HadoopRDD: Input split: file:/home/jiahong/sparkWorkSpace/input/test.txt:0+62
15/08/02 10:58:16 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2003 bytes result sent to driver
15/08/02 10:58:16 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 81 ms on localhost (1/1)
15/08/02 10:58:16 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/08/02 10:58:16 INFO DAGScheduler: Stage 0 (map at Test.scala:20) finished in 0.092 s
15/08/02 10:58:16 INFO DAGScheduler: looking for newly runnable stages
15/08/02 10:58:16 INFO DAGScheduler: running: Set()
15/08/02 10:58:16 INFO DAGScheduler: waiting: Set(Stage 1, Stage 2)
15/08/02 10:58:16 INFO DAGScheduler: failed: Set()
15/08/02 10:58:16 INFO DAGScheduler: Missing parents for Stage 1: List()
15/08/02 10:58:16 INFO DAGScheduler: Missing parents for Stage 2: List(Stage 1)
15/08/02 10:58:16 INFO DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[5] at map at Test.scala:20), which is now runnable
15/08/02 10:58:16 INFO MemoryStore: ensureFreeSpace(3080) called with curMem=214607, maxMem=492337889
15/08/02 10:58:16 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.0 KB, free 469.3 MB)
15/08/02 10:58:16 INFO MemoryStore: ensureFreeSpace(2177) called with curMem=217687, maxMem=492337889
15/08/02 10:58:16 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.1 KB, free 469.3 MB)
15/08/02 10:58:16 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:40115 (size: 2.1 KB, free: 469.5 MB)
15/08/02 10:58:16 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/08/02 10:58:16 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/08/02 10:58:16 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MapPartitionsRDD[5] at map at Test.scala:20)
15/08/02 10:58:16 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/08/02 10:58:16 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1045 bytes)
15/08/02 10:58:16 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/08/02 10:58:16 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/08/02 10:58:16 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 3 ms
15/08/02 10:58:17 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1097 bytes result sent to driver
15/08/02 10:58:17 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 77 ms on localhost (1/1)
15/08/02 10:58:17 INFO DAGScheduler: Stage 1 (map at Test.scala:20) finished in 0.077 s
15/08/02 10:58:17 INFO DAGScheduler: looking for newly runnable stages
15/08/02 10:58:17 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/08/02 10:58:17 INFO DAGScheduler: running: Set()
15/08/02 10:58:17 INFO DAGScheduler: waiting: Set(Stage 2)
15/08/02 10:58:17 INFO DAGScheduler: failed: Set()
15/08/02 10:58:17 INFO DAGScheduler: Missing parents for Stage 2: List()
15/08/02 10:58:17 INFO DAGScheduler: Submitting Stage 2 (MapPartitionsRDD[8] at saveAsTextFile at Test.scala:24), which is now runnable
15/08/02 10:58:17 INFO MemoryStore: ensureFreeSpace(127696) called with curMem=219864, maxMem=492337889
15/08/02 10:58:17 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 124.7 KB, free 469.2 MB)
15/08/02 10:58:17 INFO MemoryStore: ensureFreeSpace(76648) called with curMem=347560, maxMem=492337889
15/08/02 10:58:17 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 74.9 KB, free 469.1 MB)
15/08/02 10:58:17 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on localhost:40115 (size: 74.9 KB, free: 469.4 MB)
15/08/02 10:58:17 INFO BlockManagerMaster: Updated info of block broadcast_3_piece0
15/08/02 10:58:17 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:839
15/08/02 10:58:17 INFO DAGScheduler: Submitting 1 missing tasks from Stage 2 (MapPartitionsRDD[8] at saveAsTextFile at Test.scala:24)
15/08/02 10:58:17 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/08/02 10:58:17 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, PROCESS_LOCAL, 1056 bytes)
15/08/02 10:58:17 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
15/08/02 10:58:17 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
15/08/02 10:58:17 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/08/02 10:58:17 INFO FileOutputCommitter: Saved output of task 'attempt_201508021058_0002_m_000000_2' to file:/home/jiahong/sparkWorkSpace/output/_temporary/0/task_201508021058_0002_m_000000
15/08/02 10:58:17 INFO SparkHadoopMapRedUtil: attempt_201508021058_0002_m_000000_2: Committed
15/08/02 10:58:17 INFO Executor: Finished task 0.0 in stage 2.0 (TID 2). 1828 bytes result sent to driver
15/08/02 10:58:17 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 138 ms on localhost (1/1)
15/08/02 10:58:17 INFO DAGScheduler: Stage 2 (saveAsTextFile at Test.scala:24) finished in 0.138 s
15/08/02 10:58:17 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
15/08/02 10:58:17 INFO DAGScheduler: Job 0 finished: saveAsTextFile at Test.scala:24, took 0.483353 s
MapPartitionsRDD[7] at map at Test.scala:20
Process finished with exit code 0

6. Results

The input directory contains a test.txt file; after the job runs, the output directory holds the word-count result files. (The original post showed the file contents as screenshots, which are not reproduced here.)
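As a hypothetical illustration of the format (example data, not the original screenshots): if test.txt contained

hello spark hello world
hello spark

then output/part-00000 would contain the (word, count) pairs in descending order of count:

(hello,3)
(spark,2)
(world,1)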


Note:

On the first run, you may encounter a problem like the following:

Exception in thread "main" java.lang.NoSuchMethodError: 

The fix: check which Scala version your Spark build uses (it is shown when Spark starts), install that same version locally, and then point IDEA's Scala SDK at it.

I originally had Scala 2.11.7 installed, which caused this error. Spark's Scala version turned out to be 2.10.4, so I installed that version, updated the setting in IDEA, and the program then ran correctly!
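If you manage dependencies with sbt instead of adding the Spark assembly and Scala jars by hand as above (an alternative setup, not what this post used), pinning matching versions in build.sbt avoids this kind of mismatch:

// build.sbt -- pin Scala to the version your Spark build was compiled against
name := "sparkTest"

scalaVersion := "2.10.4"

// %% appends the Scala binary version, resolving spark-core_2.10
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"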
