Running Spark WordCount on Windows 7: Process and Exceptions

The code in SparkWordCount.scala is as follows:

package com.husor.Spark

/**
 * Created by huxiu on 2014/11/26.
 */

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.SparkContext._

object SparkWordCount {

  def main(args: Array[String]) {

    println("Test is starting......")

    System.setProperty("hadoop.home.dir", "d:\\winutil\\")

    //val conf = new SparkConf().setAppName("WordCount")
    //                          .setMaster("spark://Master:7077")
    //                          .setSparkHome("SPARK_HOME")
    //                          .set("spark.cores.max","2")

    //val spark = new SparkContext(conf)
    // Run the WordCount program in local mode
    val spark = new SparkContext("local", "WordCount", System.getenv("SPARK_HOME"))

    val file = spark.textFile("hdfs://Master:9000/data/test1")

    // Print the results directly to the console
    //file.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).collect().foreach(println)

    val wordCounts = file.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_)

    // Write the results directly to HDFS
    wordCounts.saveAsTextFile("hdfs://Master:9000/user/huxiu/WordCountOutput")

    spark.stop()

    println("Test is Succeed!!!")
  }
}
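
For reference, the commented-out SparkConf-based setup above would look roughly like the following sketch when targeting the standalone cluster. The master URL, application name, core limit, and HDFS paths are taken from the comments and code above; the SPARK_HOME fallback path is a placeholder assumption.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object SparkWordCountCluster {

  def main(args: Array[String]) {
    // Roughly the configuration hinted at in the commented-out lines:
    // standalone master, explicit Spark home, and a cap of 2 cores.
    val conf = new SparkConf()
      .setAppName("WordCount")
      .setMaster("spark://Master:7077")
      .setSparkHome(sys.env.getOrElse("SPARK_HOME", "/usr/local/spark")) // fallback path is a placeholder
      .set("spark.cores.max", "2")

    val sc = new SparkContext(conf)
    val counts = sc.textFile("hdfs://Master:9000/data/test1")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://Master:9000/user/huxiu/WordCountOutput")
    sc.stop()
  }
}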

The exceptions encountered while running the WordCount program above are as follows:

Exception 1:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries ...............

Reason: a known Hadoop-on-Windows issue — Hadoop looks for winutils.exe under %HADOOP_HOME%\bin, and since hadoop.home.dir / HADOOP_HOME is not set, the path resolves to null\bin\winutils.exe.

Solution: http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7

Namely,

1) Download a compiled winutils.exe from
http://social.msdn.microsoft.com/Forums/windowsazure/en-US/28a57efb-082b-424b-8d9e-731b1fe135de/please-read-if-experiencing-job-failures?forum=hdinsight
2) Put the file into d:\winutil\bin
3) In the driver code, set System.setProperty("hadoop.home.dir", "d:\\winutil\\")
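
As an optional sanity check (a sketch only, assuming the d:\winutil layout from the steps above), the driver can verify that winutils.exe is in place and set hadoop.home.dir before the SparkContext is created:

import java.io.File

object WinutilsCheck {

  /** Set hadoop.home.dir and fail fast if winutils.exe is missing. */
  def configureHadoopHome(hadoopHome: String = "d:\\winutil\\"): Unit = {
    val winutils = new File(hadoopHome, "bin/winutils.exe")
    require(winutils.exists(),
      s"winutils.exe not found at ${winutils.getAbsolutePath}; " +
        "download it and place it under <hadoop.home.dir>\\bin")
    System.setProperty("hadoop.home.dir", hadoopHome)
  }
}

Calling WinutilsCheck.configureHadoopHome() as the first statement in main() replaces the bare System.setProperty call and fails fast with a clear message if the file is missing.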

Exception 2:

Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: user=huxiu, access=WRITE, inode="/":Spark:supergroup:drwxr-xr-x

at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:265)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:251)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:232)
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:176)

Reason:

From the exception above it is easy to see that the job, running as user huxiu, is trying to create a directory under "/", which is owned by user Spark in group supergroup. Since other users have no write access to "/" (rwxr-xr-x), the write fails for huxiu.
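
Before changing anything, the ownership of "/" can be confirmed with hadoop fs -ls /, or programmatically with the Hadoop FileSystem API as in the sketch below (the NameNode URI is the one used in the WordCount code; adjust it for your cluster):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsPermissionCheck {

  def main(args: Array[String]) {
    // NameNode address taken from the WordCount example; adjust for your cluster.
    val fs = FileSystem.get(new URI("hdfs://Master:9000"), new Configuration())
    val status = fs.getFileStatus(new Path("/"))
    // Expected output for the failing case: owner=Spark group=supergroup perms=rwxr-xr-x
    println(s"owner=${status.getOwner} group=${status.getGroup} perms=${status.getPermission}")
    fs.close()
  }
}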

Solution: http://www.hadoopinrealworld.com/fixing-org-apache-hadoop-security-accesscontrolexception-permission-denied/

Namely,

1> To keep things clean and for better control, let's specify the location of the staging directory by setting the mapreduce.jobtracker.staging.root.dir property in mapred-site.xml. After the property is set, restart the MapReduce service for it to take effect.

<property>
    <name>mapreduce.jobtracker.staging.root.dir</name>
    <value>/user</value>
</property>

2> Several suggestions online recommend doing a chmod 777 on /user. This is not advisable, as it would allow any user to delete or modify other users' files in HDFS. Instead, create a folder named huxiu under /user as the HDFS superuser (in our case, Spark), and then change the ownership of that folder to huxiu.

[[email protected] hadoop]$ hadoop fs -mkdir /user
14/11/26 14:00:13 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[[email protected] hadoop]$ hadoop fs -mkdir /user/huxiu
14/11/26 14:04:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[[email protected] hadoop]$ hadoop fs -chown huxiu:huxiu /user/huxiu
14/11/26 14:04:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

After fixing the exceptions, the job ran successfully; the output is as follows:

"C:\Program Files\Java\jdk1.7.0_67\bin\java" -Didea.launcher.port=7546 "-Didea.launcher.bin.path=D:\ScalaIDE\IntelliJ IDEA Community Edition 14.0.1\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.7.0_67\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\javaws.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\jce.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\jfr.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\jfxrt.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\jsse.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\management-agent.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\plugin.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\resources.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\rt.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\ext\access-bridge-64.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\ext\dnsns.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\ext\jaccess.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\ext\localedata.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\ext\sunec.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\ext\sunjce_provider.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\ext\sunmscapi.jar;C:\Program Files\Java\jdk1.7.0_67\jre\lib\ext\zipfs.jar;D:\IntelliJ_IDE\WorkSpace\out\production\Test;D:\vagrant\data\Scala2.10.4\lib\scala-actors-migration.jar;D:\vagrant\data\Scala2.10.4\lib\scala-actors.jar;D:\vagrant\data\Scala2.10.4\lib\scala-library.jar;D:\vagrant\data\Scala2.10.4\lib\scala-reflect.jar;D:\vagrant\data\Scala2.10.4\lib\scala-swing.jar;D:\SparkSrc\spark-assembly-1.1.0-hadoop2.4.0.jar;D:\ScalaIDE\IntelliJ IDEA Community Edition 14.0.1\lib\idea_rt.jar" com.intellij.rt.execution.application.AppMain com.husor.Spark.SparkWordCount
Test is starting......
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/11/26 14:15:44 INFO SecurityManager: Changing view acls to: huxiu,
14/11/26 14:15:44 INFO SecurityManager: Changing modify acls to: huxiu,
14/11/26 14:15:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(huxiu, ); users with modify permissions: Set(huxiu, )
14/11/26 14:15:44 INFO Slf4jLogger: Slf4jLogger started
14/11/26 14:15:44 INFO Remoting: Starting remoting
14/11/26 14:15:44 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:54972]
14/11/26 14:15:44 INFO Remoting: Remoting now listens on addresses: [akka.tcp://[email protected]:54972]
14/11/26 14:15:44 INFO Utils: Successfully started service 'sparkDriver' on port 54972.
14/11/26 14:15:44 INFO SparkEnv: Registering MapOutputTracker
14/11/26 14:15:45 INFO SparkEnv: Registering BlockManagerMaster
14/11/26 14:15:45 INFO DiskBlockManager: Created local directory at C:\Users\huxiu\AppData\Local\Temp\spark-local-20141126141545-9dad
14/11/26 14:15:45 INFO Utils: Successfully started service 'Connection manager for block manager' on port 54975.
14/11/26 14:15:45 INFO ConnectionManager: Bound socket to port 54975 with id = ConnectionManagerId(huxiu-PC,54975)
14/11/26 14:15:45 INFO MemoryStore: MemoryStore started with capacity 969.6 MB
14/11/26 14:15:45 INFO BlockManagerMaster: Trying to register BlockManager
14/11/26 14:15:45 INFO BlockManagerMasterActor: Registering block manager huxiu-PC:54975 with 969.6 MB RAM
14/11/26 14:15:45 INFO BlockManagerMaster: Registered BlockManager
14/11/26 14:15:45 INFO HttpFileServer: HTTP File server directory is C:\Users\huxiu\AppData\Local\Temp\spark-423dcd83-624e-404a-bbf6-a1190f77290f
14/11/26 14:15:45 INFO HttpServer: Starting HTTP Server
14/11/26 14:15:45 INFO Utils: Successfully started service 'HTTP file server' on port 54976.
14/11/26 14:15:45 INFO Utils: Successfully started service 'SparkUI' on port 4040.
14/11/26 14:15:45 INFO SparkUI: Started SparkUI at http://huxiu-PC:4040
14/11/26 14:15:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/11/26 14:15:45 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://[email protected]:54972/user/HeartbeatReceiver
14/11/26 14:15:46 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=1016667832
14/11/26 14:15:46 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 969.4 MB)
14/11/26 14:15:46 INFO FileInputFormat: Total input paths to process : 1
14/11/26 14:15:46 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
14/11/26 14:15:46 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
14/11/26 14:15:46 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
14/11/26 14:15:46 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
14/11/26 14:15:46 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
14/11/26 14:15:47 INFO SparkContext: Starting job: saveAsTextFile at SparkWordCount.scala:33
14/11/26 14:15:47 INFO DAGScheduler: Registering RDD 3 (map at SparkWordCount.scala:32)
14/11/26 14:15:47 INFO DAGScheduler: Got job 0 (saveAsTextFile at SparkWordCount.scala:33) with 1 output partitions (allowLocal=false)
14/11/26 14:15:47 INFO DAGScheduler: Final stage: Stage 0(saveAsTextFile at SparkWordCount.scala:33)
14/11/26 14:15:47 INFO DAGScheduler: Parents of final stage: List(Stage 1)
14/11/26 14:15:47 INFO DAGScheduler: Missing parents: List(Stage 1)
14/11/26 14:15:47 INFO DAGScheduler: Submitting Stage 1 (MappedRDD[3] at map at SparkWordCount.scala:32), which has no missing parents
14/11/26 14:15:47 INFO MemoryStore: ensureFreeSpace(3424) called with curMem=163705, maxMem=1016667832
14/11/26 14:15:47 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.3 KB, free 969.4 MB)
14/11/26 14:15:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (MappedRDD[3] at map at SparkWordCount.scala:32)
14/11/26 14:15:47 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
14/11/26 14:15:47 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, ANY, 1174 bytes)
14/11/26 14:15:47 INFO Executor: Running task 0.0 in stage 1.0 (TID 0)
14/11/26 14:15:47 INFO HadoopRDD: Input split: hdfs://Master:9000/data/test1:0+27
14/11/26 14:15:47 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 1862 bytes result sent to driver
14/11/26 14:15:47 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 0) in 489 ms on localhost (1/1)
14/11/26 14:15:47 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
14/11/26 14:15:47 INFO DAGScheduler: Stage 1 (map at SparkWordCount.scala:32) finished in 0.500 s
14/11/26 14:15:47 INFO DAGScheduler: looking for newly runnable stages
14/11/26 14:15:47 INFO DAGScheduler: running: Set()
14/11/26 14:15:47 INFO DAGScheduler: waiting: Set(Stage 0)
14/11/26 14:15:47 INFO DAGScheduler: failed: Set()
14/11/26 14:15:47 INFO DAGScheduler: Missing parents for Stage 0: List()
14/11/26 14:15:47 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[5] at saveAsTextFile at SparkWordCount.scala:33), which is now runnable
14/11/26 14:15:47 INFO MemoryStore: ensureFreeSpace(57512) called with curMem=167129, maxMem=1016667832
14/11/26 14:15:47 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 56.2 KB, free 969.4 MB)
14/11/26 14:15:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 0 (MappedRDD[5] at saveAsTextFile at SparkWordCount.scala:33)
14/11/26 14:15:47 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
14/11/26 14:15:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 948 bytes)
14/11/26 14:15:47 INFO Executor: Running task 0.0 in stage 0.0 (TID 1)
14/11/26 14:15:47 INFO BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 50331648, targetRequestSize: 10066329
14/11/26 14:15:47 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
14/11/26 14:15:47 INFO BlockFetcherIterator$BasicBlockFetcherIterator: Started 0 remote fetches in 4 ms
14/11/26 14:15:48 INFO FileOutputCommitter: Saved output of task 'attempt_201411261415_0000_m_000000_1' to hdfs://Master:9000/user/huxiu/WordCountOutput/_temporary/0/task_201411261415_0000_m_000000
14/11/26 14:15:48 INFO SparkHadoopWriter: attempt_201411261415_0000_m_000000_1: Committed
14/11/26 14:15:48 INFO Executor: Finished task 0.0 in stage 0.0 (TID 1). 826 bytes result sent to driver
14/11/26 14:15:48 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 1) in 847 ms on localhost (1/1)
14/11/26 14:15:48 INFO DAGScheduler: Stage 0 (saveAsTextFile at SparkWordCount.scala:33) finished in 0.847 s
14/11/26 14:15:48 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
14/11/26 14:15:48 INFO SparkContext: Job finished: saveAsTextFile at SparkWordCount.scala:33, took 1.469630513 s
14/11/26 14:15:48 INFO SparkUI: Stopped Spark web UI at http://huxiu-PC:4040
14/11/26 14:15:48 INFO DAGScheduler: Stopping DAGScheduler
14/11/26 14:15:49 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
14/11/26 14:15:49 INFO ConnectionManager: Selector thread was interrupted!
14/11/26 14:15:49 INFO ConnectionManager: ConnectionManager stopped
14/11/26 14:15:49 INFO MemoryStore: MemoryStore cleared
14/11/26 14:15:49 INFO BlockManager: BlockManager stopped
14/11/26 14:15:49 INFO BlockManagerMaster: BlockManagerMaster stopped
Test is Succeed!!!
14/11/26 14:15:49 INFO SparkContext: Successfully stopped SparkContext
14/11/26 14:15:49 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
14/11/26 14:15:49 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
14/11/26 14:15:49 INFO Remoting: Remoting shut down
14/11/26 14:15:49 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

Process finished with exit code 0