Spark Example 1 -- wordCount

IDE: Scala IDE for Eclipse

Scala version: 2.10.4

Spark: 1.1.1

Contents of test.txt:

hello,world

hello,word

world,word,hello

1. Create a new Scala project

2. Add the Spark jars to the project's build path

3. The code

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit conversions, e.g. rddToPairRDDFunctions

object WordCount {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("word Count").setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile("test.txt")
    // split each line on commas, pair each word with 1, then sum the counts per word
    val mapRdd = textFile.flatMap(line => line.split(",")).map(x => (x, 1)).reduceByKey(_ + _)
    mapRdd.collect().foreach(println)
    sc.stop()
  }
}

4. Run

Start spark-shell from Spark's bin directory; once Spark has started, paste the code in and run it.
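Note that spark-shell already creates a SparkContext named sc, so inside the shell you only need the RDD pipeline itself, not the surrounding object or a new SparkContext. A minimal shell session, assuming test.txt sits in the directory the shell was launched from:

// inside spark-shell, sc already exists
val textFile = sc.textFile("test.txt")
textFile.flatMap(_.split(",")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)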

Output:

(hello,3)
(word,2)
(world,2)

Code analysis:

1. The line import org.apache.spark.SparkContext._ pulls in Spark's implicit conversions. reduceByKey is not defined on RDD itself but on PairRDDFunctions, and this import supplies the implicit conversion that wraps an RDD of key/value pairs. Without it, compilation fails with:

value reduceByKey is not a member of org.apache.spark.rdd.RDD[(String, Int)]
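To see what the implicit is doing, you can apply the conversion by hand. A sketch for Spark 1.x, where the conversion lives on the SparkContext companion object (later releases moved it onto the RDD object, and the exact signature varies slightly across versions):

// roughly what the compiler inserts for you after the import
val pairs = textFile.flatMap(_.split(",")).map((_, 1))
val counts = SparkContext.rddToPairRDDFunctions(pairs).reduceByKey(_ + _)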

2. The difference between map and flatMap

Compare these three snippets:

textFile.map(_.split(",")).collect().foreach(println)
textFile.map(_.split(",")).collect().foreach(x => println(x.mkString(",")))
textFile.flatMap(_.split(",")).collect().foreach(println)

Their outputs are, respectively:

[Ljava.lang.String;@736caf7a
[Ljava.lang.String;@4ce7fffa
[Ljava.lang.String;@497486b3  

hello,world
hello,word
world,word,hello  

hello
world
hello
word
world
word
hello

From snippets 1 and 2 you can see that map produces:

Array(Array("hello","world"), Array("hello","word"), Array("world","word","hello"))

while flatMap produces:

Array("hello","world","hello","word","world","word","hello")

In other words, flatMap is map followed by flattening the nested results one level.
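The same relationship holds for plain Scala collections, which makes it easy to check in the REPL without Spark (a standalone sketch):

val lines = List("hello,world", "hello,word", "world,word,hello")
lines.map(_.split(",").toList)          // List(List(hello, world), List(hello, word), List(world, word, hello))
lines.map(_.split(",").toList).flatten  // List(hello, world, hello, word, world, word, hello)
lines.flatMap(_.split(","))             // same result: flatMap = map + flatten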

So what is [Ljava.lang.String;@736caf7a? It is the JVM's default toString for a String array (Java arrays do not override toString). You can reproduce it in the Scala REPL:

val a = Array("hello")

println(a)  // prints [Ljava.lang.String;@cc4a0dd
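To print the contents rather than the reference, render the array yourself; in Scala 2.10 either of these works:

println(a.mkString(","))  // hello
println(a.deep)           // Array(hello)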

3. What each stage of the pipeline produces

flatMap --> map --> reduceByKey

Array("hello","world","hello","word","world","word","hello") -->

Array[(String, Int)](("hello",1),("world",1),("hello",1),("word",1),("world",1),("word",1),("hello",1)) -->

Array[(String, Int)](("hello",3),("world",2),("word",2))
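You can confirm this by splitting the chain and collecting each intermediate RDD (a sketch against the same test.txt; collect() is only safe here because the data is tiny):

val words  = textFile.flatMap(_.split(","))
val pairs  = words.map((_, 1))
val counts = pairs.reduceByKey(_ + _)
println(words.collect().mkString(", "))
println(pairs.collect().mkString(", "))
println(counts.collect().mkString(", "))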

4. Finding the word with the highest count

flatMap --> map --> reduceByKey --> reduce

val maxNum = mapRdd.reduce((a, b) => if (a._2 > b._2) a else b)

println(maxNum)

Output: (hello,3)
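An equivalent one-liner is takeOrdered with a descending ordering on the count (takeOrdered has been available on RDD since Spark 1.0):

// top 1 pair, ordered by descending count
mapRdd.takeOrdered(1)(Ordering.by[(String, Int), Int](-_._2)).foreach(println)  // (hello,3)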
