跟我一起数据挖掘(22)——spark入门

Spark简介

Spark是UC Berkeley AMP lab所开源的类Hadoop MapReduce的通用的并行,Spark,拥有Hadoop MapReduce所具有的优点;但不同于MapReduce的是Job中间输出结果可以保存在内存中,从而不再需要读写HDFS,因此Spark能更好地适用于数据挖掘与机器学习等需要迭代的map reduce的算法。

Spark优点

Spark是基于内存,是云计算领域的继Hadoop之后的下一代的最热门的通用的并行计算框架开源项目,尤其出色的支持Interactive Query、流计算、图计算等。
Spark在机器学习方面有着无与伦比的优势,特别适合需要多次迭代计算的算法。同时Spark的拥有非常出色的容错和调度机制,确保系统的稳定运行,Spark目前的发展理念是通过一个计算框架集合SQL、Machine Learning、Graph Computing、Streaming Computing等多种功能于一个项目中,具有非常好的易用性。目前SPARK已经构建了自己的整个大数据处理生态系统,如流处理、图技术、机器学习、NoSQL查询等方面都有自己的技术,并且是Apache顶级Project,可以预计的是2014年下半年在社区和商业应用上会有爆发式的增长。Spark最大的优势在于速度,在迭代处理计算方面比Hadoop快100倍以上;Spark另外一个无可取代的优势是:“One Stack to rule them all”,Spark采用一个统一的技术堆栈解决了云计算大数据的所有核心问题,这直接奠定了其一统云计算大数据领域的霸主地位;

下图是使用逻辑回归算法的使用时间:

Spark目前支持scala、python、JAVA编程。

作为Spark的原生语言,scala是开发Spark应用程序的首选,其优雅简洁的代码,令开发过mapreduce代码的码农感觉象是上了天堂。

可以架构在hadoop之上,读取hadoop、hbase数据。

spark的部署方式

1、standalone模式,即独立模式,自带完整的服务,可单独部署到一个集群中,无需依赖任何其他资源管理系统。

2、Spark On Mesos模式。这是很多公司采用的模式,官方推荐这种模式(当然,原因之一是血缘关系)。

3、Spark On YARN模式。这是一种最有前景的部署模式。

spark本机安装

流程:进入linux->安装JDK->安装scala->安装spark。

JDK的安装和配置(略)。

安装scala,进入http://www.scala-lang.org/download/下载。

下载后解压缩。

tar zxvf scala-2.11.6.tgz
//改名
mv scala-2.11.6 scala
//设置配置
export SCALA_HOME=/home/hadoop/software/scala
export PATH=$SCALA_HOME/bin;$PATH

source /etc/profile

scala -version
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL

scala设置成功。

http://spark.apache.org/downloads.html下载spark并安装。

下载后解压缩。

进入$SPARK_HOME/bin,运行

./run-example SparkPi

运行结果

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark‘s default log4j profile: org/apache/spark/log4j-defaults.properties
15/03/14 23:41:40 INFO SparkContext: Running Spark version 1.3.0
15/03/14 23:41:40 WARN Utils: Your hostname, localhost.localdomain resolves to a loopback address: 127.0.0.1; using 192.168.126.147 instead (on interface eth0)
15/03/14 23:41:40 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/03/14 23:41:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/14 23:41:41 INFO SecurityManager: Changing view acls to: hadoop
15/03/14 23:41:41 INFO SecurityManager: Changing modify acls to: hadoop
15/03/14 23:41:41 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/03/14 23:41:42 INFO Slf4jLogger: Slf4jLogger started
15/03/14 23:41:42 INFO Remoting: Starting remoting
15/03/14 23:41:42 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:60926]
15/03/14 23:41:42 INFO Utils: Successfully started service ‘sparkDriver‘ on port 60926.
15/03/14 23:41:42 INFO SparkEnv: Registering MapOutputTracker
15/03/14 23:41:43 INFO SparkEnv: Registering BlockManagerMaster
15/03/14 23:41:43 INFO DiskBlockManager: Created local directory at /tmp/spark-285a6144-217c-442c-bfde-4b282378ac1e/blockmgr-f6cb0d15-d68d-4079-a0fe-9ec0bf8297a4
15/03/14 23:41:43 INFO MemoryStore: MemoryStore started with capacity 265.1 MB
15/03/14 23:41:43 INFO HttpFileServer: HTTP File server directory is /tmp/spark-96b3f754-9cad-4ef8-9da7-2a2c5029c42a/httpd-b28f3f6d-73f7-46d7-9078-7ba7ea84ca5b
15/03/14 23:41:43 INFO HttpServer: Starting HTTP Server
15/03/14 23:41:43 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/14 23:41:43 INFO AbstractConnector: Started [email protected]:42548
15/03/14 23:41:43 INFO Utils: Successfully started service ‘HTTP file server‘ on port 42548.
15/03/14 23:41:43 INFO SparkEnv: Registering OutputCommitCoordinator
15/03/14 23:41:43 INFO Server: jetty-8.y.z-SNAPSHOT
15/03/14 23:41:43 INFO AbstractConnector: Started [email protected]:4040
15/03/14 23:41:43 INFO Utils: Successfully started service ‘SparkUI‘ on port 4040.
15/03/14 23:41:43 INFO SparkUI: Started SparkUI at http://192.168.126.147:4040
15/03/14 23:41:44 INFO SparkContext: Added JAR file:/home/hadoop/software/spark-1.3.0-bin-hadoop2.4/lib/spark-examples-1.3.0-hadoop2.4.0.jar at http://192.168.126.147:42548/jars/spark-examples-1.3.0-hadoop2.4.0.jar with timestamp 1426347704488
15/03/14 23:41:44 INFO Executor: Starting executor ID <driver> on host localhost
15/03/14 23:41:44 INFO AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://[email protected]:60926/user/HeartbeatReceiver
15/03/14 23:41:44 INFO NettyBlockTransferService: Server created on 39408
15/03/14 23:41:44 INFO BlockManagerMaster: Trying to register BlockManager
15/03/14 23:41:44 INFO BlockManagerMasterActor: Registering block manager localhost:39408 with 265.1 MB RAM, BlockManagerId(<driver>, localhost, 39408)
15/03/14 23:41:44 INFO BlockManagerMaster: Registered BlockManager
15/03/14 23:41:45 INFO SparkContext: Starting job: reduce at SparkPi.scala:35
15/03/14 23:41:45 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35) with 2 output partitions (allowLocal=false)
15/03/14 23:41:45 INFO DAGScheduler: Final stage: Stage 0(reduce at SparkPi.scala:35)
15/03/14 23:41:45 INFO DAGScheduler: Parents of final stage: List()
15/03/14 23:41:45 INFO DAGScheduler: Missing parents: List()
15/03/14 23:41:45 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31), which has no missing parents
15/03/14 23:41:45 INFO MemoryStore: ensureFreeSpace(1848) called with curMem=0, maxMem=278019440
15/03/14 23:41:45 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1848.0 B, free 265.1 MB)
15/03/14 23:41:45 INFO MemoryStore: ensureFreeSpace(1296) called with curMem=1848, maxMem=278019440
15/03/14 23:41:45 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1296.0 B, free 265.1 MB)
15/03/14 23:41:45 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:39408 (size: 1296.0 B, free: 265.1 MB)
15/03/14 23:41:45 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/03/14 23:41:45 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:839
15/03/14 23:41:45 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31)
15/03/14 23:41:45 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/03/14 23:41:45 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1340 bytes)
15/03/14 23:41:45 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1340 bytes)
15/03/14 23:41:45 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/03/14 23:41:45 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/03/14 23:41:45 INFO Executor: Fetching http://192.168.126.147:42548/jars/spark-examples-1.3.0-hadoop2.4.0.jar with timestamp 1426347704488
15/03/14 23:41:45 INFO Utils: Fetching http://192.168.126.147:42548/jars/spark-examples-1.3.0-hadoop2.4.0.jar to /tmp/spark-db1e742b-020f-4db1-9ee3-f3e2d90e1bc2/userFiles-96c6db61-e95e-4f9e-a6c4-0db892583854/fetchFileTemp5600234414438914634.tmp
15/03/14 23:41:46 INFO Executor: Adding file:/tmp/spark-db1e742b-020f-4db1-9ee3-f3e2d90e1bc2/userFiles-96c6db61-e95e-4f9e-a6c4-0db892583854/spark-examples-1.3.0-hadoop2.4.0.jar to class loader
15/03/14 23:41:47 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 736 bytes result sent to driver
15/03/14 23:41:47 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 736 bytes result sent to driver
15/03/14 23:41:47 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1560 ms on localhost (1/2)
15/03/14 23:41:47 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 1540 ms on localhost (2/2)
15/03/14 23:41:47 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/03/14 23:41:47 INFO DAGScheduler: Stage 0 (reduce at SparkPi.scala:35) finished in 1.578 s
15/03/14 23:41:47 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:35, took 2.099817 s
Pi is roughly 3.14438
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
15/03/14 23:41:47 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
15/03/14 23:41:47 INFO SparkUI: Stopped Spark web UI at http://192.168.126.147:4040
15/03/14 23:41:47 INFO DAGScheduler: Stopping DAGScheduler
15/03/14 23:41:47 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/03/14 23:41:47 INFO MemoryStore: MemoryStore cleared
15/03/14 23:41:47 INFO BlockManager: BlockManager stopped
15/03/14 23:41:47 INFO BlockManagerMaster: BlockManagerMaster stopped
15/03/14 23:41:47 INFO OutputCommitCoordinator$OutputCommitCoordinatorActor: OutputCommitCoordinator stopped!
15/03/14 23:41:47 INFO SparkContext: Successfully stopped SparkContext
15/03/14 23:41:47 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/03/14 23:41:47 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.

可以看到输出结果为3.14438。

时间: 2024-10-10 02:51:55

跟我一起数据挖掘(22)——spark入门的相关文章

Spark入门实战系列--8.Spark MLlib(下)--机器学习库SparkMLlib实战

[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送--Spark入门实战系列>获取 1.MLlib实例 1.1 聚类实例 1.1.1 算法说明 聚类(Cluster analysis)有时也被翻译为簇类,其核心任务是:将一组目标object划分为若干个簇,每个簇之间的object尽可能相似,簇与簇之间的object尽可能相异.聚类算法是机器学习(或者说是数据挖掘更合适)中重要的一部分,除了最为简单的K-Means聚类算法外,比较常见的还有层次法(CURE.CHAMELEON等).网

Spark入门实战系列--8.Spark MLlib(上)--机器学习及SparkMLlib简介

[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送--Spark入门实战系列>获取 1.机器学习概念 1.1 机器学习的定义 在维基百科上对机器学习提出以下几种定义: l“机器学习是一门人工智能的科学,该领域的主要研究对象是人工智能,特别是如何在经验学习中改善具体算法的性能”. l“机器学习是对能通过经验自动改进的计算机算法的研究”. l“机器学习是用数据或以往的经验,以此优化计算机程序的性能标准.” 一种经常引用的英文定义是:A computer program is said

使用scala开发spark入门总结

使用scala开发spark入门总结 一.spark简单介绍 关于spark的介绍网上有很多,可以自行百度和google,这里只做简单介绍.推荐简单介绍连接:http://blog.jobbole.com/89446/ 1.    spark是什么? Spark是UC Berkeley AMP lab (加州大学伯克利分校的AMP实验室)所开源的类Hadoop MapReduce的通用并行框架.一般配合hadoop使用,可以增强hadoop的计算性能. 2.    Spark的优点有哪些? Sp

Spark入门实战系列--2.Spark编译与部署(中)--Hadoop编译安装

[注]该系列文章以及使用到安装包/測试数据 能够在<[倾情大奉送–Spark入门实战系列] (http://blog.csdn.net/yirenboy/article/details/47291765)>获取 1 编译Hadooop 1.1 搭建好开发环境 1.1.1 安装并设置maven 1.下载maven安装包.建议安装3.0以上版本号,本次安装选择的是maven3.0.5的二进制包,下载地址例如以下 http://mirror.bit.edu.cn/apache/maven/maven

Spark入门实战系列--3.Spark编程模型(上)--概念及SparkShell实战

[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送–Spark入门实战系列>获取 1 Spark编程模型 1.1 术语定义 应用程序(Application): 基于Spark的用户程序,包含了一个Driver Program 和集群中多个的Executor: 驱动程序(Driver Program):运行Application的main()函数并且创建SparkContext,通常用SparkContext代表Driver Program: 执行单元(Executor): 是为某

Spark入门实战系列--1.Spark及其生态圈简介

[注]该系列文章以及使用到安装包/测试数据 可以在<倾情大奉送--Spark入门实战系列>获取 1.简介 1.1 Spark简介 Spark是加州大学伯克利分校AMP实验室(Algorithms, Machines, and People Lab)开发通用内存并行计算框架.Spark在2013年6月进入Apache成为孵化项目,8个月后成为Apache顶级项目,速度之快足见过人之处,Spark以其先进的设计理念,迅速成为社区的热门项目,围绕着Spark推出了Spark SQL.Spark St

Apache spark入门精品文章

本文聚焦 Apache Spark 入门,了解其在大数据领域的地位,覆盖 Apache Spark 的安装及应用程序的建立,并解释一些常见的行为和操作. 一. 为什么要使用 Apache Spark 时下,我们正处在一个“大数据”的时代,每时每刻,都有各种类型的数据被生产.而在此紫外,数据增幅的速度也在显著增加.从广义上看,这些数据包含交易数据.社交媒体内容(比如文本.图像和视频)以及传感器数据.那么,为什么要在这些内容上投入如此多精力,其原因无非就是从海量数据中提取洞见可以对生活和生产实践进行

Spark入门(Python版)

Hadoop是对大数据集进行分布式计算的标准工具,这也是为什么当你穿过机场时能看到”大数据(Big Data)”广告的原因.它已经成为大数据的操作系统,提供了包括工具和技巧在内的丰富生态系统,允许使用相对便宜的商业硬件集群进行超级计算机级别的计 算.2003和2004年,两个来自Google的观点使Hadoop成为可能:一个分布式存储框架(Google文件系统),在Hadoop中被实现为HDFS:一个分布式计算框架(MapReduce). 这两个观点成为过去十年规模分析(scaling anal

Spark入门——什么是Hadoop,为什么是Spark?

#Spark入门#这个系列课程,是综合于我从2017年3月分到今年7月份为止学习并使用Spark的使用心得感悟,暂定于每周更新,以后可能会上传讲课视频和PPT,目前先在博客园把稿子打好.注意:这只是一个草稿,里面关于知识的误解还请各大网友监督,我们互相进步.总而言之,网络上的知识学会断舍离,学会带着辩证的眼光去学习就能进步. 谈到Spark,实际上从16年本科实习的时候就已经开始接触,那个时候是从Pig入门大数据Hadoop的,有兴趣的朋友可以看Hadoop家族之Pig入门:当时大家基本都在使用

Spark入门(Python)

Hadoop是对大数据集进行分布式计算的标准工具,这也是为什么当你穿过机场时能看到”大数据(Big Data)”广告的原因.它已经成为大数据的操作系统,提供了包括工具和技巧在内的丰富生态系统,允许使用相对便宜的商业硬件集群进行超级计算机级别的计算.2003和2004年,两个来自Google的观点使Hadoop成为可能:一个分布式存储框架(Google文件系统),在Hadoop中被实现为HDFS:一个分布式计算框架(MapReduce). 这两个观点成为过去十年规模分析(scaling analy