Reading Learning Spark: Lightning-Fast Big Data Analysis

Chapter 1: Introduction to Data Analysis with Spark

The components of Spark

  Spark Core: contains the basic functionality of Spark. Spark Core is also home to the API that defines resilient distributed datasets (RDDs).

  Spark SQL (structured data): the package for working with structured data. It allows querying data via SQL as well as the Apache Hive variant of SQL, and it supports many sources of data, including Hive tables, Parquet, and JSON. Spark SQL also lets developers intermix SQL queries with the programmatic data manipulation supported by RDDs in Python, Java, and Scala (see the Python sketch after this component list).

  Spark Streaming (real-time): enables processing of live streams of data.

  MLlib (machine learning): a library of common machine learning functionality.

  GraphX (graph processing): a library for manipulating graphs.
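
  To make the Spark SQL point above concrete (mixing SQL queries with programmatic RDD manipulation), here is a minimal Python sketch, assuming a local installation with PySpark available and Spark 1.3 or later; the table name people and the sample rows are illustrative only.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext("local", "SqlAndRddDemo")
    sqlContext = SQLContext(sc)

    # Build an RDD programmatically, then register it as a table for SQL.
    rows = sc.parallelize([Row(name="Alice", age=34), Row(name="Bob", age=19)])
    people = sqlContext.createDataFrame(rows)   # assumes the Spark 1.3+ API
    people.registerTempTable("people")

    # Query the table with SQL ...
    adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")

    # ... and keep manipulating the result with RDD operations.
    print(adults.rdd.map(lambda r: r.name.upper()).collect())

    sc.stop()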

A Brief History of Spark

  Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers.

Chapter 2: Downloading Spark and Getting Started

  Walking through the process of downloading and running Spark in local mode on a single computer.

  You don't need to master Scala, Java, or Python. Spark itself is written in Scala, and runs on the Java Virtual Machine (JVM). To run Spark on either your laptop or a cluster, all you need is an installation of Java 6 or newer. If you wish to use the Python API you will also need a Python interpreter (version 2.6 or newer). Spark does not yet work with Python 3.

Downloading Spark: select the "Pre-built for Hadoop 2.4 and later" package.

Tips:

Windows users may run into issues installing Spark. You can use a zip/tar extraction tool to untar the .tar file. Note: install Spark in a directory whose path contains no spaces (e.g., C:\spark).

After you untar the file you will get a new directory with the same name but without the final .tar suffix.

Damn it (quoting the book):

Most of this book includes code in all of Spark’s languages, but interactive shells are
available only in Python and Scala. Because a shell is very useful for learning the API, we recommend using one of these languages for these examples even if you are a Java
developer. The API is similar in every language.

Change into the Spark directory and type bin\pyspark; you will see the Spark logo and the Python interpreter prompt.
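
For example, once the shell is up you can run a first job right away. This is a minimal interactive session sketch; it assumes you launched the shell from the Spark directory so the bundled README.md is present, and it uses the SparkContext the shell already provides as sc:

    >>> lines = sc.textFile("README.md")   # create an RDD of the file's lines
    >>> lines.count()                      # action: number of lines in the RDD
    >>> lines.first()                      # action: the first line of README.md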
