Spark Programming--Fundamental operation

sc.parallelize():创建RDD,建议使用xrange

getNumPartitions():获取分区数

glom():以分区为单位返回list

collect():返回list(一般是返回driver program)

例子:

sc.textFile(path):读取文件,返回RDD

官网函数:textFile(nameminPartitions=Noneuse_unicode=True)

支持读取文件:a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

例子(本地文件读取)

时间: 2024-10-05 18:28:47

Spark Programming--Fundamental operation的相关文章

Spark1.1.0 Spark Programming Guide

Spark Programming Guide Overview Linking with Spark Initializing Spark Using the Shell Resilient Distributed Datasets (RDDs) Parallelized Collections External Datasets RDD Operations Basics Passing Functions to Spark Working with Key-Value Pairs Tran

Spark Programming Guide 中文版

Spark Guide Programming Guide 中文翻译 : Git地址:https://github.com/ChenZhongPu/SparkGuideGitBook GitBook 地址:http://chenzhongpu.gitbooks.io/sparkguide/

<Spark><Programming><Key/Value Pairs><RDD>

Working with key/value Pairs Motivation Pair RDDs are a useful building block in many programs, as they expose operations that allow u to act on each key in parallel or regroup data across network. Eg: pair RDDs have a reduceByKey() method that can a

Spark1.1.0 Spark Streaming Programming Guide

Spark Streaming Programming Guide Overview A Quick Example Basic Concepts Linking Initializing StreamingContext Discretized Streams (DStreams) Input DStreams Transformations on DStreams Output Operations on DStreams Caching / Persistence Checkpointin

Apache Spark 2.2.0 中文文档 - GraphX Programming Guide | ApacheCN

GraphX Programming Guide 概述 入门 属性 Graph 示例属性 Graph Graph 运算符 运算符的汇总表 Property 运算符 Structural 运算符 Join 运算符 邻域聚合 聚合消息 (aggregateMessages) Map Reduce Triplets Transition Guide (Legacy) 计算级别信息 收集相邻点 Caching and Uncaching Pregel API Graph 建造者 Vertex and E

【转载】Getting Started with Spark (in Python)

Getting Started with Spark (in Python) Benjamin Bengfort Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has becom

Spark学习笔记

Spark 阅读官方文档 Spark Quick Start Spark Programming Guide Spark SQL, DataFrames and Datasets Guide Cluster Mode Overview Spark Standalone Mode 重要的概念:resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluste

spark 笔记 2: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf ucb关于spark的论文,对spark中核心组件RDD最原始.本质的理解,没有比这个更好的资料了.必读. Abstract RDDs provide a restricted form of shared memory, based on coarse grained transformations rather than fine-grained updates to s

Apache Spark 2.2.0 中文文档 - Spark Streaming 编程指南 | ApacheCN

Spark Streaming 编程指南 概述 一个入门示例 基础概念 依赖 初始化 StreamingContext Discretized Streams (DStreams)(离散化流) Input DStreams 和 Receivers(接收器) DStreams 上的 Transformations(转换) DStreams 上的输出操作 DataFrame 和 SQL 操作 MLlib 操作 缓存 / 持久性 Checkpointing Accumulators, Broadcas

Spark Memory Tuning (内存调优)

调整 Spark 应用程序的内存使用情况和 GC behavior 已经有很多的讨论在 Tuning Guide 中.我们强烈建议您阅读一下.在本节中, 我们将在 Spark Streaming applications 的上下文中讨论一些 tuning parameters (调优参数). Spark Streaming application 所需的集群内存量在很大程度上取决于所使用的 transformations 类型.例如, 如果要在最近 10 分钟的数据中使用 window oper