Spark Programming--Actions II

saveAsTextFile

saveAsTextFile(path, compressionCodecClass=None)

saveAsTextFile saves an RDD to a file system as a text file, storing each element as a string (it works well together with Python's loads and dumps, e.g. from the json module).

Parameters:

  • path – path to text file
  • compressionCodecClass – (None by default) a string, e.g. "org.apache.hadoop.io.compress.GzipCodec", specifying the class name of the compression codec to use

Example:
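The original example is not preserved here; below is a minimal sketch of the usage described above (the output paths and sample records are placeholders), combining saveAsTextFile with json.dumps:

```python
from pyspark import SparkContext
import json

sc = SparkContext(appName="saveAsTextFileExample")

# Each element is written out as one line of text, so structured records
# are serialized to strings with json.dumps (and can be parsed back later
# with json.loads when the files are re-read).
records = sc.parallelize([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
records.map(json.dumps).saveAsTextFile("hdfs:///tmp/text_out")

# Compress the output by passing a Hadoop codec class name.
records.map(json.dumps).saveAsTextFile(
    "hdfs:///tmp/text_out_gz",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)
```

Each partition becomes one part-NNNNN file under the output directory; reading the directory back with sc.textFile and mapping json.loads restores the original records.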

saveAsSequenceFile

sequenceFile(path, keyClass=None, valueClass=None, keyConverter=None, valueConverter=None, minSplits=None, batchSize=0)

Parameters:

  • path – path to sequence file
  • keyClass – fully qualified classname of key Writable class (e.g. “org.apache.hadoop.io.Text”)
  • valueClass – fully qualified classname of value Writable class (e.g. “org.apache.hadoop.io.LongWritable”)
  • keyConverter –
  • valueConverter –
  • minSplits – minimum splits in dataset (default min(2, sc.defaultParallelism))
  • batchSize – The number of Python objects represented as a single Java object. (default 0, choose batchSize automatically)

saveAsSequenceFile saves an RDD to HDFS in the SequenceFile format.

When saving, the output is written to HDFS by default, and the original key/value format is preserved.

Example:
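The original example is missing; a minimal sketch (the output path is a placeholder, and sc is assumed to be an existing SparkContext, e.g. from the pyspark shell):

```python
# sc: an existing SparkContext (e.g. from the pyspark shell).
# A pair RDD of (key, value) tuples; keys and values are converted to
# Hadoop Writables (here IntWritable / Text) automatically.
pairs = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
pairs.saveAsSequenceFile("hdfs:///tmp/seq_out")
```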

Check the files on HDFS, and get them locally to inspect the file format:
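Besides fetching the output with the HDFS command line, one way to check the stored format directly from PySpark is to read it back with sc.sequenceFile (whose signature is listed above); a sketch assuming the placeholder path from the previous example:

```python
# Read the SequenceFile back; keys and values are converted from the
# underlying Writables back into Python objects.
restored = sc.sequenceFile("hdfs:///tmp/seq_out")
print(restored.collect())  # e.g. [(1, 'a'), (2, 'b'), (3, 'c')]
```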

saveAsHadoopFile

saveAsHadoopDataset

saveAsNewAPIHadoopFile

saveAsNewAPIHadoopDataset
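The original post gives no examples for these four actions. As one hedged illustration, saveAsNewAPIHadoopFile takes an output path plus fully qualified OutputFormat, key, and value class names (the path and classes below are chosen for illustration, not taken from the original):

```python
# sc: an existing SparkContext (e.g. from the pyspark shell).
pairs = sc.parallelize([(1, "a"), (2, "b")])
pairs.saveAsNewAPIHadoopFile(
    "hdfs:///tmp/newapi_out",
    "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat",
    keyClass="org.apache.hadoop.io.IntWritable",
    valueClass="org.apache.hadoop.io.Text",
)
```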
