Big Data Basics: ORC (1) Introduction

https://orc.apache.org

Optimized Row Columnar (ORC) file

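Hierarchy:

file -> stripes -> row groups (10,000 rows by default)

Hybrid row-columnar storage: rows are grouped into stripes, and within each stripe the data is stored column by column.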

Background

Back in January 2013, we created ORC files as part of the initiative to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. The focus was on enabling high speed processing and reducing file sizes.

ORC was created to speed up Hive queries and to reduce the disk space used on Hadoop.

ORC is a self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written.

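ORC is a self-describing columnar file format. It is optimized for large streaming reads, but it also supports quickly locating the rows that are needed. Columnar storage lets the reader read, decompress, and process only the values it needs.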

Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions.
(In ORC, the minimum and maximum values of each column are recorded per file, per stripe (~1M rows), and every 10,000 rows. Using this information, the reader should skip any segment that could not possibly match the query predicate.
Predicate pushdown is amazing when it works, but for a lot of data sets, it doesn't work at all. If the data has a large number of distinct values and is well-shuffled, the minimum and maximum stats will cover almost the entire range of values, rendering predicate pushdown ineffective.)

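Predicate pushdown is a concept borrowed from RDBMSs: apply filter conditions as early as possible, without changing the result, so that much less data flows through the rest of the query.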

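ORC's predicate pushdown uses indexes to decide which stripes in a file need to be read, and it can narrow the search down to a set of 10,000 rows. It works because ORC keeps statistics, such as the minimum and maximum value of each column, per file, per stripe, and per 10,000 rows. Given a query condition, the reader can therefore cheaply decide whether a given file, stripe, or 10,000-row group needs to be read at all.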
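The min/max pruning described above can be sketched in a few lines of Python. This is a simplified illustration, not ORC's reader: `RowGroupStats`, `build_stats`, and `groups_to_read` are hypothetical names, and a single integer column stands in for a real file. It also shows why pruning stops helping when the data is well shuffled.

```python
import random
from dataclasses import dataclass

ROW_GROUP_SIZE = 10_000  # matches ORC's default row group size


@dataclass
class RowGroupStats:
    start: int     # index of the first row in the group
    minimum: int
    maximum: int


def build_stats(values, group_size=ROW_GROUP_SIZE):
    """Record min/max for every group of `group_size` consecutive values."""
    stats = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        stats.append(RowGroupStats(start, min(chunk), max(chunk)))
    return stats


def groups_to_read(stats, lo, hi):
    """Keep only row groups whose [min, max] range overlaps lo <= x <= hi."""
    return [s for s in stats if s.maximum >= lo and s.minimum <= hi]


values_sorted = list(range(100_000))        # 10 row groups, sorted
values_shuffled = values_sorted[:]
random.Random(0).shuffle(values_shuffled)   # same data, well shuffled

sorted_hits = groups_to_read(build_stats(values_sorted), 42_000, 42_500)
shuffled_hits = groups_to_read(build_stats(values_shuffled), 42_000, 42_500)

print(len(sorted_hits), "of 10 row groups read when the data is sorted")
print(len(shuffled_hits), "of 10 row groups read when the data is shuffled")
```

With sorted data only the one row group covering 40,000 to 49,999 overlaps the predicate. After shuffling, every group's min/max range spans nearly the whole value domain, so nothing can be skipped.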

ORC supports all of Hive's data types.

ORC files are divided into stripes that are roughly 64MB by default. The stripes in a file are independent of each other and form the natural unit of distributed work. Within each stripe, the columns are separated from each other so the reader can read just the columns that are required.

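An ORC file is split into multiple stripes, each roughly 64 MB and about 1 million rows. The stripes in a file are independent of each other, and within a stripe the columns are also stored separately.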

ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include lightweight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren't important for this query.

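ORC uses type-specific readers and writers to provide lightweight compression, which shrinks files dramatically. On top of this built-in lightweight compression, ORC can also apply a general-purpose codec such as zlib or Snappy.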
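To make the lightweight encodings named above more concrete, here is a minimal Python sketch of dictionary, run-length, and delta encoding. These are textbook versions of the ideas only. ORC's actual on-disk encodings are more elaborate, and the function names are illustrative.

```python
def run_length_encode(values):
    """Collapse runs of equal values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []
    index_of = {}
    codes = []
    for v in values:
        if v not in index_of:
            index_of[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index_of[v])
    return dictionary, codes


def delta_encode(values):
    """Store the first value, then the difference between each pair of neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


countries = ["CN", "CN", "CN", "US", "US", "CN"]
dictionary, codes = dictionary_encode(countries)
print(dictionary, codes)                        # ['CN', 'US'] [0, 0, 0, 1, 1, 0]
print(run_length_encode(codes))                 # [[0, 3], [1, 2], [0, 1]]
print(delta_encode([1000, 1001, 1002, 1005]))   # [1000, 1, 1, 3]
```

The encodings compose: a low-cardinality string column becomes small integer codes, and runs of identical codes then collapse further under run-length encoding.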

ORC stores the top level index at the end of the file. The overall structure of the file is given in the figure above. The file’s tail consists of 3 parts; the file metadata, file footer and postscript.
The metadata for ORC is stored using Protocol Buffers, which provides the ability to add new fields without breaking readers.

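ORC stores its top-level index at the end of the file. The file tail has three parts: metadata, footer, and postscript. The metadata is stored using Protocol Buffers, so new fields can be added without breaking readers of older data.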
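Because the index lives at the end, a reader works backwards from the last bytes of the file. The toy Python sketch below mimics that idea under loud assumptions: the real postscript and footer are Protocol Buffers messages, whereas here the postscript is just two fixed-width integers packed with `struct`, and `write_toy_file` / `read_tail` are hypothetical helpers, not part of any ORC library. It keeps the real format's convention that the final byte holds the postscript length.

```python
import struct

POSTSCRIPT_FORMAT = "<QQ"  # toy postscript: footer length, metadata length


def write_toy_file(body: bytes, metadata: bytes, footer: bytes) -> bytes:
    """Lay out: body | metadata | footer | postscript | 1-byte postscript length."""
    postscript = struct.pack(POSTSCRIPT_FORMAT, len(footer), len(metadata))
    return body + metadata + footer + postscript + bytes([len(postscript)])


def read_tail(data: bytes):
    """Recover (metadata, footer) by reading backwards from the end of the file."""
    ps_len = data[-1]                              # last byte: postscript length
    ps_start = len(data) - 1 - ps_len
    footer_len, metadata_len = struct.unpack(
        POSTSCRIPT_FORMAT, data[ps_start:ps_start + ps_len]
    )
    footer_start = ps_start - footer_len
    metadata_start = footer_start - metadata_len
    return data[metadata_start:footer_start], data[footer_start:ps_start]


toy = write_toy_file(b"stripe-1stripe-2", b"STRIPE-STATS", b"SCHEMA+STRIPE-LIST")
metadata, footer = read_tail(toy)
print(metadata, footer)
```

A reader that follows this pattern can locate the footer without scanning the stripes.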

Stripes

The body of ORC files consists of a series of stripes. Stripes are large (typically ~200MB) and independent of each other and are often processed by different tasks. The defining characteristic for columnar storage formats is that the data for each column is stored separately and that reading data out of the file should be proportional to the number of columns read.

Every ORC file contains multiple stripes. Stripes are independent of each other and can be processed in parallel by different tasks.

In ORC files, each column is stored in several streams that are stored next to each other in the file.

The stripe footer contains the encoding of each column and the directory of the streams including their location.

Stripes have three sections: a set of indexes for the rows within the stripe, the data itself, and a stripe footer. Both the indexes and the data sections are divided by columns so that only the data for the required columns needs to be read.

A stripe has three parts: a set of indexes, the data, and a footer. In both the index and the data sections, each column is stored separately from the others.
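The following Python sketch illustrates how a stream directory lets a reader fetch only the columns it needs. It is a toy model: `StreamInfo`, `build_stripe`, and `read_columns` are hypothetical, the directory is kept as a Python list instead of being serialized into a stripe footer, and the "streams" are plain byte strings.

```python
from dataclasses import dataclass


@dataclass
class StreamInfo:
    column: str
    kind: str     # "ROW_INDEX" or "DATA"
    offset: int   # where the stream starts inside the stripe
    length: int


def build_stripe(columns):
    """Write all index streams, then all data streams; return (bytes, directory)."""
    buf = bytearray()
    directory = []
    index_payloads = {name: f"index({name})".encode() for name in columns}
    for kind, payloads in (("ROW_INDEX", index_payloads), ("DATA", columns)):
        for name, payload in payloads.items():
            directory.append(StreamInfo(name, kind, len(buf), len(payload)))
            buf += payload
    return bytes(buf), directory


def read_columns(stripe, directory, wanted):
    """Slice out only the DATA streams of the wanted columns."""
    result = {}
    bytes_read = 0
    for s in directory:
        if s.kind == "DATA" and s.column in wanted:
            result[s.column] = stripe[s.offset:s.offset + s.length]
            bytes_read += s.length
    return result, bytes_read


stripe, directory = build_stripe(
    {"id": b"1,2,3", "name": b"ann,bob,cat", "city": b"x,y,z"}
)
cols, bytes_read = read_columns(stripe, directory, {"name"})
print(cols, bytes_read, "of", len(stripe), "bytes read")
```

Only the bytes of the requested column are touched, which is the property that makes projection cheap in a columnar layout.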

The row group indexes consist of a ROW_INDEX stream for each primitive column that has an entry for each row group. Row groups are controlled by the writer and default to 10,000 rows. Each RowIndexEntry gives the position of each stream for the column and the statistics for that row group.

A row group is 10,000 rows by default.

Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards. Predicate pushdown can make use of bloom filters to better prune the row groups that do not satisfy the filter condition. A BLOOM_FILTER stream records a bloom filter entry for each row group (default to 10,000 rows) in a column. Only the row groups that satisfy min/max row index evaluation will be evaluated against the bloom filter index.

Bloom filters were introduced in Hive 1.2 and give predicate pushdown better support for pruning row groups.
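A small Python sketch of the two-stage check described above: the min/max test runs first, and only row groups that pass it are checked against a per-group bloom filter. The `BloomFilter` class, its sizes, and the SHA-256-based hashing are assumptions chosen for readability, not what ORC itself uses. `candidate_groups` is likewise a hypothetical helper.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        return all((self.bits >> p) & 1 for p in self._positions(value))


def candidate_groups(row_groups, filters, target):
    """Indexes of row groups that pass min/max and then the bloom filter."""
    hits = []
    for i, (rows, bloom) in enumerate(zip(row_groups, filters)):
        if min(rows) <= target <= max(rows) and bloom.might_contain(target):
            hits.append(i)
    return hits


# Three tiny "row groups" holding only even numbers.
row_groups = [list(range(start, start + 20, 2)) for start in (0, 100, 200)]
filters = []
for rows in row_groups:
    bloom = BloomFilter()
    for v in rows:
        bloom.add(v)
    filters.append(bloom)

print(candidate_groups(row_groups, filters, 108))  # value present in group 1
print(candidate_groups(row_groups, filters, 107))  # inside group 1's range, but absent
```

The value 107 lies inside the second group's min/max range, so statistics alone cannot rule that group out. The bloom filter usually can, at the cost of a small false-positive rate.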

Original article: https://www.cnblogs.com/barneywill/p/9923587.html

Time: 2024-08-30 11:48:36
