<Using Parquet with Impala>

Operations in Impala

Create table

  • stored as parquet
  • like parquet '/user/etl/datafile1' stored as parquet (see the sketch below)
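  • A minimal sketch of both forms (the table and column names here are illustrative, not from the source):
    • create table parquet_demo (id BIGINT, s STRING) stored as parquet;
    • -- derive the column definitions from an existing Parquet data file:
    • create table parquet_clone like parquet '/user/etl/datafile1' stored as parquet;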

Loading data

  • shuffle / noshuffle insert hints to choose from (sketch below)
  • Use insert ... select rather than insert ... values, because the latter produces a separate tiny data file.
  • Impala decodes the column data in Parquet files based on the ordinal position of the columns.
  • Loading data into Parquet tables is a memory-intensive operation --> data is buffered until it reaches one data block in size, then organized and compressed in memory before being written out.
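  • A hedged sketch of a typical load (staging_data and the state partition column are assumptions; the [shuffle] hint goes immediately before the select):
    • insert into census_data partition (state) [shuffle] select id, income, state from staging_data;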

Query performance

  • Query performance for Parquet tables depends on the number of columns needed to process the SELECT list and WHERE clauses of the query, the way data is divided into large data files with block size equal to file size, the reduction in I/O by reading the data for each column in compressed format, which data files can be skipped (for partitioned tables), and the CPU overhead of decompressing the data for each column.
  • An efficient query for a Parquet table:
    • select avg(income) from census_data where state = 'CA';
    • Only two columns are processed;
    • If the table is partitioned by the state column, this is even more efficient: the query needs to read and decode only one column from each data file, and only the files under the 'CA' partition directory need to be read.
  • An inefficient query:
    • select * from census_data;
    • Impala has to read the entire contents and decompress each column for each row group;
    • It does not take advantage of the unique strengths of Parquet data files.
  • Impala can optimize queries on Parquet tables, especially join queries, better when statistics are available for all the tables. See the COMPUTE STATS Statement section below.
  • In summary, efficient queries can be achieved in these ways:
    • On the query side: have the SELECT list and WHERE clauses touch only the necessary columns, and filter on partitioned columns wherever possible to reduce the number of files that must be scanned.
    • On the file side: choose an appropriate block size and compression codec, and use partitioning.

Partitioning for Parquet tables

  • Partitioning: An important performance technique.
  • The Parquet file format is ideal for tables containing many columns, where most queries only refer to a small subset of the columns.
  • By using WHERE clauses that refer to the partition key columns, Impala can skip the data files for certain partitions entirely (see the sketch after this list).
  • Inserting into a partitioned Parquet table can be a resource-intensive operation.
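  • A minimal sketch of partition pruning, reusing the census_data example from above (partitioning by state is an assumption):
    • create table census_data (id BIGINT, income DOUBLE) partitioned by (state STRING) stored as parquet;
    • -- only the files under the state='CA' partition directory are read:
    • select avg(income) from census_data where state = 'CA';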

Snappy and GZip Compression

  • Use the INSERT statement with the COMPRESSION_CODEC query option to control the underlying compression --> snappy (the default), gzip, and none (sketch below).
  • TBD...
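  • A short sketch of switching codecs per impala-shell session (census_data_gzip is an illustrative table name):
    • set COMPRESSION_CODEC=gzip;
    • insert into census_data_gzip select * from census_data;
    • set COMPRESSION_CODEC=snappy; -- back to the default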

For Impala Complex Types

  • Impala supports the complex types ARRAY, STRUCT, and MAP; see the Complex Types section below.

How Parquet data files are organized

  • Although Parquet is a column-oriented file format, it keeps all the data for a row within the same data file, to ensure that the columns for a row are always available on the same node for processing.
  • Within that data file, the data for a set of rows is rearranged so that all the values from the first column are organized in one contiguous block, then all the values from the second column, and so on.

RLE and Dictionary Encoding

  • Parquet uses some automatic compression techniques, such as run-length encoding (RLE) and dictionary encoding, based on analysis of the actual data values.
  • RLE: condenses sequences of repeated data values. For example, if many consecutive rows all contain the same value for a country code, those repeating values can be represented by the value followed by a count of how many times it appears consecutively.
  • Dictionary encoding: builds a dictionary of the distinct values in a column and substitutes a compact index into that dictionary for each value; effective when a column has a small number of distinct values.

COMPUTE STATS Statement

  • Gathers information about the volume and distribution of data in a table and all associated columns and partitions.
  • The information is stored in the metastore database and is used by Impala to help optimize queries.
    • For example, if Impala can determine that a table is large or small, or has many or few distinct values, it can organize and parallelize the work appropriately for a join query or insert operation.
  • COMPUTE STATS
  • TBD...
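  • A minimal sketch of gathering and inspecting statistics (census_data reused from the earlier examples):
    • compute stats census_data;
    • show table stats census_data;  -- row count, file count, size per partition
    • show column stats census_data; -- per-column distinct values, nulls, sizes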

Complex Types

  • Complex types (also referred to as nested types) let you represent multiple data values within a single row/column position.

Benefits

  • Reasons for using complex types:

    • You already have data produced by Hive or another non-Impala component that uses complex type columns;
    • Your data model originates with a non-SQL programming language or a NoSQL data management system. For example, if you are representing Python data expressed as nested lists, dictionaries, and tuples, those data structures correspond closely to Impala ARRAY, MAP, and STRUCT types;
    • Your analytic queries involving multiple tables could benefit from greater locality during join processing. By packing more related data items within each HDFS data block, complex types let join queries avoid the network overhead of the traditional Hadoop shuffle or broadcast join techniques.

Overview

  • Array, map, struct

    • The ARRAY and MAP types are closely related: they represent collections with arbitrary numbers of elements, where each element is the same type. (The key of a MAP is not necessarily unique.)
    • STRUCT groups together a fixed number of items into a single element.
  • The elements of an ARRAY or MAP, or the fields of a STRUCT, can also be other complex types. --> nested
  • When visualizing your data model in familiar SQL terms, you can think of each ARRAY or MAP as a miniature table, and each STRUCT as a row within such a table.
  • By default, the table represented by an ARRAY has two columns, POS to represent ordering of elements, and ITEM representing the value of each element. Likewise, by default, the table represented by a MAP encodes key-value pairs, and therefore has two columns, KEY and VALUE.
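  • A hedged sketch of those pseudocolumns, using the contacts tables defined under DDL Statements below:
    • select c.name, p.pos, p.item from contacts_array_of_phones c, c.phone_number p;
    • select c.name, p.key, p.value from contacts_unlimited_phones c, c.phone_number p;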

Design Considerations

How complex types differ from traditional data warehouse schemas

  • In traditional data warehousing, related values were typically arranged in one of two ways:

    • Split across normalized tables, using foreign key columns. This arrangement avoided duplicate data, so the data was compact. But join queries could be expensive because the related data had to be retrieved from separate locations.
    • Flattened into a single denormalized table. This removes the need for join queries, but values are repeated. The extra data volume could cause performance issues in other parts of the workflow, such as longer ETL cycles or more expensive full-table scans during queries.
  • Complex types represent a middle ground that addresses these performance and volume concerns.
  • ...

Using complex types from SQL

DDL Statements

  • create table contacts_array_of_phones
    (
        id BIGINT,
        name STRING,
        address STRING,
        phone_number ARRAY<STRING>
    ) stored as parquet;
  • create table contacts_unlimited_phones
    (
        id BIGINT,
        name STRING,
        address STRING,
        phone_number MAP<STRING, STRING>
    ) stored as parquet;
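  • Complex types can nest. A sketch of a customer table like the one queried in the examples below (a TPC-H-style schema is an assumption here, abbreviated to two STRUCT fields):
    • create table customer
      (
          c_custkey BIGINT,
          c_name STRING,
          c_orders ARRAY<STRUCT<o_orderkey:BIGINT, o_orderstatus:STRING>>
      ) stored as parquet;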

SQL statements that support complex types

  • Currently: CREATE TABLE, ALTER TABLE, DESCRIBE, LOAD DATA, and SELECT.
  • The result set of an Impala query always contains all scalar types; the elements and fields within any complex type column must be "unpacked" using join queries. --> A query cannot directly retrieve the entire value for a complex type column.
    • select c_orders from customer limit 1;
      
      ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING, ... l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'.
      Only scalar types are allowed in the select list.
    • -- only scalar types in the select list, and add region.r_nations to the from clause:
      select r_name, r_nations.item.n_name from region, region.r_nations limit 7;
  • select * only retrieves the scalar columns.
  • The following queries are equivalent. Each returns customer and order data for customers that have at least one order.
    • select c.c_name, o.o_orderkey
      from customer c, c.c_orders o limit 5;
      +--------------------+------------+
      | c_name             | o_orderkey |
      +--------------------+------------+
      | Customer#000072578 | 558821     |
      | Customer#000072578 | 2079810    |
      | Customer#000072578 | 5768068    |
      | Customer#000072578 | 1805604    |
      | Customer#000072578 | 3436389    |
      +--------------------+------------+
    • select c.c_name, o.o_orderkey
      from customer c
      inner join c.c_orders o
      limit 5;
      
      +--------------------+------------+
      | c_name             | o_orderkey |
      +--------------------+------------+
      | Customer#000072578 | 558821     |
      | Customer#000072578 | 2079810    |
      | Customer#000072578 | 5768068    |
      | Customer#000072578 | 1805604    |
      | Customer#000072578 | 3436389    |
      +--------------------+------------+
  • The following query, using an outer join, returns customers that have orders, plus customers with no orders (no entries in the C_ORDERS array):
    • select c.c_custkey, o.o_orderkey
      from customer c left outer join c.c_orders o
      limit 5;
      
      +-----------+------------+
      | c_custkey | o_orderkey |
      +-----------+------------+
      | 60210     | NULL       |
      | 147873    | NULL       |
      | 72578     | 558821     |
      | 72578     | 2079810    |
      | 72578     | 5768068    |
      +-----------+------------+
  • Correlated subqueries. Note the correlated reference to the table alias C. The COUNT(*) operation applies to all the elements of the C_ORDERS array for the corresponding row (that is, per customer), avoiding the need for a GROUP BY clause.
    • select c_name, howmany
      from customer c,
          (select count(*) howmany from c.c_orders) v
      limit 5;
      
      +--------------------+---------+
      | c_name             | howmany |
      +--------------------+---------+
      | Customer#000030065 | 15      |
      | Customer#000065455 | 18      |
      | Customer#000113644 | 21      |
      | Customer#000111078 | 0       |
      | Customer#000024621 | 0       |
      +--------------------+---------+
      