Understanding Cubert Concepts: BLOCK (Part 1)

Cubert Concepts

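To work with Cubert, we need to understand some of its core concepts, such as BLOCK. These concepts are what set Cubert apart from data processing flows built on the traditional relational paradigm (Pig, Hive), and they are the key reason Cubert wins on JOINs and aggregations over large-scale data. (In my own tests, CUBE computation was many times faster than in Hive.)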

BLOCK

Cubert defines the concept of a BLOCK, which comes in two kinds: Partitioned Blocks and Co-Partitioned Blocks.

Cubert stores these blocks in a special format called the Rubix format.

Partitioned Blocks

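Taken literally, these are blocks produced by partitioning.

Suppose there is a pageviews table with three columns: memberId (int), pagekey (string), and timestamp (long).

In HDFS, such data is typically split into files (part-00000.avro, part-00001.avro, etc.) placed under a single directory, and by default the data is neither partitioned nor sorted.

In the Cubert world, however, we encourage storing data in a more structured way.

More precisely, we want the data partitioned on some partition keys into data units; these units are Cubert's Partitioned Blocks. We also want the rows within each block to be sorted on certain columns.

PS: Two concepts are involved here, PartitionKeys and SortKeys, which correspond to the partition keys and sort keys mentioned above.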

BLOCKGEN

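The process of converting raw data into partitioned, sorted data units is called BLOCKGEN. It is a very important operator in the Cubert language.

The accompanying figure shows:

1. We have a table with two columns, JK and GK.

2. BLOCKGEN picks JK as the partition key and partitions the data on it. Within each resulting block, it then picks GK as the sort key and sorts the rows.

3. The raw data is thereby divided into two partitioned blocks, BLOCK#1 and BLOCK#2.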
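
The three steps above can be sketched in a few lines of Python. This is only a toy model of the idea, not Cubert's implementation: the sample rows, the function name `blockgen`, and the rule that places a JK value into a block (`jk % num_blocks`) are all assumptions made for illustration.

```python
from collections import defaultdict

def blockgen(rows, num_blocks):
    """Toy BLOCKGEN: partition (jk, gk) rows on jk, then sort each block on gk."""
    blocks = defaultdict(list)
    for jk, gk in rows:
        # Assumed placement rule: rows with the same JK always share a block.
        blocks[jk % num_blocks].append((jk, gk))
    # Sort inside each block only; there is no ordering across blocks.
    return [sorted(blocks[b], key=lambda row: row[1]) for b in sorted(blocks)]

# Hypothetical table with two columns, JK and GK.
table = [(1, 30), (2, 10), (1, 20), (4, 50), (3, 40), (2, 60)]
block1, block2 = blockgen(table, num_blocks=2)
print(block1)  # [(2, 10), (4, 50), (2, 60)]
print(block2)  # [(1, 20), (1, 30), (3, 40)]
```

Every JK value appears in exactly one block, and GK is ascending within each block, which is the structure the later join and cube operators rely on.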

BLOCKGEN Checklist

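As Cubert developers, we need to follow four rules:

1. Define the PartitionKeys

Choose which columns of the dataset to partition on.

For example:

Take the pageviews table:

If the partition key is memberId, then all rows with memberId=1234 are guaranteed to end up in the same partitioned block.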

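2. Define the SortKeys (optional)

Choose which columns of the dataset to sort on. If none are specified, the sort keys default to the partition keys.

Note: this is not a global sort; rows are only sorted locally within each already-partitioned block.

For example:

Again with the pageviews table:

After partitioning, the rows inside each block can be sorted on the timestamp column.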
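
The difference between a local and a global sort is easy to check with a small Python sketch. The rows and the two blocks below are made-up sample data, and the sketch assumes partitioning on memberId has already happened.

```python
# Hypothetical pageviews rows: (memberId, pagekey, timestamp).
# Assume partitioning on memberId has already produced these two blocks.
blocks = [
    [(1234, "home", 300), (1234, "jobs", 100), (42, "feed", 200)],
    [(77, "home", 250), (9, "profile", 50)],
]

# Sort each block on the sort key (timestamp) independently.
sorted_blocks = [sorted(block, key=lambda row: row[2]) for block in blocks]

def timestamps(rows):
    return [row[2] for row in rows]

# Each block is ordered on its own...
locally_sorted = all(timestamps(b) == sorted(timestamps(b)) for b in sorted_blocks)

# ...but reading the blocks back to back does not give a global order.
concatenated = [row for block in sorted_blocks for row in block]
globally_sorted = timestamps(concatenated) == sorted(timestamps(concatenated))

print(locally_sorted, globally_sorted)  # True False
```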

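3. Define the cost function

Partitioning has come up repeatedly, but how exactly are blocks delimited? This is where the cost function comes in:

  • BY ROW: split by row count, i.e. the maximum number of rows per block. Once the threshold is exceeded, a new block is started.
  • BY PARTITION KEYS: split by partition keys, so that each block holds the specified number of partition keys. If the partition keys form a primary key, the effect is similar to BY ROW.
  • BY SIZE: split by block size, in bytes. Once the specified threshold is exceeded, a new block is started.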
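
The three cost functions can be compared with a small Python sketch. It is a simplified model, not Cubert's code: `split_blocks` and the three cost helpers are hypothetical names, the byte estimate is deliberately crude, and the sketch assumes a block is only cut between partition keys so that the guarantee from rule 1 still holds.

```python
from itertools import groupby

def split_blocks(rows, cost, limit):
    """Cut rows (sorted on memberId) into blocks under a cost limit.

    Assumption: a block is only cut between partition keys, so all rows of
    one memberId stay together even if that group alone exceeds the limit.
    """
    blocks, current, used = [], [], 0
    for _, group in groupby(rows, key=lambda row: row[0]):
        group = list(group)
        added = cost(group)
        if current and used + added > limit:
            blocks.append(current)   # threshold exceeded: start a new block
            current, used = [], 0
        current.extend(group)
        used += added
    if current:
        blocks.append(current)
    return blocks

def by_row(group):             # BY ROW: every row costs 1
    return len(group)

def by_partition_keys(group):  # BY PARTITION KEYS: every distinct key costs 1
    return 1

def by_size(group):            # BY SIZE: rough byte estimate per row
    return sum(len(repr(row).encode("utf-8")) for row in group)

# Hypothetical (memberId, pagekey) rows, already ordered by memberId.
rows = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"), (3, "f"), (4, "g")]

sizes_by_row = [len(b) for b in split_blocks(rows, by_row, limit=3)]
sizes_by_keys = [len(b) for b in split_blocks(rows, by_partition_keys, limit=2)]
size_blocks = split_blocks(rows, by_size, limit=20)
print(sizes_by_row)   # [3, 3, 1]
print(sizes_by_keys)  # [3, 4]
```

In this sample memberId is not a primary key, so BY ROW and BY PARTITION KEYS cut the data differently; with one row per memberId the two would produce the same blocks.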
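
4. Store the result in RUBIX format (required)

RUBIX is a special data format that stores index details for the data along with some metadata that the BLOCKGEN process requires.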

Creating Partitioned Blocks (Demo)

Note: BLOCKGEN is a shuffle command.

Partition key for this program: memberId

Sort key: timestamp

JOB "our first BLOCKGEN"
        REDUCERS 10;
        MAP {
                data = LOAD "/path/to/data" USING AVRO();
        }
        // Create blocks that are (a) partitioned on memberId, (b) sorted on timestamp, and
        // (c) have a size of 1000 rows
        BLOCKGEN data BY ROW 1000 PARTITIONED ON memberId SORTED ON timestamp;

        // ALWAYS store BLOCKGEN data using RUBIX file format!
        STORE data INTO "/path/to/output" USING RUBIX();
END

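Because the number of reducers is set to 10, the output consists of 10 .rbx part files (part-r-00000.rbx through part-r-00009.rbx).

Note: each rbx file can contain one or more blocks, so there is no need to worry about producing too many files.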

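References

Cubert official documentation: blocks

PS: This article combines a translation of the official Cubert documentation with my own understanding of Cubert :)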

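Original article. When reposting, please credit:

Reposted from: OopsOutOfMemory盛利的Blog, author: OopsOutOfMemory

Link to this article:

Note: This article is licensed under Attribution-NonCommercial-NoDerivs 2.5 China Mainland (CC BY-NC-ND 2.5 CN). Reposting, sharing, and comments are welcome, but please retain the author attribution and the article link. For commercial use or licensing questions, please contact me.

Copyright notice: This is an original article by the blogger and may not be reposted without the blogger's permission.

Time: 2024-10-06 16:42:57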
