Hive: unexpected behavior when overwriting partition data

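Problem description:

Given an existing table test (external or managed), add a partition and specify its data location as follows:

    alter table test add partition(day='20140101')
    location '20140101';

By default this creates a directory of the form /{warehouse}/test/20140101/ under the table's warehouse path.

Running desc formatted test partition(day='20140101') shows the corresponding location:

    hdfs://..:../{warehouse}/test/20140101/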

Then insert data into the partition with insert overwrite:

    insert overwrite table test partition (day='20140101')
    select xx from xx....;

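Normally this works as expected. However, when the property fs.hdfs.impl.disable.cache is set to true, the following happens.

desc formatted test partition(day='20140101') now reports the location in this form:

    hdfs://..:../{warehouse}/test/day=20140101/

In addition, a new directory /{warehouse}/test/day=20140101/ is created on HDFS, and the partition's previous location, /{warehouse}/test/20140101/, is deleted.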

Resolution:

(1) Start with the execution plan of the HQL statement, which looks roughly like this:

    Stage: Stage-1
      Map Reduce
        Alias -> Map Operator Tree:
          dual
            TableScan
              alias: dual
              Select Operator
                expressions:
                      expr: '1'
                      type: string
                      expr: '2'
                      type: string
                outputColumnNames: _col0, _col1
                File Output Operator
                  compressed: true
                  GlobalTableId: 1
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                      name: default.test

    Stage: Stage-4
      Move Operator
        files:
            hdfs directory: true
            destination: hdfs://hadoop_namenode/tmp/hive-root/hive_2015-01-07_18-07-13_120_2026314954951095577/-ext-10000

    Stage: Stage-0
      Move Operator
        tables:
            partition:
              day 20140101
            replace: true --overwrite
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                name: default.test

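The plan shows that the preceding MapReduce stage does not create or delete the partition directory. The operation that actually touches the data is the Move Operator.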

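(2) Find the task class behind the Move Operator: org.apache.hadoop.hive.ql.exec.MoveTask (MoveTask.java).

Its method public int execute(DriverContext driverContext) {...} runs when the move is performed. The method is long, so only the part that matters here is shown.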

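    // static partitions: for a static partition, the work is done by this call
    db.loadPartition(tbd.getSourcePath(),  // location of the source data, i.e. the temporary directory holding the MapReduce output
        tbd.getTable().getTableName(),     // table name
        tbd.getPartitionSpec(),            // spec of the target partition
        tbd.getReplace(),                  // whether to overwrite existing data (insert overwrite)
        tbd.getHoldDDLTime(),              // if true, force [re]create the partition; the partition is created if missing
        tbd.getInheritTableSpecs(),        // whether the [re]created partition inherits its specs from the table; defaults to true
        isSkewedStoredAsDirs(tbd));        // whether the table is a skewed table stored as directories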

(3) Follow the call into org.apache.hadoop.hive.ql.metadata.Hive (Hive.java) and locate the loadPartition method. The relevant section is:

    Partition oldPart = getPartition(tbl, partSpec, false);
    Path oldPartPath = null;
    if (oldPart != null) {
        // location defined for the partition; /{warehouse}/test/20140101/ in our example
        oldPartPath = oldPart.getDataLocation();
    }

    Path newPartPath = null;
    if (inheritTableSpecs) { // defaults to true
        Path partPath = new Path(tbl.getDataLocation(), Warehouse.makePartPath(partSpec));
        // path built from the table location and the partition values;
        // /{warehouse}/test/day=20140101/ in our example
        newPartPath = new Path(loadPath.toUri().getScheme(), loadPath.toUri().getAuthority(),
            partPath.toUri().getPath());

        if (oldPart != null) {
            /*
             * If we are moving the partition across filesystem boundaries
             * inherit from the table properties. Otherwise (same filesystem) use the
             * original partition location.
             *
             * See: HIVE-1707 and HIVE-2117 for background
             */
            /*
             * fs.hdfs.impl.disable.cache affects the two calls below: it decides whether
             * oldPartPathFS and loadPathFS refer to the same object, and therefore which
             * value newPartPath ends up with.
             */
            FileSystem oldPartPathFS = oldPartPath.getFileSystem(getConf()); // partition location
            FileSystem loadPathFS = loadPath.getFileSystem(getConf());       // source data
            if (oldPartPathFS.equals(loadPathFS)) {
                newPartPath = oldPartPath;
            }
        }
    } else {
        newPartPath = oldPartPath;
    }

The variable newPartPath is the destination of the data move, so once its value is known, we know where the data ends up.
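The branch structure above can be modeled without any Hive classes. The following is a minimal sketch, not Hive code: the class PartitionPathSketch, its method names, and the literal /warehouse/test (standing in for /{warehouse}/test) are all hypothetical, paths are plain strings, and the boolean sameFileSystem stands in for the result of oldPartPathFS.equals(loadPathFS). The makePartPath helper is a simplification of Warehouse.makePartPath that only joins key=value pairs.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Pure-JDK model of the target-path decision in loadPartition; not Hive code. */
class PartitionPathSketch {

    /** Simplified stand-in for Warehouse.makePartPath: joins key=value pairs with '/'. */
    static String makePartPath(Map<String, String> partSpec) {
        StringJoiner joiner = new StringJoiner("/");
        for (Map.Entry<String, String> entry : partSpec.entrySet()) {
            joiner.add(entry.getKey() + "=" + entry.getValue());
        }
        return joiner.toString();
    }

    /**
     * Mirrors the branches quoted above. oldPartPath is null when the partition
     * does not exist yet; sameFileSystem models oldPartPathFS.equals(loadPathFS).
     */
    static String chooseNewPartPath(String tableLocation, Map<String, String> partSpec,
                                    String oldPartPath, boolean inheritTableSpecs,
                                    boolean sameFileSystem) {
        if (!inheritTableSpecs) {
            return oldPartPath;
        }
        // Default location derived from the table location and the partition values.
        String newPartPath = tableLocation + "/" + makePartPath(partSpec);
        if (oldPartPath != null && sameFileSystem) {
            // Same filesystem: keep the partition's existing location.
            newPartPath = oldPartPath;
        }
        return newPartPath;
    }

    public static void main(String[] args) {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("day", "20140101");
        String table = "/warehouse/test";
        String old = "/warehouse/test/20140101";
        System.out.println(chooseNewPartPath(table, spec, old, true, true));
        System.out.println(chooseNewPartPath(table, spec, old, true, false));
    }
}
```

In this model the only input that changes between the normal and the abnormal case is sameFileSystem, which is why the rest of the analysis concentrates on how the two FileSystem objects are obtained.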

(4) How the target path is determined

Step into org.apache.hadoop.fs.Path (Path.java) and look at how the method getFileSystem(Configuration) is implemented:

    public FileSystem getFileSystem(Configuration conf)
        throws IOException
    {
        return FileSystem.get(toUri(), conf);
    }

Continue into FileSystem.get(toUri(), conf) in org.apache.hadoop.fs.FileSystem (FileSystem.java). The key part of public static FileSystem get(URI uri, Configuration conf){...} is:

    String disableCacheName = String.format("fs.%s.impl.disable.cache", new Object[] { scheme });
    if (conf.getBoolean(disableCacheName, false))
        // With fs.hdfs.impl.disable.cache=true, every FileSystem.get(...) call creates a new
        // FileSystem object. The equals call here compares object references, so
        // oldPartPathFS.equals(loadPathFS) above is always false.
        return createFileSystem(uri, conf);
    else
        // With fs.hdfs.impl.disable.cache=false, the FileSystem object is looked up in CACHE.
        // Both paths are on the same HDFS, so they get the same object and
        // oldPartPathFS.equals(loadPathFS) above is true.
        return CACHE.get(uri, conf);

With that in mind, look again at this fragment from (3):

    if (oldPartPathFS.equals(loadPathFS)) {
        newPartPath = oldPartPath;
    }

With fs.hdfs.impl.disable.cache=false, oldPartPathFS.equals(loadPathFS) returns true and newPartPath takes the value of oldPartPath, which is /{warehouse}/test/20140101/ in the example. Otherwise newPartPath keeps its initial value, /{warehouse}/test/day=20140101/.

Since we set fs.hdfs.impl.disable.cache=true, newPartPath ends up as /{warehouse}/test/day=20140101/.
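The effect of the cache switch on equals can be reproduced with plain JDK classes. This is a minimal sketch under stated assumptions, not the Hadoop implementation: FsCacheSketch and FakeFileSystem are hypothetical names, the cache is keyed by a single string (scheme plus authority) for brevity, and hdfs://namenode:8020 is a made-up address. FakeFileSystem deliberately does not override equals, so two instances compare equal only when they are the same object.

```java
import java.util.HashMap;
import java.util.Map;

/** Pure-JDK model of the caching switch in FileSystem.get; all names are hypothetical. */
class FsCacheSketch {

    /** Stand-in for a FileSystem instance; it inherits Object.equals (reference identity). */
    static final class FakeFileSystem {
        final String key;

        FakeFileSystem(String key) {
            this.key = key;
        }
    }

    private static final Map<String, FakeFileSystem> CACHE = new HashMap<>();

    /** disableCache models fs.<scheme>.impl.disable.cache. */
    static FakeFileSystem get(String schemeAndAuthority, boolean disableCache) {
        if (disableCache) {
            // Cache bypassed: every call builds a fresh instance.
            return new FakeFileSystem(schemeAndAuthority);
        }
        // Cache enabled: the same key always yields the same instance.
        return CACHE.computeIfAbsent(schemeAndAuthority, FakeFileSystem::new);
    }

    public static void main(String[] args) {
        String nameNode = "hdfs://namenode:8020"; // hypothetical address
        System.out.println(get(nameNode, false).equals(get(nameNode, false))); // prints true
        System.out.println(get(nameNode, true).equals(get(nameNode, true)));   // prints false
    }
}
```

The two instances created with the cache disabled point at the same filesystem, yet they are not equal, which is the situation the same-filesystem check in loadPartition does not anticipate.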

(5) Moving the data: back in org.apache.hadoop.hive.ql.metadata.Hive (Hive.java)

    /* We ran insert overwrite, so replace is true and the data is ultimately moved to newPartPath */
    if (replace) { // whether to replace the existing data
        Hive.replaceFiles(loadPath, newPartPath, oldPartPath, getConf());
    } else {
        FileSystem fs = tbl.getDataLocation().getFileSystem(conf);
        Hive.copyFiles(conf, loadPath, newPartPath, fs);
    }

Follow the call into void replaceFiles(Path srcf, Path destf, Path oldPath, HiveConf conf){...} to see what it does to the data.

The method performs two main operations.

1. Delete the existing data at oldPath, which in our example is /{warehouse}/test/20140101/:

    if (fs2.exists(oldPath)) {
        // use FsShell to move data to .Trash first rather than delete permanently
        FsShell fshell = new FsShell();
        fshell.setConf(conf);
        fshell.run(new String[]{"-rmr", oldPath.toString()});
    }

2. Rename the source data to the target path to complete the move (srcf -> destf). In our example destf is now /{warehouse}/test/day=20140101/:

    boolean b = renameFile(conf, srcs[0].getPath(), destf, fs, true);

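To summarize: because fs.hdfs.impl.disable.cache=true, the FileSystem objects are no longer taken from the cache, so newPartPath never receives the value of oldPartPath and stays at /{warehouse}/test/day=20140101/. Hive therefore creates that new directory on HDFS and deletes the data at oldPartPath, which removes /{warehouse}/test/20140101/ and every file under it. That is exactly the behavior described at the beginning.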
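The delete-then-rename sequence can be replayed on a local temporary directory to see both outcomes side by side. This is only a sketch of the two steps described above, using java.nio.file instead of HDFS: ReplaceFilesSketch and its methods are hypothetical names, the directory layout imitates the example, and the delete is permanent here, whereas the quoted Hive code goes through FsShell so that data can land in .Trash.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Local-filesystem model of the delete-then-rename steps; not Hive code. */
class ReplaceFilesSketch {

    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order so that children are deleted before their parents.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    /** Step 1: remove oldPath. Step 2: rename srcDir to destDir. */
    static void replaceFiles(Path srcDir, Path destDir, Path oldPath) throws IOException {
        deleteRecursively(oldPath);
        Files.createDirectories(destDir.getParent());
        Files.move(srcDir, destDir);
    }

    /** Returns { custom location still exists, default day=... location exists }. */
    static boolean[] simulate(boolean sameFileSystem) {
        try {
            Path root = Files.createTempDirectory("warehouse");
            Path oldPartPath = Files.createDirectories(root.resolve("test/20140101"));
            Files.write(oldPartPath.resolve("000000_0"), "old".getBytes());
            Path defaultPartPath = root.resolve("test/day=20140101");
            Path staging = Files.createDirectories(root.resolve("tmp/-ext-10000"));
            Files.write(staging.resolve("000000_0"), "new".getBytes());

            // Same decision as in loadPartition, reduced to one boolean.
            Path newPartPath = sameFileSystem ? oldPartPath : defaultPartPath;
            replaceFiles(staging, newPartPath, oldPartPath);

            boolean[] result = { Files.exists(oldPartPath), Files.exists(defaultPartPath) };
            deleteRecursively(root);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] cached = simulate(true);
        boolean[] uncached = simulate(false);
        System.out.println("cache enabled:  custom=" + cached[0] + ", default=" + cached[1]);
        System.out.println("cache disabled: custom=" + uncached[0] + ", default=" + uncached[1]);
    }
}
```

With the check evaluating to true, the custom directory is emptied and refilled in place. With it evaluating to false, the custom directory disappears and the data shows up under day=20140101, matching the symptom in the problem description.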

Time: 2024-12-05 16:10:38
