Hive File Compression Test

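Hive can store data in several formats, such as plain text, LZO-compressed text, and ORC. To understand how they relate to one another, I ran the following tests.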

I. Create the sample table

hive> create table tbl( id int, name string ) row format delimited fields terminated by '|' stored as textfile;

OK

Time taken: 0.338 seconds

hive> load data local inpath '/home/grid/users.txt' into table tbl;

Copying data from file:/home/grid/users.txt

Copying file: file:/home/grid/users.txt

Loading data to table default.tbl

Table default.tbl stats: [numFiles=1, numRows=0, totalSize=111, rawDataSize=0]

OK

Time taken: 0.567 seconds

hive> select * from tbl;

OK

1       Awyp

2       Azs

3       Als

4       Aww

5       Awyp2

6       Awyp3

7       Awyp4

8       Awyp5

9       Awyp6

10      Awyp7

11      Awyp8

12      Awyp5

13      Awyp9

14      Awyp20

Time taken: 0.237 seconds, Fetched: 14 row(s)

II. Write tests

1. No compression

hive> set hive.exec.compress.output;

hive.exec.compress.output=false

hive>

>

> create table tbltxt as select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498527794024_0001, Tracking URL = http://hadoop1:8088/proxy/application_1498527794024_0001/

Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1498527794024_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-27 10:55:29,906 Stage-1 map = 0%,  reduce = 0%

2017-06-27 10:55:39,532 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.66 sec

MapReduce Total cumulative CPU time: 2 seconds 660 msec

Ended Job = job_1498527794024_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-27_10-55-18_962_2187345348997213497-1/-ext-10001

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/tbltxt

Table default.tbltxt stats: [numFiles=1, numRows=14, totalSize=111, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1   Cumulative CPU: 2.66 sec   HDFS Read: 318 HDFS Write: 181 SUCCESS

Total MapReduce CPU Time Spent: 2 seconds 660 msec

OK

Time taken: 22.056 seconds

hive>

> show create table tbltxt;

OK

CREATE  TABLE `tbltxt`(

`id` int,

`name` string)

ROW FORMAT SERDE

'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

STORED AS INPUTFORMAT

'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

'hdfs://hadoop1:9000/user/hive/warehouse/tbltxt'

TBLPROPERTIES (

'COLUMN_STATS_ACCURATE'='true',

'numFiles'='1',

'numRows'='14',

'rawDataSize'='97',

'totalSize'='111',

'transient_lastDdlTime'='1498532140')

Time taken: 0.202 seconds, Fetched: 18 row(s)

hive>

>

> select * from tbltxt;

OK

1       Awyp

2       Azs

3       Als

4       Aww

5       Awyp2

6       Awyp3

7       Awyp4

8       Awyp5

9       Awyp6

10      Awyp7

11      Awyp8

12      Awyp5

13      Awyp9

14      Awyp20

Time taken: 0.059 seconds, Fetched: 14 row(s)

hive>

>

> dfs -ls /user/hive/warehouse/tbltxt;

Found 1 items

-rwxr-xr-x   1 grid supergroup        111 2017-06-27 10:55 /user/hive/warehouse/tbltxt/000000_0

hive>

>

> dfs -cat /user/hive/warehouse/tbltxt/000000_0;

1Awyp

2Azs

3Als

4Aww

5Awyp2

6Awyp3

7Awyp4

8Awyp5

9Awyp6

10Awyp7

11Awyp8

12Awyp5

13Awyp9

14Awyp20

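The input and output formats are:

STORED AS INPUTFORMAT

'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

The data reads back correctly. The data file is plain text and can be viewed directly with cat. (The two columns appear run together in the cat output because this table uses Hive's default field delimiter, Ctrl-A, which is not printable.)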
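The sizes reported in the table stats can be cross-checked by hand. The sketch below rebuilds the 14 rows in Python, assuming that the table created by CTAS uses Hive's default field delimiter Ctrl-A (`\x01`) and that every row ends with a newline; the names `ROWS` and `encode_rows` are illustrative only and do not come from Hive.

```python
# Rebuild the rows of tbltxt and compare the sizes with the table stats.
# Assumption: fields are separated by Hive's default delimiter \x01 (Ctrl-A)
# and every row is terminated by a newline.
ROWS = [
    (1, "Awyp"), (2, "Azs"), (3, "Als"), (4, "Aww"), (5, "Awyp2"),
    (6, "Awyp3"), (7, "Awyp4"), (8, "Awyp5"), (9, "Awyp6"), (10, "Awyp7"),
    (11, "Awyp8"), (12, "Awyp5"), (13, "Awyp9"), (14, "Awyp20"),
]


def encode_rows(rows, delimiter="\x01"):
    """Serialize rows the way a delimited text table stores them."""
    return "".join(f"{i}{delimiter}{name}\n" for i, name in rows).encode("utf-8")


data = encode_rows(ROWS)
total_size = len(data)                  # bytes in the data file
raw_data_size = total_size - len(ROWS)  # the same bytes without the row terminators

print(total_size, raw_data_size)  # prints: 111 97
```

Under these assumptions the result agrees with totalSize=111 in the stats, and the count without newlines happens to equal the reported rawDataSize=97.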

2. Compression enabled, using the default codec

hive>

> set hive.exec.compress.output=true;

hive>

>

> set mapred.output.compression.codec;

mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec

The current compression codec is the default one, DefaultCodec.

hive>

> create table tbldefault as select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498527794024_0002, Tracking URL = http://hadoop1:8088/proxy/application_1498527794024_0002/

Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1498527794024_0002

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-27 11:14:44,845 Stage-1 map = 0%,  reduce = 0%

2017-06-27 11:14:48,964 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.08 sec

MapReduce Total cumulative CPU time: 1 seconds 80 msec

Ended Job = job_1498527794024_0002

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-27_11-14-39_351_6035948930260680086-1/-ext-10001

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/tbldefault

Table default.tbldefault stats: [numFiles=1, numRows=14, totalSize=76, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1   Cumulative CPU: 1.08 sec   HDFS Read: 318 HDFS Write: 150 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 80 msec

OK

Time taken: 10.842 seconds

hive>

>

> show create table tbldefault;

OK

CREATE  TABLE `tbldefault`(

`id` int,

`name` string)

ROW FORMAT SERDE

'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

STORED AS INPUTFORMAT

'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

'hdfs://hadoop1:9000/user/hive/warehouse/tbldefault'

TBLPROPERTIES (

'COLUMN_STATS_ACCURATE'='true',

'numFiles'='1',

'numRows'='14',

'rawDataSize'='97',

'totalSize'='76',

'transient_lastDdlTime'='1498533290')

Time taken: 0.044 seconds, Fetched: 18 row(s)

hive>

>

> select * from tbldefault;

OK

1       Awyp

2       Azs

3       Als

4       Aww

5       Awyp2

6       Awyp3

7       Awyp4

8       Awyp5

9       Awyp6

10      Awyp7

11      Awyp8

12      Awyp5

13      Awyp9

14      Awyp20

Time taken: 0.037 seconds, Fetched: 14 row(s)

hive>

>

> dfs -ls /user/hive/warehouse/tbldefault;

Found 1 items

-rwxr-xr-x   1 grid supergroup         76 2017-06-27 11:14 /user/hive/warehouse/tbldefault/000000_0.deflate

hive>

> dfs -cat /user/hive/warehouse/tbldefault/000000_0.deflate;

xws

dfX0)60K:HBhive>

>

>

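With the default codec, the table's input and output formats are the same as for the uncompressed text table, but the data file is compressed by the default library and carries the suffix .deflate, so its contents cannot be viewed directly. This means that org.apache.hadoop.mapred.TextInputFormat recognizes the default compression from the file suffix and decompresses the data on read.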
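DefaultCodec writes a zlib stream, so the same kind of data can be compressed and decoded outside Hadoop as well. The following is a minimal sketch using Python's standard zlib module. The sample bytes are the first few rows rebuilt by hand (assuming the Ctrl-A delimiter), not the actual file from HDFS, and the compressed size is not expected to match the 76 bytes reported above.

```python
import zlib

# Hand-built sample standing in for the table data (assumed \x01 delimiter).
plain = b"1\x01Awyp\n2\x01Azs\n3\x01Als\n4\x01Aww\n"

compressed = zlib.compress(plain)       # zlib-wrapped DEFLATE stream
restored = zlib.decompress(compressed)  # what a reader does once the codec is known

# A zlib stream written with default settings starts with byte 0x78 ("x"),
# consistent with the "x..." at the start of the dfs -cat output.
print(compressed[:1], restored == plain)
```

To decode a real data file, one could copy it out of HDFS and pass its bytes to `zlib.decompress`. From the Hive CLI, `dfs -text <path>` should also print the decoded rows, provided the codec is registered on the cluster.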

3. LZO compression

hive>

> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

hive>

>

> create table tbllzo as select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498527794024_0003, Tracking URL = http://hadoop1:8088/proxy/application_1498527794024_0003/

Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1498527794024_0003

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-27 11:29:08,436 Stage-1 map = 0%,  reduce = 0%

2017-06-27 11:29:14,638 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.87 sec

MapReduce Total cumulative CPU time: 1 seconds 870 msec

Ended Job = job_1498527794024_0003

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-27_11-29-03_249_4340474818139134521-1/-ext-10001

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/tbllzo

Table default.tbllzo stats: [numFiles=1, numRows=14, totalSize=106, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1   Cumulative CPU: 1.87 sec   HDFS Read: 318 HDFS Write: 176 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 870 msec

OK

Time taken: 13.744 seconds

hive>

>

> show create table tbllzo;

OK

CREATE  TABLE `tbllzo`(

`id` int,

`name` string)

ROW FORMAT SERDE

'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

STORED AS INPUTFORMAT

'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

'hdfs://hadoop1:9000/user/hive/warehouse/tbllzo'

TBLPROPERTIES (

'COLUMN_STATS_ACCURATE'='true',

'numFiles'='1',

'numRows'='14',

'rawDataSize'='97',

'totalSize'='106',

'transient_lastDdlTime'='1498534156')

Time taken: 0.044 seconds, Fetched: 18 row(s)

hive>

> select * from tbllzo;

OK

1       Awyp

2       Azs

3       Als

4       Aww

5       Awyp2

6       Awyp3

7       Awyp4

8       Awyp5

9       Awyp6

10      Awyp7

11      Awyp8

12      Awyp5

13      Awyp9

14      Awyp20

Time taken: 0.032 seconds, Fetched: 14 row(s)

hive>

>

> dfs -ls /user/hive/warehouse/tbllzo;

Found 1 items

-rwxr-xr-x   1 grid supergroup        106 2017-06-27 11:29 /user/hive/warehouse/tbllzo/000000_0.lzo_deflate

hive>

>

> dfs -cat /user/hive/warehouse/tbllzo/000000_0.lzo_deflate;

ob1Awyp

2Azs

3Als

4Aww

5Awyp2

6

7

8

9

10

1

125

13Awyp9

14Awyp20

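With LZO compression, the table's input format is still org.apache.hadoop.mapred.TextInputFormat. The data file has the suffix .lzo_deflate, and its contents cannot be viewed directly. In other words, org.apache.hadoop.mapred.TextInputFormat also recognizes LZO compression and reads the data back. (Impressive!)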

4. LZOP compression

hive>

> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

hive>

> create table tbllzop as select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498527794024_0004, Tracking URL = http://hadoop1:8088/proxy/application_1498527794024_0004/

Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1498527794024_0004

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-27 11:37:28,010 Stage-1 map = 0%,  reduce = 0%

2017-06-27 11:37:32,127 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.1 sec

MapReduce Total cumulative CPU time: 2 seconds 100 msec

Ended Job = job_1498527794024_0004

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-27_11-37-23_099_3493082162039010112-1/-ext-10001

Moving data to: hdfs://hadoop1:9000/user/hive/warehouse/tbllzop

Table default.tbllzop stats: [numFiles=1, numRows=14, totalSize=148, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1   Cumulative CPU: 2.1 sec   HDFS Read: 318 HDFS Write: 219 SUCCESS

Total MapReduce CPU Time Spent: 2 seconds 100 msec

OK

Time taken: 10.233 seconds

hive>

>

> show create table tbllzop;

OK

CREATE  TABLE `tbllzop`(

`id` int,

`name` string)

ROW FORMAT SERDE

'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

STORED AS INPUTFORMAT

'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

'hdfs://hadoop1:9000/user/hive/warehouse/tbllzop'

TBLPROPERTIES (

'COLUMN_STATS_ACCURATE'='true',

'numFiles'='1',

'numRows'='14',

'rawDataSize'='97',

'totalSize'='148',

'transient_lastDdlTime'='1498534653')

Time taken: 0.046 seconds, Fetched: 18 row(s)

hive>

>

>

> select * from tbllzop;

OK

1       Awyp

2       Azs

3       Als

4       Aww

5       Awyp2

6       Awyp3

7       Awyp4

8       Awyp5

9       Awyp6

10      Awyp7

11      Awyp8

12      Awyp5

13      Awyp9

14      Awyp20

Time taken: 0.033 seconds, Fetched: 14 row(s)

hive>

>

> dfs -ls /user/hive/warehouse/tbllzop;

Found 1 items

-rwxr-xr-x   1 grid supergroup        148 2017-06-27 11:37 /user/hive/warehouse/tbllzop/000000_0.lzo

hive>

>

> dfs -cat /user/hive/warehouse/tbllzop/000000_0.lzo;

ob1Awyp

2Azs

3Als

4Aww

5Awyp2

6

7

8

9

10

1

125

13Awyp9

14Awyp20

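Likewise, with LZOP compression the table's input format is still org.apache.hadoop.mapred.TextInputFormat. The data file has the suffix .lzo, and its contents cannot be viewed directly. org.apache.hadoop.mapred.TextInputFormat recognizes LZOP compression and reads the data back.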
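The suffix-based detection seen in the four tests can be sketched as a simple lookup. The Python below is a simplified stand-in for the codec lookup Hadoop performs on the file name, not its actual code. The suffixes for DefaultCodec, LzoCodec, and LzopCodec are the ones observed above; the `.gz` and `.bz2` entries are the usual defaults for those codecs and were not tested in this post. The names `CODEC_BY_SUFFIX` and `codec_for` are hypothetical.

```python
# Default file suffixes of some common Hadoop codecs.
# Simplified illustration only: the real lookup also depends on which
# codecs are registered on the cluster.
CODEC_BY_SUFFIX = {
    ".deflate": "org.apache.hadoop.io.compress.DefaultCodec",
    ".gz": "org.apache.hadoop.io.compress.GzipCodec",      # not tested in this post
    ".bz2": "org.apache.hadoop.io.compress.BZip2Codec",    # not tested in this post
    ".lzo_deflate": "com.hadoop.compression.lzo.LzoCodec",
    ".lzo": "com.hadoop.compression.lzo.LzopCodec",
}


def codec_for(path):
    """Return the codec class name for a data file, or None for plain text."""
    # Try longer suffixes first so that the most specific match wins.
    for suffix in sorted(CODEC_BY_SUFFIX, key=len, reverse=True):
        if path.endswith(suffix):
            return CODEC_BY_SUFFIX[suffix]
    return None


for name in ("000000_0", "000000_0.deflate", "000000_0.lzo_deflate", "000000_0.lzo"):
    print(name, "->", codec_for(name))
```

A file with no known suffix, such as 000000_0, maps to no codec and is read as uncompressed text.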

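The cases above show that, whichever codec is used, Hive treats the data as plain text (merely compressed in different ways) that org.apache.hadoop.mapred.TextInputFormat can read. On insert, with hive.exec.compress.output set to true, Hive compresses solely according to mapred.output.compression.codec, regardless of the inputformat in the table definition. The following tests verify this:

1. Insert with mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec: the data file is LZOP-compressed and reads back correctly.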

hive> set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

hive>

> create table tbltest1( id int, name string )

> stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'

> outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

OK

Time taken: 0.493 seconds

hive>

> insert into table tbltest1 select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498660018952_0001, Tracking URL = http://hadoop1:8088/proxy/application_1498660018952_0001/

Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1498660018952_0001

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-28 22:59:27,886 Stage-1 map = 0%,  reduce = 0%

2017-06-28 22:59:36,427 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.25 sec

MapReduce Total cumulative CPU time: 2 seconds 250 msec

Ended Job = job_1498660018952_0001

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-28_22-59-14_730_4437480099583255943-1/-ext-10000

Loading data to table default.tbltest1

Table default.tbltest1 stats: [numFiles=1, numRows=14, totalSize=148, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1   Cumulative CPU: 2.25 sec   HDFS Read: 318 HDFS Write: 220 SUCCESS

Total MapReduce CPU Time Spent: 2 seconds 250 msec

OK

Time taken: 24.151 seconds

hive>

> dfs -ls /user/hive/warehouse/tbltest1;

Found 1 items

-rwxr-xr-x   1 grid supergroup        148 2017-06-28 22:59 /user/hive/warehouse/tbltest1/000000_0.lzo

hive>

> select * from tbltest1;

OK

1       Awyp

2       Azs

3       Als

4       Aww

5       Awyp2

6       Awyp3

7       Awyp4

8       Awyp5

9       Awyp6

10      Awyp7

11      Awyp8

12      Awyp5

13      Awyp9

14      Awyp20

Time taken: 0.055 seconds, Fetched: 14 row(s)

2. Insert with mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec: the data file uses the default compression and reads back correctly.

hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec;

hive> create table tbltest2( id int, name string )

> stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'

> outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

OK

Time taken: 0.142 seconds

hive> insert into table tbltest2 select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498660018952_0002, Tracking URL = http://hadoop1:8088/proxy/application_1498660018952_0002/

Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1498660018952_0002

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-28 23:09:06,439 Stage-1 map = 0%,  reduce = 0%

2017-06-28 23:09:11,668 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.15 sec

MapReduce Total cumulative CPU time: 1 seconds 150 msec

Ended Job = job_1498660018952_0002

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-28_23-09-01_674_9172062679713398655-1/-ext-10000

Loading data to table default.tbltest2

Table default.tbltest2 stats: [numFiles=1, numRows=14, totalSize=76, rawDataSize=97]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1   Cumulative CPU: 1.15 sec   HDFS Read: 318 HDFS Write: 148 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 150 msec

OK

Time taken: 11.278 seconds

hive>

>

>

> dfs -ls /user/hive/warehouse/tbltest2;

Found 1 items

-rwxr-xr-x   1 grid supergroup         76 2017-06-28 23:09 /user/hive/warehouse/tbltest2/000000_0.deflate

hive>

> select * from tbltest2;

OK

1       Awyp

2       Azs

3       Als

4       Aww

5       Awyp2

6       Awyp3

7       Awyp4

8       Awyp5

9       Awyp6

10      Awyp7

11      Awyp8

12      Awyp5

13      Awyp9

14      Awyp20

Time taken: 0.035 seconds, Fetched: 14 row(s)

3. When the table is stored as ORC, the data is compressed as the ORC table definition specifies, unaffected by mapred.output.compression.codec and hive.exec.compress.output.

hive>  set hive.exec.compress.output=false;

hive> create table tbltest3( id int, name string )

> stored as orc tblproperties("orc.compress"="SNAPPY");

OK

Time taken: 0.08 seconds

hive>  insert into table tbltest3 select * from tbl;

Total jobs = 3

Launching Job 1 out of 3

Number of reduce tasks is set to 0 since there's no reduce operator

Starting Job = job_1498660018952_0003, Tracking URL = http://hadoop1:8088/proxy/application_1498660018952_0003/

Kill Command = /opt/hadoop/bin/hadoop job  -kill job_1498660018952_0003

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

2017-06-28 23:30:29,865 Stage-1 map = 0%,  reduce = 0%

2017-06-28 23:30:34,007 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.14 sec

MapReduce Total cumulative CPU time: 1 seconds 140 msec

Ended Job = job_1498660018952_0003

Stage-4 is selected by condition resolver.

Stage-3 is filtered out by condition resolver.

Stage-5 is filtered out by condition resolver.

Moving data to: hdfs://hadoop1:9000/tmp/hive-grid/hive_2017-06-28_23-30-25_350_7458831371800658041-1/-ext-10000

Loading data to table default.tbltest3

Table default.tbltest3 stats: [numFiles=1, numRows=14, totalSize=365, rawDataSize=1288]

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1   Cumulative CPU: 1.14 sec   HDFS Read: 318 HDFS Write: 439 SUCCESS

Total MapReduce CPU Time Spent: 1 seconds 140 msec

OK

Time taken: 9.963 seconds

hive> dfs -ls /user/hive/warehouse/tbltest3;

Found 1 items

-rwxr-xr-x   1 grid supergroup        365 2017-06-28 23:30 /user/hive/warehouse/tbltest3/000000_0

hive>

> dfs -cat /user/hive/warehouse/tbltest3/000000_0;

ORC

)

9

"

A+_Az_

[email protected]+y-Az_A+_A++A+y-2345678,5A+y-9A+y-20

hive>

> show create table tbltest3;

OK

CREATE  TABLE `tbltest3`(

`id` int,

`name` string)

ROW FORMAT SERDE

'org.apache.hadoop.hive.ql.io.orc.OrcSerde'

STORED AS INPUTFORMAT

'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

LOCATION

'hdfs://hadoop1:9000/user/hive/warehouse/tbltest3'

TBLPROPERTIES (

'COLUMN_STATS_ACCURATE'='true',

'numFiles'='1',

'numRows'='14',

'orc.compress'='SNAPPY',

'rawDataSize'='1288',

'totalSize'='365',

'transient_lastDdlTime'='1498663835')

Time taken: 0.217 seconds, Fetched: 19 row(s)

hive>

> select * from tbltest3;

OK

1       Awyp

2       Azs

3       Als

4       Aww

5       Awyp2

6       Awyp3

7       Awyp4

8       Awyp5

9       Awyp6

10      Awyp7

11      Awyp8

12      Awyp5

13      Awyp9

14      Awyp20

Time taken: 0.689 seconds, Fetched: 14 row(s)

So for an ORC table, inserts are not affected by the compression parameters, and the inputformat and outputformat are no longer the text ones.
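The ORC data file above has no suffix (000000_0), so it cannot be told apart from plain text by name. What identifies it is its content: an ORC file begins with the three bytes "ORC", as the dfs -cat output shows. A minimal Python sketch of that check follows; the function name `is_orc` and the sample byte strings are made up for illustration.

```python
ORC_MAGIC = b"ORC"  # an ORC file begins with these three bytes


def is_orc(header):
    """Guess from the first bytes of a file whether it is an ORC file."""
    return header.startswith(ORC_MAGIC)


print(is_orc(b"ORC\x08\x03"))  # bytes after the magic are invented for this example
print(is_orc(b"1\x01Awyp\n"))  # a plain-text row is not ORC
```

In Hive itself the reader is chosen from the table's inputformat rather than by sniffing the file, so this check is only a convenience for inspecting files by hand.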

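III. Summary

1. Whether the data is uncompressed, compressed with the default codec, or compressed with LZO or LZOP, it is text format as far as Hive is concerned. On read, the codec is detected automatically from the data file's suffix; on write, the parameters decide whether to compress and with which codec.

2. ORC is a different format as far as Hive is concerned: however the parameters are set, data is read and written in the format specified by the CREATE TABLE statement.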
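For reference, the file sizes reported in the tests above can be put side by side. The short Python sketch below only does arithmetic on the totalSize values from the table stats; the dictionary labels are my own.

```python
ORIGINAL = 111  # bytes, uncompressed text (tbltxt)

# totalSize values taken from the table stats in the tests above.
SIZES = {
    "DefaultCodec (.deflate)": 76,
    "LzoCodec (.lzo_deflate)": 106,
    "LzopCodec (.lzo)": 148,
    "ORC + SNAPPY": 365,
}

ratios = {name: size / ORIGINAL for name, size in SIZES.items()}
for name, ratio in ratios.items():
    print(f"{name:<26} {ratio:.2f}x")
```

With only 14 rows, fixed header and metadata overhead dominates, which is why the LZOP and ORC files come out larger than the uncompressed text. These ratios say nothing about how the formats behave on realistic data volumes.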

Time: 2024-11-05 01:23:23
