Source Code Analysis of Hive's RegexSerDe

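A table for one of our business lines was recently created with RegexSerDe. I had used it before to parse nginx logs but never looked into it closely, so this time I read through its implementation.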

The CREATE TABLE statement:

CREATE external TABLE ods_cart_log
(
time_local STRING,
request_json  STRING,
trace_id_num STRING
)
PARTITIONED BY
(
dt string,
hour string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
("input.regex" =
"\\\[(.*?)\\\] .*\\\|(.*?) (.*?) \\\[(.*?)\\\]",
"output.format.string" ="%1$s %2$s  %4$s")
STORED AS TEXTFILE;

Test data (a single log line, wrapped here for display):

[2014-07-24 15:54:54] [6] OperationData.php: 
:89|{"action":"add","redis_key_hash":9,"time":"1406188494.73745500","source":"web",
"mars_cid":"","session_id":"","info":{"cart_id":26885,"user_id":4,"size_id":"2784145",
"num":"1","warehouse":"VIP_NH","brand_id":"7379","cart_record_id":26885,"channel":"te"}}
 trace_id [40618849399972881308]

I expected trace_id_num to hold the fourth capture group (40618849399972881308), but the query actually returned the third group (trace_id).
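
To see which capture group holds which value, the regex can be run outside Hive. Below is a minimal sketch using java.util.regex. The class name GroupDump is made up for illustration, the JSON payload is shortened, and the regex is written in the form it is assumed to have once Hive has unescaped the string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive: lists every capture group of a line.
class GroupDump {
    // Assumed form of input.regex after Hive unescapes the string literal.
    static final String REGEX = "\\[(.*?)\\] .*\\|(.*?) (.*?) \\[(.*?)\\]";

    // Sample line with the JSON payload shortened for readability.
    static final String LINE = "[2014-07-24 15:54:54] [6] OperationData.php: :89|"
            + "{\"action\":\"add\",\"cart_id\":26885} trace_id [40618849399972881308]";

    // Returns the capture groups in order, or an empty list if the line does not match.
    static List<String> groups(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        List<String> result = new ArrayList<>();
        if (m.matches()) { // the whole line must match
            for (int g = 1; g <= m.groupCount(); g++) {
                result.add(m.group(g));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> g = groups(REGEX, LINE);
        for (int i = 0; i < g.size(); i++) {
            System.out.println("group " + (i + 1) + " = " + g.get(i));
        }
    }
}
```

The regex defines four groups while the table declares only three columns, which is worth keeping in mind when reading the code that follows.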

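Looking at the implementation:

RegexSerDe takes three main parameters:

1) input.regex: the regular expression

2) output.format.string: the output format

3) input.regex.case.insensitive: whether matching is case-insensitive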
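
For the third parameter, a plausible reading is that it toggles java.util.regex.Pattern.CASE_INSENSITIVE when the pattern is compiled. The sketch below illustrates only that flag; the class and method names are hypothetical, and how RegexSerDe wires the property internally is an assumption rather than a quote from the source.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of a case-insensitivity switch on a regex.
class CaseFlagDemo {
    // Compiles the regex with or without CASE_INSENSITIVE and tests the whole line.
    static boolean fullMatch(String regex, String line, boolean caseInsensitive) {
        int flags = caseInsensitive ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(regex, flags).matcher(line).matches();
    }

    public static void main(String[] args) {
        String regex = "get (.*)";
        String line = "GET /index.html";
        System.out.println("case-sensitive:   " + fullMatch(regex, line, false));
        System.out.println("case-insensitive: " + fullMatch(regex, line, true));
    }
}
```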

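input.regex is used in the deserialization method, that is, when data is read (Hive reading files from HDFS). Correspondingly, output.format.string is used in the serialization method, that is, when data is written (Hive writing files to HDFS).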

The deserialize method contains the following code, which fills the row with the matched fields:

    // numColumns is the number of columns declared on the table
    // (numColumns = columnNames.size()). In this example columnNames is
    // [time_local, request_json, trace_id_num], so numColumns is 3.
    for (int c = 0; c < numColumns; c++) {
      try {
        // Column c is always filled from capture group c + 1: the mapping is
        // sequential and never skips a group. trace_id_num (c = 2) therefore
        // gets the third group of the regex, regardless of output.format.string.
        row.set(c, m.group(c + 1));
      } catch (RuntimeException e) {
        partialMatchedRows++;
        if (partialMatchedRows >= nextPartialMatchedRows) {
          nextPartialMatchedRows = getNextNumberToDisplay(nextPartialMatchedRows);
          // Report the row
          LOG.warn("" + partialMatchedRows
              + " partially unmatched rows are found, " + " cannot find group "
              + c + ": " + rowText);
        }
        row.set(c, null);
      }
    }
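
Because column c is always filled from group c + 1, one way to get the number into trace_id_num is to stop capturing the trace_id token, for example by turning the third group into a non-capturing group (?:...). The following is a sketch of that idea under the same mapping rule. ColumnMapper and its deserialize method are illustrative stand-ins, not the Hive classes, and the sample line uses a shortened JSON payload.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in that mimics the "column c <- group c + 1" rule.
class ColumnMapper {
    static final String ORIGINAL = "\\[(.*?)\\] .*\\|(.*?) (.*?) \\[(.*?)\\]";
    // Same regex, but the trace_id token is matched without being captured,
    // so the bracketed number becomes group 3.
    static final String ADJUSTED = "\\[(.*?)\\] .*\\|(.*?) (?:.*?) \\[(.*?)\\]";

    static final String LINE = "[2014-07-24 15:54:54] [6] OperationData.php: :89|"
            + "{\"action\":\"add\"} trace_id [40618849399972881308]";

    // Returns one value per column, or null when the line does not match at all.
    static List<String> deserialize(String regex, String line, int numColumns) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> row = new ArrayList<>();
        for (int c = 0; c < numColumns; c++) {
            try {
                row.add(m.group(c + 1));
            } catch (RuntimeException e) {
                row.add(null); // the regex has fewer groups than the table has columns
            }
        }
        return row;
    }

    public static void main(String[] args) {
        System.out.println("original: " + deserialize(ORIGINAL, LINE, 3));
        System.out.println("adjusted: " + deserialize(ADJUSTED, LINE, 3));
    }
}
```

In the table definition, the same change would be made inside input.regex, with the escaping Hive requires for string literals.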

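On reflection, the output.format.string setting seems to be of little use here. RegexSerDe only applies to TEXTFILE tables, which are typically populated with LOAD; but LOAD is a file-level operation on HDFS and involves no serialization. Serialization only comes into play when data is inserted with INSERT INTO ... SELECT, and in that case the values written come from the SELECT. Either way, output.format.string plays no part in how rows are read back.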
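
The %1$s syntax in output.format.string is the positional-argument form of java.util.Formatter. Assuming the serializing side passes one argument per table column to String.format (an assumption about the implementation, labeled as such), a three-column table could only satisfy %1$s through %3$s. A small sketch with hypothetical names:

```java
import java.util.MissingFormatArgumentException;

// Hypothetical illustration of positional format specifiers.
class FormatDemo {
    // Renders one text line from column values using a Formatter-style pattern.
    static String render(String format, Object... columns) {
        return String.format(format, columns);
    }

    public static void main(String[] args) {
        Object[] row = {"2014-07-24 15:54:54", "{}", "trace_id"};
        System.out.println(render("%1$s %2$s %3$s", row));
        try {
            // The format from the table definition refers to a fourth argument.
            render("%1$s %2$s  %4$s", row);
        } catch (MissingFormatArgumentException e) {
            System.out.println("no argument for " + e.getFormatSpecifier());
        }
    }
}
```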

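There are actually two RegexSerDe classes,

located at

./serde/src/java/org/apache/hadoop/hive/serde2/RegexSerDe.java and

./contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java

Both extend the abstract class AbstractSerDe. The code shows that the class under contrib implements both serialize and deserialize, whereas the first one (under serde) only implements deserialize. This suggests that the serialize method of RegexSerDe is probably of little practical use.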

A few other points to note:

1. If a line does not match the regex, every field of that row comes out as null.

    if (!m.matches()) {
      unmatchedRows++;
      if (unmatchedRows >= nextUnmatchedRows) {
        nextUnmatchedRows = getNextNumberToDisplay(nextUnmatchedRows);
        // Report the row
        LOG.warn("" + unmatchedRows + " unmatched rows are found: " + rowText);
      }
      return null;
    }
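
Note that Matcher.matches() succeeds only when the regex covers the entire line, unlike find(), which accepts a match anywhere in it. A line with trailing content the regex does not account for therefore counts as unmatched. A short illustration (the class name and sample regex are hypothetical):

```java
import java.util.regex.Pattern;

// Hypothetical illustration of whole-line matching versus substring matching.
class MatchVsFind {
    // True only if the regex consumes the entire line.
    static boolean wholeLine(String regex, String line) {
        return Pattern.compile(regex).matcher(line).matches();
    }

    // True if the regex matches any part of the line.
    static boolean anywhere(String regex, String line) {
        return Pattern.compile(regex).matcher(line).find();
    }

    public static void main(String[] args) {
        String regex = "\\[(.*?)\\] (\\w+)";
        System.out.println(wholeLine(regex, "[a] b"));
        System.out.println(wholeLine(regex, "[a] b extra"));
        System.out.println(anywhere(regex, "[a] b extra"));
    }
}
```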

2. Every column of the table must be of type string, otherwise an error is thrown. If another type is needed, convert with cast in the SELECT.

    for (int c = 0; c < numColumns; c++) {
      if (!columnTypes.get(c).equals(TypeInfoFactory.stringTypeInfo)) {
        throw new SerDeException(getClass().getName()
            + " only accepts string columns, but column[" + c + "] named "
            + columnNames.get(c) + " has type " + columnTypes.get(c));
      }
    }

Time: 2024-10-05 05:06:19