How Pig handles null (in practice, an empty text field is read as one of two values: null or '')

Pig handles empty text fields in an unusual way: an empty field may come back as null, or it may come back as the empty string ''.

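For example, suppose each log record carries the fields name, age, sex. If a record is ",,,", name is read as null; if the record is ",19,男", name is read as ''. The field is empty in both records, yet the value Pig returns differs, so take care.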
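A minimal sketch of why this matters when filtering. It assumes a hypothetical relation `people` with the schema (name:chararray, age:chararray, sex:chararray), already loaded by whichever loader the job uses; the relation name is illustrative and is not part of the script in this post.

```pig
-- 'people' is a hypothetical relation: (name:chararray, age:chararray, sex:chararray)

-- Matches only rows where name is the empty string; rows where name is null are dropped,
-- because comparing null with == yields null and FILTER keeps only rows whose condition is true.
only_empty = FILTER people BY name == '';

-- Matches only rows where name is null; rows where name is '' are dropped.
only_null = FILTER people BY name IS NULL;

-- Matches both kinds of empty name. The IS NULL test is placed first.
any_empty = FILTER people BY (name IS NULL) OR (name == '');
```

Checking only one of the two forms silently misses the other, which is why both tests are combined in the last statement.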

In short, when Pig reads log records, an empty field ends up as one of two values: '' or null.

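A concrete example: take the fourth field from the end of each log record (city_region_id in the schema below). It is empty in every record shown here, with nothing between the two commas (",,").

After Pig reads the records, the field holds one of two values ('' or null).

1. Empty field read as null (in these records every field after os_version is empty):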

(5,148,b84daa9b-194e-4c4c-9595-ce4bfabca918,605378805132617404,2014-11-05 18:31:05,2014-11-05 18:31:05,1,62052,2,,,,,,,,239.130.237.121,2,-1,,,-1,e15b6c6675c6d6e8eb7851ccc866608787daeadd,b84daa9b-194e-4c4c-9595-ce4bfabca918,02:00:00:00:00:00,-991608703440210811,,,,,75061,200,,2,2,1,7.0,,,,,,)

(5,148,b84daa9b-194e-4c4c-9595-ce4bfabca918,605378805132617404,2014-11-05 18:31:05,2014-11-05 18:31:05,2,62052,2,,,,,,,,239.130.237.121,2,-1,,,-1,e15b6c6675c6d6e8eb7851ccc866608787daeadd,b84daa9b-194e-4c4c-9595-ce4bfabca918,02:00:00:00:00:00,-991608703440210811,,,,,75061,200,,2,2,1,7.0,,,,,,)

2. Empty field read as '' (in these records non-empty fields follow the empty one):

(3,90,864616028213476,1412364855586,2014-08-25 15:07:42,,1,14999,2,,,,,,460,00,112.5.236.229,2,864616028213476,3ff1c154fb35073a,,864616028213476|3ff1c154fb35073a,,,,864616028213476|3ff1c154fb35073a,,,,,311,35,-1,,1,3,2.x,1,91,,35.0,105.0,132012121230123)

(5,148,ddeb5f0f-09a7-456e-a9dc-5fb5e96c5453,682937329735483418,2014-11-04 20:08:37,2014-11-04 20:08:37,1,62052,2,,,,,,,,160.35.136.117,1,-1,,,-1,e72da4be06382bd0826be09927f650ca2570add9,ddeb5f0f-09a7-456e-a9dc-5fb5e96c5453,02:00:00:00:00:00,-3733654770696849299,,,,,66454,206,,2,2,1,7.1,,,,38.878998,-76.9898,032010032322002)

(3,90,864616028213476,1412364855586,2014-08-25 15:07:42,,2,14999,2,,,,,,460,00,112.5.236.229,2,864616028213476,3ff1c154fb35073a,,864616028213476|3ff1c154fb35073a,,,,864616028213476|3ff1c154fb35073a,,,,,311,35,-1,,1,3,2.x,1,91,,35.0,105.0,132012121230123)

(5,148,ddeb5f0f-09a7-456e-a9dc-5fb5e96c5453,682937329735483418,2014-11-04 20:08:37,2014-11-04 20:08:37,2,62052,2,,,,,,,,160.35.136.117,1,-1,,,-1,e72da4be06382bd0826be09927f650ca2570add9,ddeb5f0f-09a7-456e-a9dc-5fb5e96c5453,02:00:00:00:00:00,-3733654770696849299,,,,,66454,206,,2,2,1,7.1,,,,38.878998,-76.9898,032010032322002)

The processing code is as follows:

--citylevel report analysis:pig -p date=2014-07-30 -p year=2014 -p file_path=/user/wizad/test -f

SET job.name 'test_citylevel_reporth_istorical';

SET job.priority HIGH;

--REGISTER piggybank.jar;

REGISTER wizad-etl-udf-0.1.jar;

--DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();

DEFINE SequenceFileLoader com.XXX.xxx.etl.pig.SequenceFileCSVLoader();

%default Cleaned_Log /user/wizad/test/wizad/cleaned/2014-10*/*/part*

%default AD_Data /user/wizad/data/wizad/metadata/ad/part*

%default Campaign_Data /user/wizad/data/wizad/metadata/campaign/part*

%default Region_Template /user/wizad/data/wizad/metadata/region_template/part-m-00000

%default Addtion_Data /user/wizad/data/report/region_addition/addition_data.txt

%default Industry_Path $file_path/report/historical/citylevel/$year/industry


%default Industry_SUM $file_path/report/historical/citylevel/$year/industry_sum

%default Industry_TMP $file_path/report/historical/citylevel/$year/industry_tmp

%default Industry_Brand_Path $file_path/report/historical/citylevel/$year/industry_brand

%default Industry_Brand_SUM $file_path/report/historical/citylevel/$year/industry_brand_sum

%default Industry_Brand_TMP $file_path/report/historical/citylevel/$year/industry_brand_tmp

%default ALL_Path $file_path/report/historical/citylevel/$year/all

%default ALL_SUM $file_path/report/historical/citylevel/$year/all_sum

%default ALL_TMP $file_path/report/historical/citylevel/$year/all_tmp

%default output_path /user/wizad/tmp/result

--origin_cleaned_data = LOAD '$Cleaned_Log' USING PigStorage(',')

origin_cleaned_data = LOAD '$Cleaned_Log' USING SequenceFileLoader

AS (ad_network_id:chararray,

wizad_ad_id:chararray,

guid:chararray,

id:chararray,

create_time:chararray,

action_time:chararray,

log_type:chararray,

ad_id:chararray,

positioning_method:chararray,

location_accuracy:chararray,

lat:chararray,

lon:chararray,

cell_id:chararray,

lac:chararray,

mcc:chararray,

mnc:chararray,

ip:chararray,

connection_type:chararray,

imei:chararray,

android_id:chararray,

android_advertising_id:chararray,

udid:chararray,

openudid:chararray,

idfa:chararray,

mac_address:chararray,

uid:chararray,

density:chararray,

screen_height:chararray,

screen_width:chararray,

user_agent:chararray,

app_id:chararray,

app_category_id:chararray,

device_model_id:chararray,

carrier_id:chararray,

os_id:chararray,

device_type:chararray,

os_version:chararray,

country_region_id:chararray,

province_region_id:chararray,

city_region_id:chararray,

ip_lat:chararray,

ip_lon:chararray,

quadkey:chararray);

my_test1 = filter origin_cleaned_data by guid == 'b84daa9b-194e-4c4c-9595-ce4bfabca918';

dump my_test1;

describe my_test1;

--store my_test1 into '$output_path/mytest' using PigStorage(',');

my_test2 = filter origin_cleaned_data by guid == '864616028213476' or guid == 'ddeb5f0f-09a7-456e-a9dc-5fb5e96c5453';

dump my_test2;

describe my_test2;

--store my_test2 into '$output_path/mytest' using PigStorage(',');

-- Map the second kind of empty value ('') to 'unknown'

unknown_data = FOREACH origin_cleaned_data GENERATE wizad_ad_id,guid,log_type,

((city_region_id == '') ? 'unknown' : city_region_id) AS city_region_id;  --(wizad_ad_id,guid,log_type,city_region_id)

-- Map the first kind of empty value (null) to 'isnull'

null_data =  FOREACH origin_cleaned_data GENERATE wizad_ad_id,guid,log_type,

((city_region_id is NULL) ? 'isnull' : city_region_id) AS city_region_id;  --(wizad_ad_id,guid,log_type,city_region_id)

-- Inspect the 'unknown' and 'isnull' records

all_unknown = filter unknown_data by city_region_id == 'unknown';

dump all_unknown;

--store all_unknown into '$output_path/unknown' using PigStorage(',');

all_null = filter null_data by city_region_id == 'isnull';

dump all_null;

--store all_null into '$output_path/isnull' using PigStorage(',');

-- Map both kinds of empty value to 'no_use'

origin_historical = FOREACH origin_cleaned_data GENERATE wizad_ad_id,guid,log_type,

(((city_region_id is null) or (city_region_id == '')) ? 'no_use' : city_region_id) AS city_region_id;  --(wizad_ad_id,guid,log_type,city_region_id)

dump origin_historical;

describe origin_historical;

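The results for the two cases are as follows. Judging only from the sample records above, null shows up where the empty field belongs to a run of empty fields at the end of the record, and '' where non-empty fields follow it; this is an observation from the samples, not a documented rule.

The 'unknown' records: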

(90,864616028213476,1,unknown)

(90,862624024878336,1,unknown)

(90,990001402489819,1,unknown)

(90,862343020727070,1,unknown)

(201,1ff90f55-f5cd-4b2a-9357-5bde0e3ff526,1,unknown)

(201,c3916c92-a70c-4d34-babd-d3fc021cf642,1,unknown)

(201,00:c6:10:dd:81:17,1,unknown)

(201,88:53:95:da:9e:03,1,unknown)

......

And the null records:

(148,b84daa9b-194e-4c4c-9595-ce4bfabca918,1,isnull)

(148,13fbe940-7cd0-44a1-b637-a0df8ea83621,1,isnull)

(148,b84daa9b-194e-4c4c-9595-ce4bfabca918,2,isnull)

(148,13fbe940-7cd0-44a1-b637-a0df8ea83621,2,isnull)
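To see how many records fall into each case, the field can be labeled and counted in one pass. This is a rough sketch that reuses the origin_cleaned_data relation from the script above; the relation names labeled, by_kind and kind_counts and the labels 'NULL', 'EMPTY' and 'VALUE' are made up for illustration.

```pig
-- Label city_region_id as NULL, EMPTY (the '' case) or VALUE.
-- The IS NULL test comes first, so the == '' comparison is only reached for non-null values.
labeled = FOREACH origin_cleaned_data GENERATE
    ((city_region_id IS NULL) ? 'NULL'
        : ((city_region_id == '') ? 'EMPTY' : 'VALUE')) AS kind;

-- Count the records under each label.
by_kind = GROUP labeled BY kind;
kind_counts = FOREACH by_kind GENERATE group AS kind, COUNT(labeled) AS n;

DUMP kind_counts;
```

A non-zero count for both NULL and EMPTY confirms that the input contains both forms and that downstream logic has to handle both.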

Date: 2024-08-26 17:07:21
