hive本身支持的子查询非常有限,Hive不支持where子句中的子查询,只允许子查询在from中出现
错误写法:
insert into table branch_atmzc_sum Select XT_OP_TRL, SA_TX_DT,"取款-存款",b.cr_tx_amt- a.cr_tx_amt as cr_tx_amt from branch_atmzc a join branch_atmzc b on (a.XT_OP_TRL = b.XT_OP_TRL and a.SA_TX_DT = b.SA_TX_DT and a.tran_cd = ‘ATM存款‘ and b.tran_cd = ‘ATM取款‘), counts from branch_atmzc group by XT_OP_TRL, SA_TX_DT,cr_tx_amt,counts;
正确写法:
insert into table branch_atmzc_sum Select a.XT_OP_TRL, a.SA_TX_DT,"取款-存款",b.cr_tx_amt- a.cr_tx_amt ,b.counts+a.counts from branch_atmzc a join branch_atmzc b on (a.XT_OP_TRL = b.XT_OP_TRL and a.SA_TX_DT = b.SA_TX_DT and a.tran_cd = ‘ATM存款‘ and b.tran_cd = ‘ATM取款‘)
另外一种:考虑效率
SELECT t1.产品类型, COUNT(DISTINCT (IF(t2.用户ID IS NULL, NULL, t1.用户ID))) AS KEEP_UV FROM ( SELECT 产品类型, 用户ID FROM 事实表 WHERE (`DATE` >= 20140201 AND `DATE` <= 20140228) ) t1 LEFT OUTER JOIN ( SELECT 产品类型, 用户ID FROM 事实表 WHERE (`DATE` >= 20140101 AND `DATE` <= 20140131) ) t2 ON (t1.产品类型 = t2.产品类型 AND t1.用户ID = t2.用户ID) GROUP BY t1.产品类型
本身表包含的字段信息多,时间跨度大。对于这种对于IN / EXISTS子查询(准确地说,这里是非相关子查询)有一种高效的实现,就是LEFT SEMI JOIN:
LEFT SEMI JOIN implements the uncorrelated IN/EXISTS subquery semantics in an efficient way.
left join
SELECT 产品类型, COUNT(DISTINCT t1.用户ID) AS KEEP_UV FROM ( SELECT 产品类型, 用户ID FROM 事实表 WHERE (`DATE` >= 20140201 AND `DATE` <= 20140228) ) t1 LEFT SEMI JOIN ( SELECT 产品类型, 用户ID FROM 事实表 WHERE (`DATE` >= 20140101 AND `DATE` <= 20140131) ) t2 ON (t1.产品类型 = t2.产品类型 AND t1.用户ID = t2.用户ID) GROUP BY 产品类型
时间: 2024-10-11 14:46:02