Sample Join Analysis

Sample data: student.txt

1,yaoshuya,25
2,yaoxiaohua,29
3,yaoyuanyie,15
4,yaoshupei,26

Sample data: score.txt

1,yuwen,100
1,shuxue,99
2,yuwen,99
2,shuxue,88
3,yuwen,99
3,shuxue,56
4,yuwen,33
4,shuxue,99

Output file contents:

1    [yaoshuya,25,yuwen,100]
1    [yaoshuya,25,shuxue,99]
2    [yaoxiaohua,29,yuwen,99]
2    [yaoxiaohua,29,shuxue,88]
3    [yaoyuanyie,15,yuwen,99]
3    [yaoyuanyie,15,shuxue,56]
4    [yaoshupei,26,yuwen,33]
4    [yaoshupei,26,shuxue,99]

Arguments:

args = ("-Dio.sort.mb=10 "
        + "-r 1 "
        + "-inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat "
        + "-outFormat org.apache.hadoop.mapred.TextOutputFormat "
        + "-outKey org.apache.hadoop.io.Text "
        + "-outValue org.apache.hadoop.mapred.join.TupleWritable "
        + "hdfs://namenode:9000/user/hadoop/student/student.txt "
        + "hdfs://namenode:9000/user/hadoop/student/score2.txt "
        + "hdfs://namenode:9000/user/hadoop/joinout").split(" ");

Note that the output format I use is TextOutputFormat, purely to make the output data easy to inspect.

The output value type is org.apache.hadoop.mapred.join.TupleWritable. It is a convenient, array-like type that can hold multiple values.

I added one line to the source code to set the key/value separator of my input files to a comma (,):

jobConf.set("key.value.separator.in.input.line", ",");

A brief look at the key code:

job.setInputFormatClass(CompositeInputFormat.class);
job.getConfiguration().set(CompositeInputFormat.JOIN_EXPR,
    CompositeInputFormat.compose(op, inputFormatClass,
        plist.toArray(new Path[0])));

The join is performed with CompositeInputFormat. The class's Javadoc reads:

/**
* An InputFormat capable of performing joins over a set of data sources sorted
* and partitioned the same way.
*
* A user may define new join types by setting the property
* <tt>mapreduce.join.define.<ident></tt> to a classname.
* In the expression <tt>mapreduce.join.expr</tt>, the identifier will be
* assumed to be a ComposableRecordReader.
* <tt>mapreduce.join.keycomparator</tt> can be a classname used to compare
* keys in the join.
* @see #setFormat
* @see JoinRecordReader
* @see MultiFilterRecordReader
*/

The join type is specified through op: inner, outer, tbl, and so on. You can also implement other types if you need them.
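To make the role of op more concrete, here is a minimal sketch of the kind of join expression string that ends up in the job configuration. It assumes, to the best of my understanding of Hadoop's behavior, that the expression has the shape op(tbl(class,"path"),tbl(class,"path"),...). The class JoinExprSketch and its compose method are hypothetical stand-ins written for illustration, not Hadoop's own code.

```java
// Hypothetical stand-in that builds a join expression string by hand.
// Assumption: the expression has the shape op(tbl(class,"path"),...).
final class JoinExprSketch {

    /** Builds op(tbl(fmt,"p1"),tbl(fmt,"p2"),...) for the given paths. */
    static String compose(String op, String inputFormatClassName, String... paths) {
        StringBuilder sb = new StringBuilder(op).append('(');
        for (int i = 0; i < paths.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append("tbl(")
              .append(inputFormatClassName)
              .append(",\"")
              .append(paths[i])
              .append("\")");
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // Same input format and paths as in the arguments shown earlier.
        System.out.println(compose(
            "inner",
            "org.apache.hadoop.mapred.KeyValueTextInputFormat",
            "hdfs://namenode:9000/user/hadoop/student/student.txt",
            "hdfs://namenode:9000/user/hadoop/student/score2.txt"));
    }
}
```

Swapping "inner" for "outer" changes only the leading operator of the expression; the list of tbl(...) sources stays the same.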

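How is the join actually performed? The records of the two sources are merge-joined on the keys they present to the mapper, so every data source must be sorted by key. The join is carried out on the map side.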
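The merge join described above can be illustrated without Hadoop. The following is a minimal sketch of an inner merge join over two key-sorted lists, using the sample data from student.txt and score.txt. The names MergeJoinSketch, Rec and innerJoin are hypothetical, and keys are compared as plain Java strings, which is an assumption and not necessarily how Hadoop compares them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of an inner merge join over key-sorted inputs.
final class MergeJoinSketch {

    /** One record: the join key plus the remainder of the line. */
    static final class Rec {
        final String key;
        final String value;

        Rec(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Joins two lists that are BOTH already sorted by key. */
    static List<String> innerJoin(List<Rec> left, List<Rec> right) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i).key.compareTo(right.get(j).key);
            if (cmp < 0) {
                i++;                       // left key has no partner yet
            } else if (cmp > 0) {
                j++;                       // right key has no partner yet
            } else {
                String key = left.get(i).key;
                int iEnd = i;
                int jEnd = j;
                // Find the run of equal keys on each side.
                while (iEnd < left.size() && left.get(iEnd).key.equals(key)) iEnd++;
                while (jEnd < right.size() && right.get(jEnd).key.equals(key)) jEnd++;
                // Emit every pairing inside the two runs.
                for (int a = i; a < iEnd; a++) {
                    for (int b = j; b < jEnd; b++) {
                        out.add(key + "\t[" + left.get(a).value + ","
                                + right.get(b).value + "]");
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> students = Arrays.asList(
            new Rec("1", "yaoshuya,25"), new Rec("2", "yaoxiaohua,29"),
            new Rec("3", "yaoyuanyie,15"), new Rec("4", "yaoshupei,26"));
        List<Rec> scores = Arrays.asList(
            new Rec("1", "yuwen,100"), new Rec("1", "shuxue,99"),
            new Rec("2", "yuwen,99"), new Rec("2", "shuxue,88"),
            new Rec("3", "yuwen,99"), new Rec("3", "shuxue,56"),
            new Rec("4", "yuwen,33"), new Rec("4", "shuxue,99"));
        innerJoin(students, scores).forEach(System.out::println);
    }
}
```

Because both cursors only ever move forward, each input is read once. That single pass is only correct if both inputs are sorted by key, which is why the sort order is a hard requirement.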

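In my test I read the files with KeyValueTextInputFormat, whose default line format is key\tvalue, so I used the line of code shown above to override the separator. If the key is not the first field in your files, however, you will have to write your own FileInputFormat.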
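As a rough illustration of what the separator setting means for the sample data, the sketch below splits a line at the first occurrence of the separator, so that everything before it becomes the key and everything after it becomes the value. KeyValueSplitSketch and splitKeyValue are hypothetical names. Treating a line without a separator as a key with an empty value is my understanding of the usual behavior and should be read as an assumption.

```java
// Hypothetical illustration of splitting a line into key and value
// at the FIRST occurrence of the separator.
final class KeyValueSplitSketch {

    /** Returns {key, value}; the value is empty if no separator is found. */
    static String[] splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new String[] { line, "" };   // assumption: whole line is the key
        }
        return new String[] { line.substring(0, pos), line.substring(pos + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("1,yaoshuya,25", ',');
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

Splitting only at the first comma is what leaves "yaoshuya,25" intact as a single value, which is consistent with the tuples in the output file listed earlier.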

Clearly, though, this requires that all the data sources you want to join be read with the same FileInputFormat.

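One more point: joins over more than two files are supported. I only used two sample files here, but you can add more by placing their paths before the output directory.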

Time: 2024-12-16 11:16:16
