Self-Joining a Table with MapReduce

Raw data

* Columns: child, parent (tab-separated in the input file)

Tom Lucy
Tom Jack
Jone Locy
Jone Jack
Lucy Mary
Lucy Ben
Jack Alice
Jack Jesse
Terry Alice
Terry Jesse
Philip Alma
Mark Terry
Mark Alma

Goal: derive the grandchild-grandparent relation from the child-parent relation.

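* Design: join the table with itself, matching the left table's parent column against the right table's child column. Both columns serve as the join key.

* In the map phase, each input line is split into child and parent. The mapper emits parent as the key and child as the value; this record belongs to the left table.

* For the same pair, the mapper also emits child as the key and parent as the value; this record belongs to the right table.

* To tell the two tables apart, each output value carries a tag at its start: for example, the character 1 for the left table and 2 for the right table.

* The shuffle then performs the join: all values that share a key reach the same reduce call, so each key's value list contains the grandchild-grandparent relation.

* The reducer parses each key's value list, collecting the left-table children (the grandchildren) into one list and the right-table parents (the grandparents) into another, all under the same key. The Cartesian product of the two lists is the result.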
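The steps above can be tried without a cluster. The following is a minimal in-memory sketch, not the Hadoop job itself: a TreeMap stands in for the shuffle, and the names SelfJoinSketch, mapLine and selfJoin are illustrative assumptions rather than part of the original code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SelfJoinSketch {

    // Map step: emit each (child, parent) row twice, once per side of the join.
    static List<String[]> mapLine(String line) {
        List<String[]> out = new ArrayList<>();
        String[] fields = line.split("\t");
        if (fields.length == 2) {
            out.add(new String[] {fields[1], "1_" + fields[0]}); // left table: key = parent
            out.add(new String[] {fields[0], "2_" + fields[1]}); // right table: key = child
        }
        return out;
    }

    // Shuffle and reduce, simulated in memory: group values by key,
    // then take the Cartesian product of the two tagged lists per key.
    static List<String> selfJoin(List<String> lines) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : lines) {
            for (String[] kv : mapLine(line)) {
                groups.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
            }
        }
        List<String> result = new ArrayList<>();
        for (List<String> values : groups.values()) {
            List<String> grandchildren = new ArrayList<>();
            List<String> grandparents = new ArrayList<>();
            for (String v : values) {
                String[] parts = v.split("_", 2);
                if ("1".equals(parts[0])) {
                    grandchildren.add(parts[1]);
                } else if ("2".equals(parts[0])) {
                    grandparents.add(parts[1]);
                }
            }
            for (String gc : grandchildren) {
                for (String gp : grandparents) {
                    result.add(gc + "\t" + gp);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("Tom\tLucy", "Lucy\tMary", "Lucy\tBen");
        for (String row : selfJoin(rows)) {
            System.out.println(row);
        }
    }
}
```

For the three rows Tom Lucy, Lucy Mary and Lucy Ben, only the key Lucy receives both a left-table value (1_Tom) and right-table values (2_Mary, 2_Ben), so the sketch yields the pairs Tom Mary and Tom Ben.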

1. Map class

package test.mr.selfrelated;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
 * Self-join of a table (builds the grandchild-grandparent table).
 *
 * Input: one tab-separated "child parent" pair per line, for example
 *   Tom	Lucy
 *   Lucy	Mary
 *
 * Each pair is emitted twice: keyed by parent and tagged "1_" (left table),
 * and keyed by child and tagged "2_" (right table). The shuffle groups both
 * sides by key, and the reducer takes the Cartesian product of the
 * left-table and right-table values for each key.
 */
public class selfRelatedMap extends Mapper<LongWritable, Text, Text, Text> {
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		String line = value.toString();
		if (line.trim().length() > 0) {
			// Expect exactly two tab-separated fields: child, parent
			String[] fields = line.split("\t");
			if (fields.length == 2) {
				// Left table: key = parent, value = tagged child
				context.write(new Text(fields[1]), new Text("1_" + fields[0]));
				// Right table: key = child, value = tagged parent
				context.write(new Text(fields[0]), new Text("2_" + fields[1]));
			}
		}
	}
}

2. Reduce class

package test.mr.selfrelated;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class selfRelatedRedu extends Reducer<Text, Text, Text, Text> {
	@Override
	protected void reduce(Text key, Iterable<Text> values, Context context)
			throws IOException, InterruptedException {
		List<String> grandchildren = new ArrayList<String>();
		List<String> grandparents = new ArrayList<String>();
		for (Text t : values) {
			// Split the tag from the name; the limit of 2 keeps names
			// that contain "_" intact and preserves an empty name
			String[] parts = t.toString().split("_", 2);
			if (parts.length < 2) {
				continue; // untagged value, skip it
			}
			if ("1".equals(parts[0])) {
				// Left table: a child of the key, i.e. a grandchild
				grandchildren.add(parts[1]);
			} else if ("2".equals(parts[0])) {
				// Right table: a parent of the key, i.e. a grandparent
				grandparents.add(parts[1]);
			}
		}
		// Cartesian product of grandchildren and grandparents
		for (String gc : grandchildren) {
			for (String gp : grandparents) {
				context.write(new Text(gc), new Text(gp));
			}
		}
	}
}
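
The reducer separates the tag from the name at an underscore, so how the value is split matters if a name can itself contain an underscore. A minimal standalone sketch of the difference, assuming a hypothetical name Mary_Ann that does not appear in the sample data (the class TagSplitSketch and the helper parseTagged are likewise illustrative):

```java
import java.util.Arrays;

public class TagSplitSketch {

    // Splits a tagged value such as "1_Tom" into {tag, name}.
    // The limit of 2 stops splitting after the first underscore.
    static String[] parseTagged(String value) {
        return value.split("_", 2);
    }

    public static void main(String[] args) {
        // Limit 2: the name stays whole
        System.out.println(Arrays.toString(parseTagged("2_Mary_Ann")));
        // No limit: the name is cut at every underscore
        System.out.println(Arrays.toString("2_Mary_Ann".split("_")));
    }
}
```

With the limit, the second element is the full name Mary_Ann. Without it, the value breaks into three parts and index 1 holds only Mary.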

3. Job class

package test.mr.selfrelated;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

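public class selfRelatedMain {
	public static void main(String[] args) throws Exception {
		if (args.length != 2) {
			System.err.println("Usage: selfRelatedMain <input path> <output path>");
			System.exit(2);
		}
		Configuration conf = new Configuration();
		// Job.getInstance replaces the new Job(conf) constructor,
		// which is deprecated in Hadoop 2.x
		Job job = Job.getInstance(conf);
		job.setJarByClass(selfRelatedMain.class);

		job.setMapperClass(selfRelatedMap.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(Text.class);

		job.setReducerClass(selfRelatedRedu.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);

		FileInputFormat.addInputPath(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		// Exit with a non-zero status if the job fails
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}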

Time: 2024-10-20 13:27:03
