MapReduce Application Example: Single-Table Join

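1. Problem Description

The single-table join example asks us to extract the data we care about from the given data; in other words, it mines information that is already implicit in the raw records.

Given a child-parent table, produce the corresponding grandchild-grandparent table.

Input data, file01: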

child        parent
Tom          Lucy
Tom          Jack
Jone         Lucy
Jone         Jack
Lucy         Marry
Lucy         Ben
Jack         Alice
Jack         Jesse
Terry        Alice
Terry        Jesse
Philip       Terry
Philip       Alma
Mark         Terry
Mark         Alma

Expected output:

grandchild    grandparent
Tom    Alice
Tom    Jesse
Jone    Alice
Jone    Jesse
Tom    Marry
Tom    Ben
Jone    Marry
Jone    Ben
Philip    Alice
Philip    Jesse
Mark    Alice
Mark    Jesse

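2. Design Approach

1. In the map phase, split each input line and emit parent as the key and child as the value. These records form the left table.

2. In the same map call, also emit child as the key and parent as the value. These records form the right table.

3. Join the parent column of the left table with the child column of the right table. Because the shuffle groups records by key, each reduce call receives, for one person, both that person's children (left-table records) and that person's parents (right-table records). Every value carries a tag (1 or 2) so the reducer can tell the two tables apart, and the cross product of the two groups yields the grandchild-grandparent pairs.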
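The three steps above can be tried out without a cluster. The following is a minimal in-memory sketch, not the Hadoop job itself: the class name SelfJoinSketch and the method grandRelations are hypothetical names chosen for illustration, and a TreeMap stands in for the shuffle under the assumption that keys reach the reducer in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it does not use any Hadoop API.
class SelfJoinSketch {

    /** Self-joins a child-parent table and returns "grandchild<TAB>grandparent" rows. */
    static List<String> grandRelations(String[][] childParent) {
        // Stand-in for the shuffle: join key -> tagged values, as the map phase would emit them.
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (String[] row : childParent) {
            String child = row[0];
            String parent = row[1];
            // Left table: key = parent, value = child, tagged with 1.
            shuffled.computeIfAbsent(parent, k -> new ArrayList<>()).add("1+" + child);
            // Right table: key = child, value = parent, tagged with 2.
            shuffled.computeIfAbsent(child, k -> new ArrayList<>()).add("2+" + parent);
        }

        List<String> result = new ArrayList<>();
        // Stand-in for the reduce phase: one iteration per join key.
        for (List<String> values : shuffled.values()) {
            List<String> grandchildren = new ArrayList<>();
            List<String> grandparents = new ArrayList<>();
            for (String v : values) {
                if (v.charAt(0) == '1') {
                    grandchildren.add(v.substring(2)); // children of the key
                } else {
                    grandparents.add(v.substring(2));  // parents of the key
                }
            }
            // Cross product: every child of the key paired with every parent of the key.
            for (String gc : grandchildren) {
                for (String gp : grandparents) {
                    result.add(gc + "\t" + gp);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] table = { {"Tom", "Lucy"}, {"Lucy", "Marry"}, {"Lucy", "Ben"} };
        for (String row : grandRelations(table)) {
            System.out.println(row);
        }
    }
}
```

With the three sample rows in main, only the key Lucy has both a child (Tom) and parents (Marry, Ben), so the sketch produces the two pairs Tom/Marry and Tom/Ben. Keys that have records from only one table contribute nothing, which is why the reducer can skip them.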

3. Implementation

package tablerelation;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Single-table join: derives the grandchild-grandparent table.
 *
 * @author Amei
 */

public class SingleTableRelation {
    public static int time = 0;

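    /**
     * Emits each child-parent pair twice so that the parent column of the
     * left table can be joined with the child column of the right table.
     *
     * @author Amei
     */
    public static class Map extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            // Skip blank or malformed lines that do not hold two fields.
            if (tokenizer.countTokens() < 2) {
                return;
            }
            String child = tokenizer.nextToken();
            String parent = tokenizer.nextToken();
            // Skip the header line "child parent".
            if (!child.equals("child")) {
                // Tag that marks which table a record belongs to.
                int relation;
                // Left table: key = parent, value = child
                relation = 1;
                context.write(new Text(parent),
                        new Text(relation + "+" + child));
                // Right table: key = child, value = parent
                relation = 2;
                context.write(new Text(child),
                        new Text(relation + "+" + parent));
            }
        }
    }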

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values,
                Reducer<Text, Text, Text, Text>.Context output)
                throws IOException, InterruptedException {
            List<String> grandchilds = new ArrayList<>();
            List<String> grandparents = new ArrayList<>();

            // Write the table header once. The static counter lives in one JVM,
            // so this assumes a single reduce task; with several reducers each
            // output file would get its own header.
            if (time == 0) {
                output.write(new Text("grandchild"), new Text("grandparent"));
                time++;
            }
            for (Text val : values) {
                // Record layout: "<tag>+<name>", e.g. "1+Tom".
                String record = val.toString();
                char relation = record.charAt(0);
                if (relation == '1') {
                    // Left-table record: a child of the current key
                    grandchilds.add(record.substring(2));
                } else {
                    // Right-table record: a parent of the current key
                    grandparents.add(record.substring(2));
                }
            }
            // Cross product: each child of the key paired with each parent of the key.
            // Nothing is written when either list is empty.
            for (String grandchild : grandchilds) {
                for (String grandparent : grandparents) {
                    output.write(new Text(grandchild), new Text(grandparent));
                }
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();

        // Note: this constructor is deprecated in Hadoop 2.x in favor of
        // Job.getInstance(conf, jobName).
        Job job = new Job(conf, "single table relation");
        job.setJarByClass(SingleTableRelation.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        // Map output and final output both use Text keys and Text values.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Hard-coded HDFS paths; the output directory must not exist before the job runs.
        FileInputFormat.addInputPath(job, new Path("/user/hadoop_admin/singletalein"));
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop_admin/singletableout"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Date: 2024-08-27 13:38:02
