MapReduce Advanced Case Study: Inverted Index

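Understand what an inverted index does

Get familiar with the combine feature in MapReduce

Implement an inverted index in code to meet the requirement, as a way to understand how MapReduce works.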

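1. Understand what an inverted index does

1.1 Inverted index:

    An ordinary index goes from a document to the content it contains. An inverted index performs the opposite lookup, from content back to documents, hence the name.
    Put simply, given a word, it returns which files the word appears in and how often. Baidu search is an example: you type in a keyword, and the engine quickly finds the files on its servers that contain that keyword, then returns results ranked by frequency and other strategies (such as page click-through votes).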
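The lookup described above can be sketched in a few lines of plain Java, without Hadoop. This is only an illustration: the class name `InvertedIndexSketch`, the method `build`, and the sample documents `a.txt` and `b.txt` are made up for this example and do not appear in the original post.

```java
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {

    /**
     * Builds word -> (document -> frequency) from document -> text.
     * TreeMap keeps words and document names sorted, so the output is stable.
     */
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String word : doc.getValue().trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // an empty document yields one empty token
                }
                // record one more occurrence of this word in this document
                index.computeIfAbsent(word, w -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // hypothetical sample documents
        Map<String, String> docs = new TreeMap<>();
        docs.put("a.txt", "hello hadoop hello");
        docs.put("b.txt", "hello mapreduce");
        // prints {hadoop={a.txt=1}, hello={a.txt=2, b.txt=1}, mapreduce={b.txt=1}}
        System.out.println(build(docs));
    }
}
```

Looking up `hello` in the result gives both files and the count in each, which is exactly the "word to documents plus frequency" direction that the term inverted index refers to.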

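2. Get familiar with the combine feature in MapReduce

2.1 The combine feature in MapReduce

1 Map phase: the map phase first parses the input <key,value> pairs and extracts the information the index needs: the word, the document URI, and the term frequency. Key: the word together with the URI. Value: the number of occurrences of that word (each occurrence is emitted as 1).

2 Combine phase: after the map method has run, the combine phase adds up the values that share the same key, which gives the frequency of a word within a document.

3 Reduce phase: after the two phases above, the reduce phase only has to assemble the values that share the same key into the inverted-index file format; everything else is left to the MapReduce framework.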
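The three phases above can be imitated in plain Java to see what each one contributes. This is a rough sketch, not the Hadoop job itself: the class `InvertedIndexStages` and its methods are hypothetical, the `url##word word ...` line format is assumed, and the regrouping by word is done inside `reduce` here, whereas in a real job the framework's shuffle groups records by key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexStages {

    /** Map: one "url##w1 w2 ..." line becomes ("word,url", 1) pairs. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        String[] parts = line.split("##", 2);
        if (parts.length < 2) {
            return out; // no separator: nothing to index
        }
        for (String word : parts[1].trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word + "," + parts[0], 1));
            }
        }
        return out;
    }

    /** Combine: add up the counts of identical "word,url" keys. */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    /** Reduce: regroup by word and join the "url:count" postings with tabs. */
    static Map<String, String> reduce(Map<String, Integer> sums) {
        Map<String, String> index = new TreeMap<>();
        for (Map.Entry<String, Integer> e : sums.entrySet()) {
            String[] wordAndUrl = e.getKey().split(",", 2);
            String posting = wordAndUrl[1] + ":" + e.getValue();
            index.merge(wordAndUrl[0], posting, (a, b) -> a + "\t" + b);
        }
        return index;
    }

    public static void main(String[] args) {
        // hypothetical input lines
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(map("a.com##hello hadoop hello"));
        pairs.addAll(map("b.com##hello"));
        reduce(combine(pairs)).forEach((word, postings) ->
                System.out.println(word + " -> " + postings));
    }
}
```

The sketch splits the key at the first comma, so it assumes words do not contain commas.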

3. Implement an inverted index in code to meet the requirement, as a way to understand how MapReduce works.

3.1 The Java code

InvertedIndexMapReduce.java

package org.apache.hadoop.studyhadoop.index;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 *
 * @author zhangyy
 *
 */
public class InvertedIndexMapReduce extends Configured implements Tool {
    // step 1 : mapper
    /**
     * public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
     */
    public static class WordCountMapper extends //
            Mapper<LongWritable, Text, Text, Text> {

        private Text mapOutputKey = new Text();
        private Text mapOutputValue = new Text("1");

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {

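            // split1: each input line is expected to look like "url##word word ..."
            String[] lines = value.toString().split("##");
            if (lines.length < 2) {
                // skip lines without the "##" separator instead of failing the task
                return;
            }
            // get url
            String url = lines[0];

            // split2
            String[] strs = lines[1].split(" ");

            for (String str : strs) {
                if (str.isEmpty()) {
                    continue; // consecutive spaces produce empty tokens
                }
                mapOutputKey.set(str + "," + url);
                context.write(mapOutputKey, mapOutputValue);
            }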

        }
    }

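    // combiner: sums the "1" values of each "word,url" key, then re-keys the
    // record by word and emits "url:count" as the value.
    // Caveat: Hadoop may run a combiner zero, one, or several times on a map
    // output, so this re-keying is only correct when it runs exactly once.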
    public static class InvertedIndexCombiner extends //
            Reducer<Text, Text, Text, Text> {

        private Text CombinerOutputKey = new Text();
        private Text CombinerOutputValue = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {

            // split
            String[] strs = key.toString().split(",");

            // set key
            CombinerOutputKey.set(strs[0] + "\n");

            // set value
            int sum = 0;
            for (Text value : values) {
                sum += Integer.valueOf(value.toString());
            }

            CombinerOutputValue.set(strs[1] + ":" + sum);

            context.write(CombinerOutputKey, CombinerOutputValue);

        }
    }

    // step 2 : reducer
    public static class WordCountReducer extends //
            Reducer<Text, Text, Text, Text> {

        private Text outputValue = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // join all "url:count" postings of this word with tabs
            StringBuilder result = new StringBuilder();

            for (Text value : values) {
                result.append(value.toString()).append("\t");
            }

            outputValue.set(result.toString());

            context.write(key, outputValue);
        }
    }

    // step 3 : job

    @Override
    public int run(String[] args) throws Exception {

        // 1 : get configuration
        Configuration configuration = super.getConf();

        // 2 : create job
        Job job = Job.getInstance(//
                configuration,//
                this.getClass().getSimpleName());
        job.setJarByClass(InvertedIndexMapReduce.class);

        // job.setNumReduceTasks(tasks);

        // 3 : set job
        // input --> map --> reduce --> output
        // 3.1 : input
        Path inPath = new Path(args[0]);
        FileInputFormat.addInputPath(job, inPath);

        // 3.2 : mapper
        job.setMapperClass(WordCountMapper.class);
        // map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // ====================shuffle==========================
        // 1: partition
        // job.setPartitionerClass(cls);
        // 2: sort
        // job.setSortComparatorClass(cls);
        // 3: combine
        job.setCombinerClass(InvertedIndexCombiner.class);
        // 4: compress
        // set by configuration
        // 5 : group
        // job.setGroupingComparatorClass(cls);

        // ====================shuffle==========================

        // 3.3 : reducer
        job.setReducerClass(WordCountReducer.class);
        // final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // 3.4 : output
        Path outPath = new Path(args[1]);
        FileOutputFormat.setOutputPath(job, outPath);

        // 4 : submit job
        boolean isSuccess = job.waitForCompletion(true);
        return isSuccess ? 0 : 1;

    }

    public static void main(String[] args) throws Exception {

        // hard-coded HDFS paths for this demo; they replace any command-line arguments
        args = new String[] {
                "hdfs://namenode01.hadoop.com:8020/input/index.txt",
                "hdfs://namenode01.hadoop.com:8020/outputindex/"
                };

        // get configuration
        Configuration configuration = new Configuration();

        // configuration.set(name, value);

        // run job
        int status = ToolRunner.run(//
                configuration,//
                new InvertedIndexMapReduce(),//
                args);

        // exit program
        System.exit(status);
    }

}
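Because Hadoop does not promise how many times a combiner runs, a common way to make a combiner safe is to give it the same input and output format, so that applying it again changes nothing. The following is a minimal plain-Java sketch of that idea, not the code used in this post: it assumes the mapper would emit the word as the key and `url:1` as the value, and the class `RepeatableCombine` and method `mergePostings` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class RepeatableCombine {

    /**
     * Sums "url:count" postings per URL for a single word.
     * Input and output use the same format, so the method can be applied
     * to its own output without changing the result.
     */
    static List<String> mergePostings(List<String> postings) {
        Map<String, Integer> perUrl = new TreeMap<>();
        for (String posting : postings) {
            // split at the last ':' because a URL may contain ':' itself
            int sep = posting.lastIndexOf(':');
            String url = posting.substring(0, sep);
            int count = Integer.parseInt(posting.substring(sep + 1));
            perUrl.merge(url, count, Integer::sum);
        }
        List<String> merged = new ArrayList<>();
        for (Map.Entry<String, Integer> e : perUrl.entrySet()) {
            merged.add(e.getKey() + ":" + e.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        // hypothetical map output for one word
        List<String> mapOutput = Arrays.asList("a.com:1", "b.com:1", "a.com:1");
        List<String> once = mergePostings(mapOutput);
        List<String> twice = mergePostings(once);
        System.out.println(once);               // [a.com:2, b.com:1]
        System.out.println(twice.equals(once)); // true
    }
}
```

With this shape, the same merge logic could serve as both combiner and reducer, and the result would be the same whether the combiner runs zero, one, or several times.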

3.2 Running the test case

Upload the file:
hdfs dfs -put index.txt /input

Job run result:

Output:

Original post: http://blog.51cto.com/flyfish225/2096764

Time: 2024-12-13 21:01:04
