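Building an Inverted Index
Requirements analysis
Requirement: given a large amount of text (documents, web pages), build a search index.
The final result records how many times each word appears in each document.
Approach:
First, read the full content of every document. For each word, use the word joined with the document name as the key and 1 as the value, so that the data takes the following form.
Map-side output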
hello-a.txt 1
tom-a.txt 1
hello-a.txt 1
jerry-a.txt 1
At the reduce stage, values that share a key are grouped together
hello-a.txt <1,1>
Reduce-side output
hello-a.txt 2
tom-a.txt 1
jerry-a.txt 1
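The data flow above can be tried out without a cluster. The following is a minimal sketch in plain Java that only imitates the map, grouping, and reduce steps in memory; it is not Hadoop code. The class name `InvertedIndexSketch`, its method names, and the sample text for `a.txt` are assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class: an in-memory imitation of the job, not Hadoop code.
final class InvertedIndexSketch {

    // "Map" step: emit one (word-fileName, 1) pair per word occurrence.
    static List<Map.Entry<String, Long>> mapPhase(String fileName, String content) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : content.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleImmutableEntry<>(word + "-" + fileName, 1L));
            }
        }
        return pairs;
    }

    // "Shuffle" and "reduce" steps: group pairs by key and add up their values.
    // A TreeMap keeps the keys sorted, similar to how reduce input is ordered.
    static Map<String, Long> reducePhase(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> counts = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return counts;
    }

    // Run both steps over a map of fileName -> file content.
    static Map<String, Long> buildIndex(Map<String, String> documents) {
        List<Map.Entry<String, Long>> allPairs = new ArrayList<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            allPairs.addAll(mapPhase(doc.getKey(), doc.getValue()));
        }
        return reducePhase(allPairs);
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "hello tom\nhello jerry"); // assumed sample content
        buildIndex(docs).forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```

With the assumed sample content, `hello-a.txt` ends up with a count of 2 and the other two keys with 1, matching the reduce-side output listed above.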
Code:
IndexMain:
...
TextInputFormat.addInputPath(job, new Path("file:///D:\\Study\\BigData\\heima\\stage2\\5、大数据离线第五天\\倒排索引\\input"));
TextOutputFormat.setOutputPath(job, new Path("file:///D:\\Study\\BigData\\heima\\stage2\\5、大数据离线第五天\\倒排索引\\out_index"));
...
IndexMapper:
package cn.itcast.demo2.index;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class IndexMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Work out which file this record came from: get the input split
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        // Get the file name
        String name = inputSplit.getPath().getName();
        String line = value.toString();
        String[] split = line.split(" ");
        for (String word : split) {
            // Skip empty tokens produced by blank lines or repeated spaces
            if (word.isEmpty()) {
                continue;
            }
            // Output format: tom-b.txt 1
            context.write(new Text(word + "-" + name), new LongWritable(1L));
        }
    }
}
IndexReducer:
package cn.itcast.demo2.index;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class IndexReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long num = 0L;
        for (LongWritable longWritable : values) {
            // Add up the values instead of counting them, so the total
            // stays correct even if a value is larger than 1
            num += longWritable.get();
        }
        // Output format: hello-a.txt 2
        context.write(key, new LongWritable(num));
    }
}
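For a reducer of this shape, counting the incoming values and adding them up give the same total only while every value is 1. The following plain Java sketch (no Hadoop involved) compares the two on raw values and on values that have already been partially aggregated, for example by a combiner. Whether the original job configures a combiner is not stated, so that scenario is an assumption, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class comparing two ways of totalling reduce-side values.
final class ReduceSumSketch {

    // Counts how many values arrived, ignoring what each value holds.
    static long countValues(List<Long> values) {
        long num = 0L;
        for (Long ignored : values) {
            num++;
        }
        return num;
    }

    // Adds up the values themselves.
    static long sumValues(List<Long> values) {
        long num = 0L;
        for (long value : values) {
            num += value;
        }
        return num;
    }

    public static void main(String[] args) {
        // Values for hello-a.txt straight from the map step: <1,1>
        List<Long> raw = Arrays.asList(1L, 1L);
        // The same key after an assumed pre-aggregation step: <2>
        List<Long> preAggregated = Arrays.asList(2L);

        System.out.println("raw: count=" + countValues(raw) + ", sum=" + sumValues(raw));
        System.out.println("pre-aggregated: count=" + countValues(preAggregated)
                + ", sum=" + sumValues(preAggregated));
    }
}
```

On the raw values both approaches give 2. On the pre-aggregated values counting gives 1 while summing still gives 2, which is why summing is the safer choice if the values can ever differ from 1.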
Original article: https://www.cnblogs.com/mediocreWorld/p/11031111.html
Time: 2024-11-14 13:55:28