MapReduce实现数据去重

一、原理分析

　　Mapreduce的处理过程，由于Mapreduce会在Map~reduce中，将重复的Key合并在一起，所以Mapreduce很容易就去除重复的行。Map无须做任何处理，设置Map中写入context的东西为不作任何处理的行，也就是Map中最初处理的value即可，而Reduce同样无须做任何处理，写入输出文件的东西就是，最初得到的Key。

　　我原来以为是map阶段用了hashmap，根据hash值的唯一性。估计应该不是...

　　Map是输入文件有几行，就运行几次。

二、代码

2.1 Mapper

package algorithm;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DuplicateRemoveMapper extends
		Mapper<LongWritable, Text, Text, Text> {
	//输入文件是数字 不过可能也有字符等 所以用Text，不用LongWritable
	public void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		context.write(value, new Text());//后面不能是null，否则，空指针

	}

}

2.2 Reducer

package algorithm;

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DuplicateRemoveReducer extends Reducer<Text, Text, Text, Text> {

	public void reduce(Text key, Iterable<Text> value, Context context)
			throws IOException, InterruptedException {
		// process values
		context.write(key, null); //可以出处null
	}

}

2.3 Main

package algorithm;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DuplicateMainMR  {

	public static void main(String[] args) throws Exception{
		// TODO Auto-generated method stub
		Configuration conf = new Configuration();
		Job job = new Job(conf,"DuplicateRemove");
		job.setJarByClass(DuplicateMainMR.class);
		job.setMapperClass(DuplicateRemoveMapper.class);
		job.setReducerClass(DuplicateRemoveReducer.class);
		job.setOutputKeyClass(Text.class);
		//输出是null，不过不能随意写  否则包类型不匹配
		job.setOutputValueClass(Text.class);

		job.setNumReduceTasks(1);
		//hdfs上写错了文件名 DupblicateRemove  多了个b
		//hdfs不支持修改操作
		FileInputFormat.addInputPath(job, new Path("hdfs://192.168.58.180:8020/ClassicalTest/DupblicateRemove/DuplicateRemove.txt"));
		FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.58.180:8020/ClassicalTest/DuplicateRemove/DuplicateRemoveOut"));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}

}

三、输出分析

3.1 输入与输出

没啥要对比的....不贴了

3.2 控制台

doop.mapreduce.Job.updateStatus(Job.java:323)
  INFO - Job job_local4032991_0001 completed successfully
 DEBUG - PrivilegedAction as:hxsyl (auth:SIMPLE) from:org.apache.hadoop.mapreduce.Job.getCounters(Job.java:765)
  INFO - Counters: 38
	File System Counters
		FILE: Number of bytes read=560
		FILE: Number of bytes written=501592
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=48
		HDFS: Number of bytes written=14
		HDFS: Number of read operations=13
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=4
	Map-Reduce Framework
		Map input records=8
		Map output records=8
		Map output bytes=26
		Map output materialized bytes=48
		Input split bytes=142
		Combine input records=0
		Combine output records=0
		Reduce input groups=6
		Reduce shuffle bytes=48
		Reduce input records=8
		Reduce output records=6
		Spilled Records=16
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=4
		CPU time spent (ms)=0
		Physical memory (bytes) snapshot=0
		Virtual memory (bytes) snapshot=0
		Total committed heap usage (bytes)=457179136
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=24
	File Output Format Counters
		Bytes Written=14
 DEBUG - PrivilegedAction as:hxsyl (auth:SIMPLE) from:org.apache.hadoop.mapreduce.Job.updateStatus(Job.java:323)
 DEBUG - stopping client from cache: [email protected]
 DEBUG - removing client from cache: [email protected]
 DEBUG - stopping actual client because no more references remain: [email protected]
 DEBUG - Stopping client
 DEBUG - IPC Client (521081105) connection to /192.168.58.180:8020 from hxsyl: closed
 DEBUG - IPC Client (521081105) connection to /192.168.58.180:8020 from hxsyl: stopped, remaining connections 0

时间： 2024-10-07 06:27:11

MapReduce实现数据去重的相关文章

利用MapReduce实现数据去重

数据去重主要是为了利用并行化的思想对数据进行有意义的筛选. 统计大数据集上的数据种类个数.从网站日志中计算访问地等这些看似庞杂的任务都会涉及数据去重. 示例文件内容: 此处应有示例文件设计思路数据去重的最终目标是让原始数据中出现次数超过一次的数据在输出文件中只出现一次. 自然就想到将同一数据的所有记录都交给一台reduce机器,无路这个数据出现多少次,只要在最终结果中输出一次就可以了. 具体就是reduce的输入应该以数据作为key,而对value-list没有要求. 当reduce收到一个

Hadoop mapreduce 数据去重数据排序小例子

数据去重: 数据去重,只是让出现的数据仅一次,所以在reduce阶段key作为输入,而对于values-in没有要求,即输入的key直接作为输出的key,并将value置空.具体步骤类似于wordcount: Tip:输入输出路径配置. import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop

MapReduce编程之数据去重

数据去重主要是为了掌握和利用并行化思想来对数据进行有意义的筛选.统计大数据集上的数据种类个数.从网站日志中计算访问地等这些看似庞杂的任务都会涉及数据去重.下面就进入这个实例的MapReduce程序设计. package com.hadoop.mr; import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.

用一个MapReduce job实现去重，多目录输出功能

总结之前工作中遇到的一个问题. 背景: 运维用scribe从apache服务器推送过来的日志有重复记录,所以这边的ETL处理要去重,还有个需求是要按业务类型多目录输出,方便挂分区,后面的使用. 这两个需求单独处理都没有问题,但要在一个mapreduce里完成,需要一点技巧. 1.map输入数据,经过一系列处理,输出时: if(ttype.equals("other")){ file = (result.toString().hashCode() & 0x7FFFFFFF)%40

hadoop数据去重

"数据去重"主要是为了掌握和利用并行化思想来对数据进行有意义的筛选.统计大数据集上的数据种类个数.从网站日志中计算访问地等这些看似庞杂的任务都会涉及数据去重.下面就进入这个实例的MapReduce程序设计. 1.1 实例描述对数据文件中的数据进行去重.数据文件中的每行都是一个数据. 样例输入如下所示: 1.txt 内容 1 1 2 2 2.txt内容 4 4 3 3 样例输出如下: 1 2 3 4 将测试数据上传到hdfs上/input目录下,同时确保输出目录不存在. 1.2 设计思

Hadoop 数据去重

数据去重 1.原始数据 1)file1: 2012-3-1 a 2012-3-2 b 2012-3-3 c 2012-3-4 d 2012-3-5 a 2012-3-6 b 2012-3-7 c 2012-3-3 c 2)file2: 2012-3-1 b 2012-3-2 a 2012-3-3 b 2012-3-4 d 2012-3-5 a 2012-3-6 c 2012-3-7 d 2012-3-3 c 数据输出: 2012-3-1 a 2012-3-1 b 2012-3-2 a 2012-

数据去重2---高性能重复数据检测与删除技术研究一些零碎的知识

高性能重复数据检测与删除技术研究这里介绍一些零碎的有关数据重删的东西,以前总结的,放上可以和大家交流交流. 1 数据量的爆炸增长对现有存储系统的容量.吞吐性能.可扩展性.可靠性.安全性. 可维护性和能耗管理等各个方面都带来新的挑战, 消除冗余信息优化存储空间效率成为缓解存储容量瓶颈的重要手段,现有消除信息冗余的主要技术包括数据压缩[8]和数据去重. 2 数据压缩是通过编码方法用更少的位( bit)表达原始数据的过程,根据编码过程是否损失原始信息量,又可将数据压缩细分为无损压缩和有损压缩.

【问题整理】MySQL大量数据去重处理

由于工作中需要进行数据去重,所以做一下记录,其实是很小白的问题.... 其实对于数据去重来讲,最好的是在设计程序和数据库的时候就考虑到数据冗余问题,不插入重复的数据.但是呢,,,这个项目,如果其中的两个字段同时重复,就算冗余,但是还需要自增长的id作为主键方便查询....so...算了,我写完数据自己去重吧... 因为有大量的重复数据,所以选择的去重方法是通过聚合函数建立一个新的表,然后重命名.sql代码如下: create table tmp select * from table_name

脚本实现数据去重

如图所示: 代码: <!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <script src="jquery.js"></script> <script type="text/javascript"> $(function(){ var divArea = $(".drop"