hadoop wordcount

Mapper

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// The number of map tasks is determined by the number of input splits
public class WCMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

	// key is the byte offset of the line, value is the line itself
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		String line = value.toString();
		String[] words = StringUtils.split(line, " ");
		// emit <word, 1> for every word in the line
		for (String word : words) {
			context.write(new Text(word), new LongWritable(1));
		}
	}
}
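The comment above says the number of map tasks follows the number of input splits. As a minimal sketch of how that can be influenced from the driver (the size values are illustrative assumptions, and job refers to the Job built in the runners below):

// Illustrative split-size bounds: a smaller max split size yields more map tasks
FileInputFormat.setMinInputSplitSize(job, 16 * 1024 * 1024L);  // 16 MB
FileInputFormat.setMaxInputSplitSize(job, 64 * 1024 * 1024L);  // 64 MB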

  

Reducer 

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WCReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

	@Override
	protected void reduce(Text key, Iterable<LongWritable> values, Context context)
			throws IOException, InterruptedException {
		// sum the partial counts for this word; summing (rather than counting
		// entries) stays correct if this class is also used as a combiner
		long count = 0;
		for (LongWritable l : values) {
			count += l.get();
		}
		context.write(key, new LongWritable(count));
	}
}
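Because the reducer sums its LongWritable values, the same class can also serve as a combiner to pre-aggregate counts on the map side. A hedged one-line addition for the runner below (not part of the original code):

job.setCombinerClass(WCReducer.class);  // optional map-side pre-aggregation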

Runner

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WCRunner {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);

		job.setJarByClass(WCRunner.class);

		job.setMapperClass(WCMapper.class);
		job.setReducerClass(WCReducer.class);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(LongWritable.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);

		// Setting the number of reduce tasks produces that many output files;
		// which keys land in each file is decided by the partitioner,
		// e.g. job.setPartitionerClass(HashPartitioner.class);
		job.setNumReduceTasks(10);

		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
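To make the partitioner comment concrete, here is a minimal sketch of a custom partitioner (the class name WCPartitioner is an assumption, not from the original post). It mirrors what the default HashPartitioner does: the partition index decides which reduce task, and therefore which output file, a given word ends up in.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WCPartitioner extends Partitioner<Text, LongWritable> {

	@Override
	public int getPartition(Text key, LongWritable value, int numPartitions) {
		// same hashing scheme as the default HashPartitioner
		return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
	}
}

// in the runner: job.setPartitionerClass(WCPartitioner.class);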

  

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCRunner2 extends Configured implements Tool {

	@Override
	public int run(String[] args) throws Exception {
		// use the configuration injected by ToolRunner so that generic
		// command-line options (-D, -files, ...) actually take effect
		Configuration conf = getConf();
		Job job = Job.getInstance(conf);

		job.setJarByClass(WCRunner2.class);

		job.setMapperClass(WCMapper.class);
		job.setReducerClass(WCReducer.class);

		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(LongWritable.class);

		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(LongWritable.class);

		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		return job.waitForCompletion(true) ? 0 : 1;
	}

	public static void main(String[] args) throws Exception {
		System.exit(ToolRunner.run(new Configuration(), new WCRunner2(), args));
	}
}

Run: hadoop jar wc.jar com.easytrack.hadoop.mr.WCRunner2 /wordcount.txt /wc/output4
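Because WCRunner2 goes through ToolRunner, generic options such as -D can be placed before the job arguments, and the output directory can be inspected afterwards. The commands below are a sketch that reuses the paths from the command above; the reduce count of 5 is just an example:

hadoop jar wc.jar com.easytrack.hadoop.mr.WCRunner2 -D mapreduce.job.reduces=5 /wordcount.txt /wc/output4
hadoop fs -ls /wc/output4
hadoop fs -cat /wc/output4/part-r-00000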

