mapreduce程序编写(WordCount)

折腾了半天。终于编写成功了第一个自己的mapreduce程序，并通过打jar包的方式运行起来了。

运行环境：

windows 64bit

eclipse 64bit

jdk6.0 64bit

一、工程准备

1、新建java project

2、导入jar包

新建一个user library 把hadoop文件夹里的hadoop-core和lib包里的所有包都导入进来，以免出错。

二、编码

1、主要是计算单词的小程序，测试用

package com.hirra;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
    //嵌套类 Mapper
    //Mapper<keyin,valuein,keyout,valueout>
    public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable>{
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();  

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while(itr.hasMoreTokens()){
                word.set(itr.nextToken());
                context.write(word, one);//Context机制
            }
        }
    }  

    //嵌套类Reducer
    //Reduce<keyin,valuein,keyout,valueout>
    //Reducer的valuein类型要和Mapper的va lueout类型一致,Reducer的valuein是Mapper的valueout经过shuffle之后的值
    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{
        private IntWritable result = new IntWritable();  

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context)
                throws IOException, InterruptedException {
            int sum  = 0;
            for(IntWritable i:values){
                sum += i.get();
            }
            result.set(sum);
            context.write(key,result);//Context机制
        }  

    }  

    public static void main(String[] args) throws Exception{
        Configuration conf = new Configuration();//获得Configuration配置 Configuration: core-default.xml, core-site.xml　　　　　　//很关键　　　　 conf.set("mapred.job.tracker", "hadoopmaster:9001");
　　　　String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();//获得输入参数[hdfs://localhost:9000/user/dat/input, hdfs://localhost:9000/user/dat/output]
        if(otherArgs.length != 2){//判断输入参数个数，不为两个异常退出
            System.err.println("Usage:wordcount <in> <out>");
            System.exit(2);
        }  

        ////设置Job属性
        Job job = new Job(conf,"word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);//将结果进行局部合并
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);  

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));//传入input path
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));//传入output path，输出路径应该为空，否则报错org.apache.hadoop.mapred.FileAlreadyExistsException。  

        System.exit(job.waitForCompletion(true)?0:1);//是否正常退出
    }
}

2、注意问题

有些jar包没导入会出现问题

三、生成jar包

1、eclipse自带功能export jar包

四、运行

1、ssh client工具导入至linux

2、hadoop运行,转到hadoop的bin目录下，执行下面指令:

./hadoop jar test.jar /README.txt /usr/dat/output

3、注意问题

output目录必须是之前不存在的路径。

时间： 2024-08-26 10:14:28

mapreduce程序编写(WordCount)的相关文章

为Hadoop的MapReduce程序编写makefile

最近需要把基于hadoop的MapReduce程序集成到一个大的用C/C++编写的框架中,需要在make的时候自动将MapReduce应用进行编译和打包.这里以简单的WordCount1为例说明具体的实现细节,注意:hadoop版本为2.4.0. 源代码包含两个文件,一个是WordCount1.java是具体的对单词计数实现的逻辑:第二个是CounterThread.java,其中简单的当前处理的行数做一个统计和打印.代码分别见附1. 编写makefile的关键是将hadoop提供的jar包的路

Hadoop学习之路(5)Mapreduce程序完成wordcount

程序使用的测试文本数据: Dear River Dear River Bear Spark Car Dear Car Bear Car Dear Car River Car Spark Spark Dear Spark 1编写主要类 (1)Maper类首先是自定义的Maper类代码 public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> { public void map(LongWrit

用Python编写WordCount程序任务

1. 用Python编写WordCount程序并提交任务程序 WordCount 输入一个包含大量单词的文本文件输出文件中每个单词及其出现次数(频数),并按照单词字母顺序排序,每个单词和其频数占一行,单词和频数之间有间隔 2.编写map函数,reduce函数 import sys for line in sys.stdin: line=line.strip() words=line.split() for word in words: print '%s\t%s' % (word,1)

MapReduce实例：编写MapReduce程序，统计每个买家收藏商品数量

现有某电商网站用户对商品的收藏数据,记录了用户收藏的商品id以及收藏日期,名为buyer_favorite1. buyer_favorite1包含:买家id,商品id,收藏日期这三个字段,数据以“\t”分割,样本数据及格式如下: 买家id 商品id 收藏日期 10181 1000481 2010-04-04 16:54:31 20001 1001597 2010-04-07 15:07:52 20001 1001560 2010-04-07 15:08:27 2

[译]下一代的Hadoop Mapreduce – 如何编写YARN应用程序

1. [译]下一代的Hadoop Mapreduce – 如何编写YARN应用程序 http://www.rigongyizu.com/hadoop-mapreduce-next-generation-writing-yarn-applications/

MapReduce程序（一）——wordCount

写在前面:WordCount的功能是统计输入文件中每个单词出现的次数.基本解决思路就是将文本内容切分成单词,将其中相同的单词聚集在一起,统计其数量作为该单词的出现次数输出. 1.MapReduce之wordcount的计算模型 1.1 WordCount的Map过程假设有两个输入文本文件,输入数据经过默认的LineRecordReader被分割成一行行数据,再经由map()方法得到<key, value>对,Map过程如下: 得到map方法输出的< key,value>对后,Ma

如何在maven项目里面编写mapreduce程序以及一个maven项目里面管理多个mapreduce程序

我们平时创建普通的mapreduce项目,在遍代码当你需要导包使用一些工具类的时候, 你需要自己找到对应的架包,再导进项目里面其实这样做非常不方便,我建议我们还是用maven项目来得方便多了话不多说了,我们就开始吧首先你在eclipse里把你本地安装的maven导进来选择你本地安装的maven路径勾选中你添加进来的maven 把本地安装的maven的setting文件添加进来接下来创建一个maven项目可以看到一个maven项目创建成功!! 现在我们来配置pom.xml文件,把map

在Pycharm上编写WordCount程序

本篇博客将给大家介绍怎么在PyCharm上编写运行WordCount程序. 第一步下载安装PyCharm 下载Pycharm PyCharm的下载地址(Linux版本).下载完成后你将得到一个名叫:pycharm-professional-2018.2.4.tar.gz文件.我们选择的是正版软件,学生可申请免费使用.详细信息请百度. 安装PyCharm 执行以下命令解压文件: cd ~/下载 tar -xvf pycharm-professional-2018.2.4.tar.gz Shell

hive--构建于hadoop之上、让你像写SQL一样编写MapReduce程序

hive介绍什么是hive? hive:由Facebook开源用于解决海量结构化日志的数据统计 hive是基于hadoop的一个数据仓库工具,可以将结构化的数据映射为数据库的一张表,并提供类SQL查询功能.本质就是将HQL(hive sql)转化为MapReduce程序我们使用MapReduce开发会很麻烦,但是程序员很熟悉sql,于是hive就出现了,可以让我们像写sql一样来编写MapReduce程序,会自动将我们写的sql进行转化.但底层使用的肯定还是MapReduce. hive处理