MapReduce编程系列 — 1：计算单词

1、代码：

package com.mrdemo;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    //conf.set("fs.defaultFS", "hdfs://192.168.6.77:9000");
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

2、准备测试数据。

在hdfs中新建一个input01文件夹，然后将/home/hadoop/Documents文件夹下新建的hello文件上传到hdfs中的input01文件夹中。

测试数据：

hello world!
hello hadoop
jobtracker
maptracker
reducetracker
task
namenode
datanode
block
beautiful world
hadoop:
HDFS
MapReduce

3、命令

[email protected]:~$ hadoop fs -mkdir input01

[email protected]:~$ cd /home/hadoop/Documents

[email protected]:~/Documents$ hadoop fs -copyFromLocal hello input01

hdfs://localhost:9000/user/yyq/input01
hdfs://localhost:9000/user/yyq/output01

4、配置运行参数

Run As → Run Configurations… ，在Arguments中配置运行参数，例如程序的输入参数：

时间： 2024-12-22 22:25:47

MapReduce编程系列 — 1：计算单词的相关文章

MapReduce编程系列 — 2：计算平均分

1.项目名称: 2.程序代码: package com.averagescorecount; import java.io.IOException; import java.util.Iterator; import java.util.StringTokenizer; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWrit

MapReduce 编程系列五 MapReduce 主要过程梳理

前面4篇文章介绍了如何编写一个简单的日志提取程序,读取HDFS share/logs目录下的所有csv日志文件,然后提取数据后,最终输出到share/output目录下. 本篇停留一下,梳理一下主要过程,然后提出新的改进目标. 首先声明一下,所有的代码都是maven工程的,没有使用任何IDE. 这是我一贯的编程风格,用Emacs + JDEE开发.需要使用IDE的只需要学习如何在IDE中使用maven即可. 可比较的序列化第一个是序列化,这是各种编程技术中常用的.MapReduce的特别之处

MapReduce 编程系列八根据输入路径产生输出路径和清除HDFS目录

有了前面的MultipleOutputs的使用经验,就可以将HDFS输入目录的路径解析出来,组成输出路径,这在业务上是十分常用的.这样其实是没有多文件名输出,仅仅是调用了MultipleOutputs的addNamedOutput方法一次,设置文件名为result. 同时为了保证计算的可重入性,每次都需要将已经存在的输出目录删除. 先看pom.xml, 现在参数只有一个输入目录了,输出目录会在该路径后面自动加上/output. <project xmlns="http://maven.ap

MapReduce 编程系列十一 Map阶段的调优

MapOutputBuffer 对于每一个Map,都有一个内存buffer用来缓存中间结果,这不仅可以缓存,而且还可以用来排序,被称为MapOutputBuffer, 设置这个buffer大小的配置是 io.sort.mb 默认值是100MB. 一般当buffer被使用到一定比例,就会将Map的中间结果往磁盘上写,这个比例的配置是: io.sort.spill.percent 默认值是80%或者0.8. 在内存中排序缓存的过程叫做sort,而当超过上面的比例在磁盘上写入中间结果的过程称之为spi

MapReduce 编程系列十二用Hadoop Streaming技术集成newLISP脚本

本文环境和之前的Hadoop 1.x不同,是在Hadoop 2.x环境下测试.功能和前面的日志处理程序一样. 第一个newLISP脚本,起到mapper的作用,在stdin中读取文本数据,将did作为key, value为1,然后将结果输出到stdout 第二个newLISP脚本,起到reducer的作用,在stdin中读取<key, values>, key是dic, values是所有的value,简单对value求和后,写到stdout中最后应该可以在HDFS下看到结果. 用脚本编程的

MapReduce 编程系列七 MapReduce程序日志查看

首先,如果需要打印日志,不需要用log4j这些东西,直接用System.out.println即可,这些输出到stdout的日志信息可以在jobtracker站点最终找到. 其次,如果在main函数启动的时候用System.out.println打印的日志,直接在控制台就可以看到. 再其次,jobtracker站点很重要. http://your_name_node:50030/jobtracker.jsp 注意,在这里看到Map 100%不一定正确,有时候会卡在Map阶段并没有完成,而此时居然

MapReduce 编程系列六 MultipleOutputs使用

在前面的例子中,输出文件名是默认的: _logs part-r-00001 part-r-00003 part-r-00005 part-r-00007 part-r-00009 part-r-00011 part-r-00013 _SUCCESS part-r-00000 part-r-00002 part-r-00004 part-r-00006 part-r-00008 part-r-00010 part-r-00012 part-r-00014 part-r-0000N 还有一个_SUC

MapReduce 编程系列九 Reducer数目

本篇介绍怎样控制reduce的数目.前面观察结果文件,都会发现通常是以part-r-00000 形式出现多个文件,事实上这个reducer的数目有关系.reducer数目多,结果文件数目就多. 在初始化job的时候.是能够设置reducer的数目的.example4在example的基础上做了改动.改动了pom.xml.使得结束一个參数作为reducer的数目.改动了LogJob.java的代码,作为设置reducer数目. xsi:schemaLocation="http://maven.ap

MapReduce 编程系列四 MapReduce例子程序运行

MapReduce程序编译是可以在普通的Java环境下进行,现在来到真实的环境上运行. 首先,将日志文件放到HDFS目录下 $ hdfs dfs -put *.csv /user/chenshu/share/logs/ 14/09/27 17:03:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where app