Hadoop Study Notes 3: Developing MapReduce

Quick notes:

Maven is a project management tool; project information is configured through an XML file.

Maven POM (Project Object Model).

Steps:

1. Set up and configure the development environment.

2. Write your map and reduce functions and run them in local (standalone) mode, from the command line or within your IDE.

3. Unit test --> test on a small dataset --> unleash on a cluster and test against the full dataset

--> tuning

1. Configuration API

  • Components in Hadoop are configured using Hadoop’s own configuration API.
  • org.apache.hadoop.conf package
  • Configurations read their properties from resources — XML files with a simple structure for defining name-value pairs.

For example, a configuration resource named configuration-1.xml might look like:

<?xml version="1.0"?>
<configuration>
  <property>
     <name>color</name>
     <value>yellow</value>
     <description>Color</description>
  </property>
  <property>
     <name>size</name>
     <value>10</value>
     <description>Size</description>
  </property>
  <property>
     <name>weight</name>
     <value>heavy</value>
     <final>true</final>
     <description>Weight</description>
  </property>
  <property>
     <name>size-weight</name>
     <value>${size},${weight}</value>
     <description>Size and weight</description>
  </property>
</configuration>

then access it from code like this:

Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");

assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));

Note:

  • Type information is not stored in the XML file; instead, properties can be interpreted as a given type when they are read.
  • The get() methods allow you to specify a default value, which is used if the property is not defined in the XML file, as in the case of breadth here.
  • When more than one resource is added, resources are processed in order and properties defined later override earlier definitions.
  • However, properties that are marked final cannot be overridden in later definitions.
  • System properties take priority for variable expansion: after System.setProperty("size", "14"), ${size} expands to 14 rather than to the value in the resource file. (A system property is only visible through the configuration API if it is also defined in a resource.)
  • Options specified on the command line with -D take priority over properties from the configuration files.

For example, passing -D mapreduce.job.reduces=2 will override the number of reducers set on the cluster or in any client-side configuration files.

% hadoop ConfigurationPrinter -D color=yellow | grep color
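
ConfigurationPrinter in the command above is not a stock Hadoop command; a minimal sketch of such a utility, assuming it implements the Tool interface so that ToolRunner handles the generic -conf and -D options:

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();   // includes loaded resources plus any -D overrides
    for (Map.Entry<String, String> entry : conf) {
      System.out.printf("%s=%s%n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options (-conf, -D, -fs, -jt) before calling run()
    System.exit(ToolRunner.run(new ConfigurationPrinter(), args));
  }
}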

  

2. Set up the development environment

A Maven POM (Project Object Model) is an XML file (pom.xml) that declares the dependencies needed for building and testing MapReduce programs.

  • The hadoop-client dependency contains all the Hadoop client-side classes needed to interact with HDFS and MapReduce.
  • For running unit tests, we use junit.
  • For writing MapReduce tests, we use mrunit.
  • The hadoop-minicluster library contains the "mini-" clusters that are useful for testing against Hadoop clusters running in a single JVM.

Many IDEs can read Maven POMs directly, so you can just point them at the directory containing the pom.xml file and start writing code.

Alternatively, you can use Maven to generate configuration files for your IDE. For example, the following creates Eclipse configuration files so you can import the project into Eclipse:

% mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true

3. Managing configuration

It is common to switch between running the application locally and running it on a cluster.

  • Keep Hadoop configuration files containing the connection settings for each cluster you work with.
  • We assume the existence of a directory called conf that contains three configuration files: hadoop-local.xml, hadoop-localhost.xml, and hadoop-cluster.xml.
  • For example, the following command shows a directory listing on the HDFS server running in pseudodistributed mode on localhost:


% hadoop fs -conf conf/hadoop-localhost.xml -ls

Found 2 items
drwxr-xr-x - tom supergroup 0 2014-09-08 10:19 input
drwxr-xr-x - tom supergroup 0 2014-09-08 10:19 output

4. A MapReduce example

Mapper: extracts the year and the air temperature from each fixed-width input line.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);                           // year is at a fixed offset
    int airTemperature = Integer.parseInt(line.substring(87, 92));  // signed temperature, e.g. "-0011" -> -11
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}

Unit test for the Mapper:

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.*;

public class MaxTemperatureMapperTest {
   @Test
   public void processesValidRecord() throws IOException, InterruptedException {
        Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                      // Year ^^^^
                "99999V0203201N00261220001CN9999999N9-00111+99999999999");
                                      // Temperature ^^^^^

       new MapDriver<LongWritable, Text, Text, IntWritable>()
         .withMapper(new MaxTemperatureMapper())
         .withInput(new LongWritable(0), value)
         .withOutput(new Text("1950"), new IntWritable(-11))
         .runTest();
    }
}

Reducer: finds the maximum temperature for each year.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());   // keep the largest temperature seen for this year
    }
    context.write(key, new IntWritable(maxValue));
  }
}

Unit test for the Reducer:

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.*;

public class MaxTemperatureReducerTest {

  @Test
  public void returnsMaximumIntegerInValues() throws IOException, InterruptedException {
    new ReduceDriver<Text, IntWritable, Text, IntWritable>()
        .withReducer(new MaxTemperatureReducer())
        .withInput(new Text("1950"),
            Arrays.asList(new IntWritable(10), new IntWritable(5)))
        .withOutput(new Text("1950"), new IntWritable(10))
        .runTest();
  }
}

5. Writing the job driver

Using the Tool interface, it is easy to write a driver to run a MapReduce job; a sketch is given below.
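
The driver itself is not listed in these notes; a minimal sketch, assuming the class lives in the v2 package referenced by the commands below and reuses the MaxTemperatureMapper and MaxTemperatureReducer defined above:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>%n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    Job job = Job.getInstance(getConf(), "Max temperature");
    job.setJarByClass(getClass());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // waitForCompletion() submits the job and waits for it to finish
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner parses the generic options (-conf, -D, -fs, -jt) before calling run()
    System.exit(ToolRunner.run(new MaxTemperatureDriver(), args));
  }
}

Because the driver runs through ToolRunner, the generic options used in the commands below (-conf, -fs, -jt) are parsed automatically before run() is called.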

Then run the driver locally.

% mvn compile
% export HADOOP_CLASSPATH=target/classes/
% hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml input/ncdc/micro output

% hadoop v2.MaxTemperatureDriver -fs file:/// -jt local input/ncdc/micro output

The local job runner uses a single JVM to run a job, so as long as all the classes that your job needs are on its classpath, things will just work.

6. Running on a cluster

  • A job's classes must be packaged into a job JAR file to send to the cluster; typical packaging and launch commands are sketched below.
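
A sketch of typical packaging and launch commands (the JAR path and the input/output paths are assumptions; adjust them to your project):

% mvn package -DskipTests
% unset HADOOP_CLASSPATH
% hadoop jar target/hadoop-examples.jar v2.MaxTemperatureDriver -conf conf/hadoop-cluster.xml input/ncdc/all max-temp

Unsetting HADOOP_CLASSPATH avoids shadowing classes that are already inside the job JAR, and setJarByClass() in the driver tells Hadoop which JAR to ship to the cluster.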

 

 
