Improved Hadoop Experiments (Chinese Word-Segmentation Word Count and English Word Count) (4/4)

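Notes:

1) This article is an original work by me, bitpeach. Please credit the source when reposting; infringement will be pursued.

2) This small experiment was carried out on Baidu Cloud under Windows (online) and on hadoop1-2-1 under Ubuntu (configured in advance). If the configuration is unclear, see "Initial Configuration for the Hadoop Word Count Mini-Experiment".

3) This article was too long to upload in one piece. For the adjacent related posts, see "Improved Hadoop Experiments (Chinese Word-Segmentation Word Count and English Word Count): Post Directory Structure", which links to the other three parts.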

(5) English Word Count in Single-Machine Pseudo-Distributed Mode with Python & Streaming

Python and Streaming Background

Python and Streaming

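Background: Python programs can also run on Hadoop. They cannot call the native Java MapReduce API directly, so they go through the Streaming interface instead, which is provided specifically for non-Java languages such as C and shell scripts.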
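To make the Streaming contract concrete: the mapper and reducer are ordinary programs that read lines from standard input and write lines to standard output. By default, Hadoop Streaming treats the text up to the first tab character as the key and the rest of the line as the value; a line with no tab is taken as a key with a null value. The sketch below only imitates that splitting rule in plain Python. The function name `split_key_value` is my own and is not part of Hadoop.

```python
def split_key_value(line, separator="\t"):
    """Imitate Hadoop Streaming's default key/value split (illustrative only)."""
    line = line.rstrip("\n")
    # partition() splits at the FIRST separator only, like Streaming's default
    key, sep, value = line.partition(separator)
    # no separator found: the whole line is the key and the value is null
    return key, (value if sep else None)


print(split_key_value("hadoop\t1\n"))
print(split_key_value("no-tab-here\n"))
```

This is why the scripts later in this post print `word<TAB>count`: the word becomes the key that Hadoop sorts and groups on.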

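1) Local single machine

In Hadoop versions before 0.21.0, the Hadoop Streaming tool supported only text data; from Hadoop 0.21.0 onward it also supports binary data. The command format for invoking a non-Java program through Hadoop Streaming is: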

Usage:
$HADOOP_HOME/bin/hadoop jar \
    $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar [options]

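The main options are roughly as follows:

(1) -input: input file path

(2) -output: output file path

(3) -mapper: the user's own mapper program, either an executable or a script

(4) -reducer: the user's own reducer program, either an executable or a script

(5) -file: packages a file into the submitted job; this can be the mapper or reducer script itself, or a file they need to read, such as a configuration file or a dictionary

(6) -partitioner: a user-defined partitioner program

(7) -combiner: a user-defined combiner program (early versions required a Java implementation; later versions also accept a streaming command)

(8) -D: job properties (formerly passed with -jobconf)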

For example, a concrete invocation could be:

$HADOOP_HOME/bin/hadoop jar \
    contrib/streaming/hadoop-0.20.2-streaming.jar \
    -input input \
    -output output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
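Because the command line is long and easy to mistype, it can help to assemble it in a script. The following is a minimal sketch, not part of Hadoop: the helper `build_streaming_command` and its parameter names are my own, and it only uses the options listed above. It places `-D` properties before the streaming options, since generic options must come first. The property `mapred.reduce.tasks` is used only as an example.

```python
def build_streaming_command(streaming_jar, input_path, output_path,
                            mapper, reducer, files=(), properties=None,
                            hadoop_bin="hadoop"):
    """Return the Hadoop Streaming command as a list of arguments (hypothetical helper)."""
    cmd = [hadoop_bin, "jar", streaming_jar]
    # generic options such as -D go before the streaming-specific options
    for name, value in sorted((properties or {}).items()):
        cmd += ["-D", "%s=%s" % (name, value)]
    cmd += ["-input", input_path, "-output", output_path,
            "-mapper", mapper, "-reducer", reducer]
    # ship each listed file with the job
    for path in files:
        cmd += ["-file", path]
    return cmd


cmd = build_streaming_command(
    "contrib/streaming/hadoop-0.20.2-streaming.jar",
    "input", "output", "mapper.py", "reducer.py",
    files=["mapper.py", "reducer.py"],
    properties={"mapred.reduce.tasks": 1},
)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.call(cmd)` on a machine where Hadoop is installed; printing it first is a cheap way to check the option spelling.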

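2) Baidu Open Cloud

Baidu Open Cloud is convenient because it already provides the Streaming interface. To get the same interface on a local machine, you have to invoke the streaming.jar package that ships with Hadoop, the command format is cumbersome, and it sometimes fails. Baidu Open Cloud is easier to use. There is a trade-off, though: Baidu Open Cloud always displays Chinese text as garbled characters, so Chinese processing still needs a local Hadoop platform.

As on a local machine, you need to write at least two Python scripts, one acting as the mapper and one as the reducer, before carrying out the remaining steps.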

The interface provided by Baidu Open Cloud is:
hadoop jar $hadoop_streaming -input Input -output Output -mapper "python mapper.py" -reducer "python reducer.py" -file mapper.py -file reducer.py

Once the environment is set up, it works well and succeeds right away.

Python English Word Count Experiment

Experiment Procedure


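All of the following steps were carried out on Baidu Open Cloud. To run them on a local machine, the principle is the same and the commands are essentially the same.

1) Prepare the data

To start with simple text, I uploaded three simple text files of English words. The figure below shows the contents of these files.

Next, we prepare the Python scripts. The listing below shows the contents of the two scripts.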

mapper.py:

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words; split() with no arguments
    # already discards empty strings
    words = line.split()
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))

reducer.py:

#!/usr/bin/env python

from operator import itemgetter
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the tab-delimited input we got from mapper.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # the line was malformed or count was not a number,
        # so silently ignore/discard this line
        continue
    word2count[word] = word2count.get(word, 0) + count

# sort the words lexicographically;
# this step is NOT required, we just do it so that our
# final output will look more like the official Hadoop
# word count examples
sorted_word2count = sorted(word2count.items(), key=itemgetter(0))

# write the results to STDOUT (standard output)
for word, count in sorted_word2count:
    print('%s\t%s' % (word, count))
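The reducer above keeps every word in a dictionary until the input ends. Hadoop sorts the mapper output by key before handing it to the reducer, so all lines for one word arrive together. A reducer can rely on that and total each word as it goes. The following is a minimal sketch of that alternative using `itertools.groupby`; it is not the script used in this experiment, and the function names `parse`, `reduce_sorted` and `main` are my own.

```python
import sys
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs, skipping malformed lines."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab: not a valid mapper output line
        try:
            yield word, int(count)
        except ValueError:
            continue  # count was not a number


def reduce_sorted(lines):
    """Sum counts per word; assumes the input is already sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def main(stream=sys.stdin):
    # in a real reducer script, call main() at the bottom of the file
    for word, total in reduce_sorted(stream):
        print("%s\t%d" % (word, total))


# small in-memory demonstration with made-up, already sorted input
for word, total in reduce_sorted(["a\t1\n", "a\t2\n", "b\t1\n"]):
    print("%s\t%d" % (word, total))
```

Only one word's running total is held at a time, which matters when the vocabulary is large. The price is that the result is only correct on sorted input.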

Next, upload the two scripts and run the command:

hadoop jar $hadoop_streaming -input Input -output Output -mapper "python mapper.py" -reducer "python reducer.py" -file mapper.py -file reducer.py
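Before submitting a job, it can be worth checking the map, sort and reduce logic locally, without Hadoop. The following is a minimal sketch that imitates the three stages in memory. The function names and the three sample lines are invented for illustration; they are not the three text files uploaded in this experiment.

```python
def map_words(lines):
    """Map stage: emit (word, 1) for every word, like mapper.py."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(pairs):
    """Reduce stage: total the counts per word, like reducer.py."""
    counts = {}
    for word, count in pairs:
        counts[word] = counts.get(word, 0) + count
    return sorted(counts.items())


# made-up sample input standing in for the uploaded text files
sample_lines = ["hello hadoop", "hello streaming", "hadoop hello"]

# sorting the mapper output stands in for Hadoop's shuffle-and-sort step
pairs = sorted(map_words(sample_lines))

for word, count in reduce_counts(pairs):
    print("%s\t%d" % (word, count))
```

If the counts look right here, a failure in the real job is more likely to come from the command line or the environment than from the scripts.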

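The job status while running is shown in the figure below:

Finally the results appear, as shown in the figure.

This concludes the English word count experiment in Streaming mode.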


Time: 2024-12-24 13:15:37
