In the previous lessons we studied the MapReduce programming model and worked through examples. In this lesson you will put that knowledge into practice by completing the following project yourself.
Project Requirements
An English book contains thousands of words and phrases. Our task is to scan this large collection of words and find all anagrams, that is, groups of words composed of exactly the same letters.
Dataset
Below is an excerpt of words taken from an English book. Download the full dataset via the link provided.
initiate initiated initiates initiating initiation initiations initiative initiatives initiator initiators initiatory inject injectant injected injecting injection injections injector injectors injects
Approach
Based on the requirements above, the job can be completed in two steps:
1. In the Map phase, sort the letters of each word to produce sortedWord, then emit the key/value pair (sortedWord, word).
2. In the Reduce phase, collect all the anagrams whose letters sort to the same key.
Data Flow Illustration
Given the words below, find the anagram groups:
cat tar bar act rat
Step 1: output of the map phase
<act, cat> <art, tar> <abr, bar> <act, act> <art, rat>
Step 2: output of the reduce phase (grouped by key)
<abr, bar> <act, cat,act> <art, tar,rat>
Note that the final program only emits groups containing at least two words, so a singleton group such as <abr, bar> is filtered out by the reducer.
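The core idea of the flow above is that sorting a word's letters yields a canonical key shared by all of its anagrams. This can be illustrated in plain Java, without Hadoop (the class and method names here are our own, for illustration only):

```java
import java.util.Arrays;

public class AnagramKey {
    // Sort the characters of a word to obtain its canonical anagram key.
    // All anagrams of the same word produce the same key.
    static String sortLetters(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        for (String w : new String[] { "cat", "tar", "bar", "act", "rat" }) {
            System.out.println(sortLetters(w) + " " + w);
        }
    }
}
```

Running this prints exactly the key/value pairs shown in Step 1: act cat, art tar, abr bar, act act, art rat.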
Program Development
1. Write the driver class: AnagramMain
package com.hadoop.test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AnagramMain extends Configured implements Tool {

    @SuppressWarnings("deprecation")
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Delete the output directory if it already exists
        Path mypath = new Path(args[1]);
        FileSystem hdfs = mypath.getFileSystem(conf);
        if (hdfs.isDirectory(mypath)) {
            hdfs.delete(mypath, true);
        }

        Job job = new Job(conf, "testAnagram");
        job.setJarByClass(AnagramMain.class);      // main class
        job.setMapperClass(AnagramMapper.class);   // Mapper
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setReducerClass(AnagramReducer.class); // Reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path

        // Return a nonzero exit code if the job fails
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        String[] args0 = { "hdfs://cloud004:9000/anagram/anagram.txt",
                           "hdfs://cloud004:9000/anagram/output" };
        int ec = ToolRunner.run(new Configuration(), new AnagramMain(), args0);
        System.exit(ec);
    }
}
2. Write the Mapper: AnagramMapper
package com.hadoop.test;

import java.io.IOException;
import java.util.Arrays;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AnagramMapper extends Mapper<Object, Text, Text, Text> {

    private Text sortedText = new Text();
    private Text originalText = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // An input line may contain one or more whitespace-separated words
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            char[] wordChars = word.toCharArray();    // word -> char array
            Arrays.sort(wordChars);                   // sort chars alphabetically
            String sortedWord = new String(wordChars);
            sortedText.set(sortedWord);               // output key
            originalText.set(word);                   // output value
            context.write(sortedText, originalText);  // map output
        }
    }
}
3. Write the Reducer: AnagramReducer
package com.hadoop.test;

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class AnagramReducer extends Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    @Override
    public void reduce(Text anagramKey, Iterable<Text> anagramValues, Context context)
            throws IOException, InterruptedException {
        // Join all words that share the same sorted letters with "~"
        String output = "";
        for (Text anagram : anagramValues) {
            if (!output.equals("")) {
                output = output + "~";
            }
            output = output + anagram.toString();
        }
        // Only emit groups that contain at least two anagrams
        StringTokenizer outputTokenizer = new StringTokenizer(output, "~");
        if (outputTokenizer.countTokens() >= 2) {
            output = output.replace("~", ",");
            outputKey.set(anagramKey.toString()); // output key
            outputValue.set(output);              // output value
            context.write(outputKey, outputValue);
        }
    }
}
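Before submitting the job to a cluster, the shuffle-and-reduce grouping above can be simulated locally with an ordinary map, which is convenient for sanity-checking the logic on small inputs. The following sketch (class and method names are illustrative, not part of the project code) reproduces the grouping and the "at least two anagrams" filter from the reducer:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalAnagramGrouper {
    // Group words by their sorted-letter key, keeping only groups of 2+ words,
    // mimicking what the Mapper/Reducer pair does on the cluster.
    static Map<String, List<String>> group(String[] words) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String word : words) {
            char[] chars = word.toCharArray();
            Arrays.sort(chars);
            String sortedKey = new String(chars);
            groups.computeIfAbsent(sortedKey, k -> new ArrayList<>()).add(word);
        }
        // Same filter as the reducer: drop singleton groups
        groups.values().removeIf(g -> g.size() < 2);
        return groups;
    }

    public static void main(String[] args) {
        Map<String, List<String>> result =
                group(new String[] { "cat", "tar", "bar", "act", "rat" });
        System.out.println(result); // {act=[cat, act], art=[tar, rat]}
    }
}
```

On the sample words from the illustration above, this yields the groups act=[cat, act] and art=[tar, rat], matching the expected reduce output after filtering.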
Compiling and Running the MapReduce Job
1. Compile and package the project as anagram.jar, then upload anagram.jar to the /home/hadoop/djt directory on the Hadoop node using an SSH client.
2. Run cd /home/hadoop/djt to switch to that directory, then launch the job from the command line.
hadoop jar anagram.jar com.hadoop.test.AnagramMain
Checking the Results
The final output of the job is written to HDFS; view it with the following command.
[hadoop@cloud004 hadoop-2.2.0-x64]$ hadoop fs -cat /anagram/output/part-r-00000
Part of the result set is shown below.
cehors cosher,chores,ochres,ochers
cehorst troches,hectors,torches
cehort troche,hector
cehortu toucher,couther,retouch
cehoss coshes,choses
cehrt chert,retch
cehstu chutes,tusche
cehsty chesty,scythe
ceht etch,tech
ceiijstu jesuitic,juiciest
ceiikst ickiest,ekistic
ceiilnos isocline,silicone
ceiilnoss isoclines,silicones
ceiimmnoorss commissioner,recommission
ceiimmnoorsss recommissions,commissioners
ceiimorst isometric,eroticism
ceiimost semiotic,comities
ceiinnopst inceptions,inspection
ceiinrsstu scrutinies,scrutinise
ceiinrst citrines,crinites,inciters
ceiinrt citrine,inciter
ceiinss iciness,incises
ceiintz citizen,zincite
ceiist iciest,cities
ceikln nickel,nickle
ceiklnr crinkle,clinker
ceiklnrs clinkers,crinkles
ceiklns nickels,nickles
ceiklrs slicker,lickers
ceiklrsst sticklers,strickles
ceiklrst trickles,ticklers,stickler
ceiklrt tickler,trickle
ceiklsst slickest,stickles
ceiklst keltics,stickle,tickles
ceiklt tickle,keltic
ceiknrs nickers,snicker
ceikorr rockier,corkier
ceikorst stockier,corkiest,rockiest
ceikpst skeptic,pickets
ceikrst rickets,tickers,sticker
ceil lice,ceil
ceilmop compile,polemic
ceilmopr compiler,complier
ceilmoprs compliers,compilers
ceilmops polemics,complies,compiles
ceilnoos colonise,colonies
ceilnors incloser,licensor
ceilnorss inclosers,licensors
Date: 2024-10-28 11:36:57