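Mahout Classification: What I Learned and the Problems I Ran Into

Learning Mahout over the past while has had its ups and downs. First of all, thanks to my teacher Fan Zhe (樊哲) for his guidance. Below I list what I learned about classification with Mahout and the problems I ran into; suggestions are very welcome. (All file operations were done on HDFS.)

(My environment is Mahout 0.9 + hadoop-2.2.0.)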

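I. First, convert the pre-classified files into sequence-file storage:

The image below shows the 20newsgroup data I used (I work in Eclipse on Linux, with the eclipse-hadoop plugin installed in Eclipse). The data looks like this:

Next, I wrote Java code to convert the 20newsgroup category files into a sequence file, as shown below:

Running the program above produces the following sequence file:

A small program that prints the contents of the sequence file shows the following: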

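II. Convert the sequence file into vector files:

Taking the sequence file produced above as input, convert it into vector files. The Java conversion code is as follows:

The resulting vector files are shown below:

Using a Java program to print part-m-00000 under tfidf-vectors gives the following (partial output):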

key= /hdfs://h1:9000/classifier/classifier-src4/20news-bydate-train/sci.med/58064   value= /hdfs://h1:9000/classifier/classifier-src4/20news-bydate-train/sci.med/58064:{40949:4.404207229614258,58885:6.966500282287598,52884:1.404096007347107,54411:6.043336868286133,25074:16.67474365234375,20175:6.222922325134277,32290:7.155742168426514,57777:6.696209907531738,59820:5.825411796569824,53261:5.4359564781188965,15068:9.235183715820312,68017:4.232685565948486,47050:9.235183715820312,39193:8.724358558654785,56670:3.387782096862793,69370:1.5702117681503296,21521:7.7688469886779785,54393:5.597597599029541,62286:4.329909324645996,11644:7.128331661224365,12511:1.7634414434432983,17861:4.079967498779297,39895:7.338063716888428,55445:7.62574577331543,50765:9.235183715820312,41274:7.338063716888428,19358:7.155742168426514,19837:9.235183715820312,46237:9.235183715820312,24355:2.2430875301361084,52769:7.289273738861084,24217:1.9966870546340942,19749:3.0007731914520264,28651:7.114920139312744,24284:2.4492197036743164,61132:6.2394514083862305,39146:5.458599090576172,34340:7.289273738861084,34936:7.443424224853516,10911:6.129103660583496,12647:8.254354476928711,47554:1.0402182340621948,40788:7.338063716888428,10482:7.037959098815918,62155:3.3696606159210205,33813:1.230707049369812,24044:8.947502136230469,63344:5.73867654800415,68080:6.2394514083862305,17029:5.118860244750977,65110:7.075699806213379,32127:5.7694478034973145,30288:3.71773099899292,13537:3.087428092956543,11545:8.254354476928711,47708:7.935900688171387,47276:2.321446418762207,24045:8.947502136230469,56652:3.911620616912842,10107:12.653678894042969,10233:3.475231170654297,34068:5.321562767028809,58884:7.338063716888428,68165:1.6761456727981567,40591:9.235183715820312,47733:11.100005149841309,58624:4.493154525756836,50783:11.100005149841309,44628:6.572596073150635}

key= /hdfs://h1:9000/classifier/classifier-src4/20news-bydate-train/talk.politics.guns/54390   value= /hdfs://h1:9000/classifier/classifier-src4/20news-bydate-train/talk.politics.guns/54390:{25902:9.235183715820312,47590:4.066595554351807,48033:3.2437193393707275,67021:2.7126009464263916,24472:8.947502136230469,44782:1.9297716617584229,42373:5.015676021575928,24015:2.4575374126434326,11106:1.9394488334655762,42273:2.070721387863159,62394:3.628157138824463,61226:7.848889350891113,44453:2.7429440021514893,21501:15.389477729797363,32332:3.7448697090148926,64726:1.9358370304107666,48742:7.499622821807861,60003:15.995806694030762,50448:4.258450031280518,14327:3.194135904312134,50798:7.7688469886779785,47387:3.7190706729888916,47554:1.0402182340621948,11201:3.9916746616363525,53791:2.5926971435546875,19329:7.785928249359131,52953:2.958540439605713,13970:5.588863849639893,46327:8.387886047363281,58697:4.107259273529053,49227:15.995806694030762,18911:6.383354663848877,50439:3.510509967803955,9861:1.9852583408355713,56602:3.647935152053833,50458:1.7995492219924927,36905:5.748828887939453,66718:5.264892101287842,33813:2.7519447803497314,68017:2.9929606914520264,23442:3.458564043045044,1890:3.5897369384765625,44013:5.357062339782715,35455:3.657973051071167,65123:2.995558023452759,56080:7.561207294464111,40309:4.490251541137695,26572:8.628969192504883,23439:3.0159199237823486,27894:5.060796737670898,46052:1.8622281551361084,18320:5.84535026550293,17803:3.757326602935791,33291:1.8489196300506592}

key= /hdfs://h1:9000/classifier/classifier-src4/20news-bydate-train/comp.windows.x/67312   value= /hdfs://h1:9000/classifier/classifier-src4/20news-bydate-train/comp.windows.x/67312:{47554:1.0402182340621948,6258:3.1925511360168457,33291:1.8489196300506592,788:5.197997570037842,32485:4.425988674163818,53061:4.7731146812438965,68774:6.114288330078125,4487:4.123196125030518,65155:4.61021089553833,65670:5.0815229415893555,5285:10.784433364868164,50458:1.7995492219924927,35455:5.173154830932617,46052:1.8622281551361084,27311:8.298446655273438,8749:3.7985548973083496,26321:6.401970386505127,4587:13.633935928344727,1109:3.6212668418884277,34867:6.085300922393799,41201:3.171398639678955,25533:13.633935928344727}
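Each value printed above is a sparse vector: a list of term-index:weight pairs, where only non-zero TF-IDF weights are stored. As a minimal sketch in plain Java (no Mahout or Hadoop classes), the printed form can be read back as shown below. It assumes the text always looks like prefix:{index:weight,...} as in the dump above; the class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SparseVectorDump {
    /** Parses "...{index:weight,index:weight,...}" into an ordered index -> weight map. */
    static Map<Integer, Double> parse(String dump) {
        Map<Integer, Double> entries = new LinkedHashMap<>();
        // Keep only the text between the braces; the document key in front
        // of the brace contains ':' and '/' characters and is ignored.
        String body = dump.substring(dump.indexOf('{') + 1, dump.lastIndexOf('}')).trim();
        if (body.isEmpty()) {
            return entries;
        }
        for (String pair : body.split(",")) {
            String[] kv = pair.split(":");
            entries.put(Integer.parseInt(kv[0].trim()), Double.parseDouble(kv[1].trim()));
        }
        return entries;
    }

    public static void main(String[] args) {
        // First three entries of the comp.windows.x/67312 vector shown above.
        String dump = "/hdfs://h1:9000/classifier/classifier-src4/20news-bydate-train/"
                + "comp.windows.x/67312:{47554:1.0402182340621948,"
                + "6258:3.1925511360168457,33291:1.8489196300506592}";
        Map<Integer, Double> vector = parse(dump);
        System.out.println("non-zero entries: " + vector.size());
        System.out.println("weight of term 47554: " + vector.get(47554));
    }
}
```

Term 47554 shows up in all three sample documents above with the same low weight, which is consistent with a term that occurs in many documents.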

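III. Generate the training model:

Taking the part-m-00000 file under tfidf-vectors generated above as input, the training model is generated with the following code:

The result is shown below:

(1) The trained model

(2) The indexlabel file

Looking at the contents of indexlabel, I found that it is incorrect.

The contents of indexlabel are as follows (clearly wrong, and I do not know why; I searched a lot online but nothing resolved the error):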

key= hdfs:   value= 0

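The fix I tried myself: in the Mahout source file org.apache.mahout.classifier.naivebayes.BayesUtils.java, change toString())[1] to toString())[7], as shown below:

The reason for changing this array index is that BayesUtils.java splits the key on "/". With keys such as /hdfs://h1:9000/classifier/classifier-src4/20news-bydate-train/sci.med/58064, element [1] of the resulting array is "hdfs:" (exactly what showed up in indexlabel), while element [7] is the category label I need, e.g. sci.med.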
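The effect of the index can be checked without Mahout at all. The following is a small sketch in plain Java that splits one of the keys from the vector dump above on "/" and prints every element. It only illustrates the splitting behaviour; it is not the Mahout source, and the class and method names are made up.

```java
import java.util.regex.Pattern;

class LabelFromKey {
    private static final Pattern SLASH = Pattern.compile("/");

    /** Returns the element at the given position after splitting the key on "/". */
    static String partAt(String key, int index) {
        return SLASH.split(key)[index];
    }

    public static void main(String[] args) {
        String key = "/hdfs://h1:9000/classifier/classifier-src4/"
                + "20news-bydate-train/sci.med/58064";
        String[] parts = SLASH.split(key);
        // The leading "/" and the "//" after "hdfs:" both produce empty elements.
        for (int i = 0; i < parts.length; i++) {
            System.out.println(i + " -> \"" + parts[i] + "\"");
        }
        System.out.println("index 1: " + partAt(key, 1)); // hdfs:
        System.out.println("index 7: " + partAt(key, 7)); // sci.med
    }
}
```

Note that index 7 only works for this particular directory depth. If the data moved to a path with more or fewer components, the label would sit at a different position.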

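Changed to:

The indexlabel file now contains the following (all 20 category labels are shown correctly), as shown below:

However, a new error appeared when running the model training (see the run output below):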

2014-07-21 20:45:27,738 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir

2014-07-21 20:45:27,738 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.compress.map.output is deprecated. Instead, use mapreduce.map.output.compress

2014-07-21 20:45:27,738 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir

2014-07-21 20:45:27,837 INFO  [main] client.RMProxy (RMProxy.java:createRMProxy(56)) - Connecting to ResourceManager at h1/192.168.1.130:8032

2014-07-21 20:45:28,085 WARN  [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(149)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

2014-07-21 20:45:28,963 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(287)) - Total input paths to process : 1

2014-07-21 20:45:29,025 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(394)) - number of splits:1

2014-07-21 20:45:29,036 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - user.name is deprecated. Instead, use mapreduce.job.user.name

2014-07-21 20:45:29,037 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.jar is deprecated. Instead, use mapreduce.job.jar

2014-07-21 20:45:29,037 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.cache.files.filesizes is deprecated. Instead, use mapreduce.job.cache.files.filesizes

2014-07-21 20:45:29,037 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.cache.files is deprecated. Instead, use mapreduce.job.cache.files

2014-07-21 20:45:29,038 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class

2014-07-21 20:45:29,038 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class

2014-07-21 20:45:29,038 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapreduce.combine.class is deprecated. Instead, use mapreduce.job.combine.class

2014-07-21 20:45:29,045 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class

2014-07-21 20:45:29,045 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.job.name is deprecated. Instead, use mapreduce.job.name

2014-07-21 20:45:29,046 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class

2014-07-21 20:45:29,046 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class

2014-07-21 20:45:29,046 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class

2014-07-21 20:45:29,046 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps

2014-07-21 20:45:29,046 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.cache.files.timestamps is deprecated. Instead, use mapreduce.job.cache.files.timestamps

2014-07-21 20:45:29,046 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class

2014-07-21 20:45:29,047 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class

2014-07-21 20:45:29,047 INFO  [main] Configuration.deprecation (Configuration.java:warnOnceIfDeprecated(840)) - mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir

2014-07-21 20:45:29,110 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(477)) - Submitting tokens for job: job_1405911825122_0013

2014-07-21 20:45:29,310 INFO  [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(174)) - Submitted application application_1405911825122_0013 to ResourceManager at h1/192.168.1.130:8032

2014-07-21 20:45:29,349 INFO  [main] mapreduce.Job (Job.java:submit(1272)) - The url to track the job: http://h1:8088/proxy/application_1405911825122_0013/

2014-07-21 20:45:29,349 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1317)) - Running job: job_1405911825122_0013

2014-07-21 20:45:34,926 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1338)) - Job job_1405911825122_0013 running in uber mode : false

2014-07-21 20:45:34,928 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1345)) -  map 0% reduce 0%

2014-07-21 20:45:44,229 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1345)) -  map 100% reduce 0%

2014-07-21 20:45:49,803 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1345)) -  map 100% reduce 100%

2014-07-21 20:45:49,818 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1356)) - Job job_1405911825122_0013 completed successfully

2014-07-21 20:45:49,910 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1363)) - Counters: 44

File System Counters

FILE: Number of bytes read=680

FILE: Number of bytes written=161999

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=18538518

HDFS: Number of bytes written=97

HDFS: Number of read operations=7

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=1

Launched reduce tasks=1

Data-local map tasks=1

Total time spent by all maps in occupied slots (ms)=42768

Total time spent by all reduces in occupied slots (ms)=2352

Map-Reduce Framework

Map input records=11314

Map output records=0

Map output bytes=0

Map output materialized bytes=14

Input split bytes=151

Combine input records=0

Combine output records=0

Reduce input groups=0

Reduce shuffle bytes=14

Reduce input records=0

Reduce output records=0

Spilled Records=0

Shuffled Maps =1

Failed Shuffles=0

Merged Map outputs=1

GC time elapsed (ms)=427

CPU time spent (ms)=3470

Physical memory (bytes) snapshot=270143488

Virtual memory (bytes) snapshot=808521728

Total committed heap usage (bytes)=201457664

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=18538367

File Output Format Counters

Bytes Written=97

org.apache.mahout.classifier.naivebayes.training.IndexInstancesMapper$Counter

SKIPPED_INSTANCES=11314

2014-07-21 20:45:49,934 INFO  [main] client.RMProxy (RMProxy.java:createRMProxy(56)) - Connecting to ResourceManager at h1/192.168.1.130:8032

2014-07-21 20:45:49,961 WARN  [main] mapreduce.JobSubmitter (JobSubmitter.java:copyAndConfigureFiles(149)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

2014-07-21 20:45:50,218 INFO  [main] input.FileInputFormat (FileInputFormat.java:listStatus(287)) - Total input paths to process : 1

2014-07-21 20:45:50,245 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(394)) - number of splits:1

2014-07-21 20:45:50,853 INFO  [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(477)) - Submitting tokens for job: job_1405911825122_0014

2014-07-21 20:45:50,905 INFO  [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(174)) - Submitted application application_1405911825122_0014 to ResourceManager at h1/192.168.1.130:8032

2014-07-21 20:45:50,908 INFO  [main] mapreduce.Job (Job.java:submit(1272)) - The url to track the job: http://h1:8088/proxy/application_1405911825122_0014/

2014-07-21 20:45:50,908 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1317)) - Running job: job_1405911825122_0014

2014-07-21 20:46:01,907 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1338)) - Job job_1405911825122_0014 running in uber mode : false

2014-07-21 20:46:01,907 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1345)) -  map 0% reduce 0%

2014-07-21 20:46:06,258 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1345)) -  map 100% reduce 0%

2014-07-21 20:46:11,815 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1345)) -  map 100% reduce 100%

2014-07-21 20:46:11,824 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1356)) - Job job_1405911825122_0014 completed successfully

2014-07-21 20:46:11,852 INFO  [main] mapreduce.Job (Job.java:monitorAndPrintJob(1363)) - Counters: 43

File System Counters

FILE: Number of bytes read=22

FILE: Number of bytes written=162241

FILE: Number of read operations=0

FILE: Number of large read operations=0

FILE: Number of write operations=0

HDFS: Number of bytes read=237

HDFS: Number of bytes written=90

HDFS: Number of read operations=7

HDFS: Number of large read operations=0

HDFS: Number of write operations=2

Job Counters

Launched map tasks=1

Launched reduce tasks=1

Data-local map tasks=1

Total time spent by all maps in occupied slots (ms)=12960

Total time spent by all reduces in occupied slots (ms)=2214

Map-Reduce Framework

Map input records=0

Map output records=0

Map output bytes=0

Map output materialized bytes=14

Input split bytes=140

Combine input records=0

Combine output records=0

Reduce input groups=0

Reduce shuffle bytes=14

Reduce input records=0

Reduce output records=0

Spilled Records=0

Shuffled Maps =1

Failed Shuffles=0

Merged Map outputs=1

GC time elapsed (ms)=31

CPU time spent (ms)=1390

Physical memory (bytes) snapshot=300437504

Virtual memory (bytes) snapshot=809414656

Total committed heap usage (bytes)=225509376

Shuffle Errors

BAD_ID=0

CONNECTION=0

IO_ERROR=0

WRONG_LENGTH=0

WRONG_MAP=0

WRONG_REDUCE=0

File Input Format Counters

Bytes Read=97

File Output Format Counters

Bytes Written=90

Exception in thread "main" java.lang.NullPointerException

at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:187)

at org.apache.mahout.classifier.naivebayes.BayesUtils.readModelFromDir(BayesUtils.java:81)

at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:162)

at com.redhadoop.trainnb.TrainnbTest.main(TrainnbTest.java:48)

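The code at line 81 of BayesUtils.java, which the stack trace points to, is as follows:

IV. Test with the trained model

The Java test code is as follows:

Test result (clearly incorrect):

I struggled with this part for several days and nights, but the error (the part in red above) is still unsolved and I am stuck. Any help from the experts would be much appreciated o(╯□╰)o!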
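One lead worth checking: in the first job's counters, SKIPPED_INSTANCES=11314 equals Map input records=11314, so every training instance was skipped, and the second job then reports Map input records=0. A possible explanation, which is an assumption and not confirmed against the Mahout 0.9 source, is that the training mapper still looks up the label at position [1] of the key while the patched BayesUtils builds indexlabel from position [7]. The sketch below simulates that mismatch in plain Java and shows that rewriting the keys to the form /label/docid would make position [1] work on both sides without patching Mahout. All class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class LabelIndexMismatch {
    private static final Pattern SLASH = Pattern.compile("/");

    /** Builds label -> id from the keys, taking the path element at labelPos. */
    static Map<String, Integer> buildLabelIndex(List<String> keys, int labelPos) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String key : keys) {
            index.putIfAbsent(SLASH.split(key)[labelPos], index.size());
        }
        return index;
    }

    /** Counts keys whose element at lookupPos is not present in the label index. */
    static int countSkipped(List<String> keys, Map<String, Integer> index, int lookupPos) {
        int skipped = 0;
        for (String key : keys) {
            if (!index.containsKey(SLASH.split(key)[lookupPos])) {
                skipped++;
            }
        }
        return skipped;
    }

    /** Rewrites a key to "/<parent dir>/<file name>", e.g. "/sci.med/58064". */
    static String normalize(String key) {
        String[] parts = SLASH.split(key);
        return "/" + parts[parts.length - 2] + "/" + parts[parts.length - 1];
    }

    static List<String> normalizeAll(List<String> keys) {
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            out.add(normalize(key));
        }
        return out;
    }

    public static void main(String[] args) {
        String base = "/hdfs://h1:9000/classifier/classifier-src4/20news-bydate-train/";
        List<String> keys = Arrays.asList(
                base + "sci.med/58064",
                base + "talk.politics.guns/54390",
                base + "comp.windows.x/67312");

        // Index built at [7] (the patch), lookup at [1] (assumed mapper behaviour).
        Map<String, Integer> patched = buildLabelIndex(keys, 7);
        System.out.println("labels: " + patched.keySet());
        System.out.println("skipped, lookup at [1]: " + countSkipped(keys, patched, 1));

        // With keys of the form /label/docid, [1] is the label on both sides.
        List<String> normalized = normalizeAll(keys);
        Map<String, Integer> unpatched = buildLabelIndex(normalized, 1);
        System.out.println("skipped after normalizing: "
                + countSkipped(normalized, unpatched, 1));
    }
}
```

If this hypothesis holds, the place to apply the idea would be step I, where the sequence file keys are written, so that every later step sees keys like /sci.med/58064. Whether an empty model is also what triggers the NullPointerException in readModelFromDir is likewise only a guess.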

Time: 2024-12-21 06:11:19
