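1. Automatic text classification takes large volumes of unstructured text (documents, web pages, and so on) and assigns each item to a category in a predefined taxonomy according to its content. It is a supervised learning process.
Using statistical methods and the vector space model, common text and web page content can be classified with an accuracy above 85%, at a speed of 50 documents per second.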
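As a rough illustration of the vector space model mentioned above, the sketch below turns tokenized documents into TF-IDF weights in plain Scala, without Spark. The object and function names are made up for this example. The smoothed IDF formula log((N + 1) / (df + 1)) is the one documented for Spark MLlib, and giving unseen words a weight of 0 is a simplification chosen here.

```scala
object TfIdfSketch {
  // Term frequency: raw count of each word in one tokenized document.
  def termFrequencies(doc: Seq[String]): Map[String, Double] =
    doc.groupBy(identity).map { case (w, occ) => w -> occ.size.toDouble }

  // Smoothed inverse document frequency: log((N + 1) / (df + 1)),
  // where N is the number of documents and df the number containing the word.
  def inverseDocFrequencies(docs: Seq[Seq[String]]): Map[String, Double] = {
    val n = docs.size.toDouble
    val df = docs.flatMap(_.distinct).groupBy(identity)
      .map { case (w, occ) => w -> occ.size.toDouble }
    df.map { case (w, d) => w -> math.log((n + 1.0) / (d + 1.0)) }
  }

  // TF-IDF weight per word; words missing from the IDF table get weight 0
  // (a simplification for this sketch).
  def tfIdf(doc: Seq[String], idf: Map[String, Double]): Map[String, Double] =
    termFrequencies(doc).map { case (w, tf) => w -> tf * idf.getOrElse(w, 0.0) }

  def main(args: Array[String]): Unit = {
    val docs = Seq(
      Seq("spark", "text", "classification"),
      Seq("spark", "naive", "bayes"),
      Seq("text", "segmentation")
    )
    val idf = inverseDocFrequencies(docs)
    docs.foreach(d => println(tfIdf(d, idf)))
  }
}
```

Words that occur in every document get an IDF of 0 under this formula, so they contribute nothing to the document vector.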
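2. Text has to be segmented into words before it can be classified. For word segmentation, see the article "Four Common Automatic Text Segmentation Methods Explained, with an IK Analyzer Code Implementation".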
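To make the segmentation step concrete, here is a minimal sketch of forward maximum matching, a common dictionary-based segmentation method. It is not necessarily one of the four methods in the referenced article, and it is not IK Analyzer's implementation. The dictionary and the `maxLen` parameter are invented for the example. The Spark `Tokenizer` used in the code in section 4 lowercases its input and splits it on whitespace, so Chinese text would need to be segmented first and the words joined with spaces.

```scala
object ForwardMaxMatch {
  // Greedy forward maximum matching: at each position take the longest
  // dictionary word starting there; fall back to a single character.
  def segment(text: String, dict: Set[String], maxLen: Int): List[String] = {
    @scala.annotation.tailrec
    def loop(pos: Int, acc: List[String]): List[String] =
      if (pos >= text.length) acc.reverse
      else {
        val longest = math.min(maxLen, text.length - pos)
        val word = (longest to 2 by -1).iterator
          .map(len => text.substring(pos, pos + len))
          .find(dict.contains)
          .getOrElse(text.substring(pos, pos + 1))
        loop(pos + word.length, word :: acc)
      }
    loop(0, Nil)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical toy dictionary, for illustration only.
    val dict = Set("自动", "文本", "分类", "分词")
    // Join with spaces so a whitespace tokenizer can split the result.
    println(segment("自动文本分类", dict, 4).mkString(" "))
  }
}
```

Greedy matching is simple and fast, but it can pick the wrong split when dictionary words overlap, which is one reason dedicated analyzers exist.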
3. Without further ado, here is the code. For the theory behind it, see https://www.cnblogs.com/pinard/p/6069267.html
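For readers who want the core idea without following the link, a standard textbook form of the multinomial Naive Bayes decision rule with additive smoothing is given below. It is a summary under the usual assumptions and may not match the linked article's notation or Spark's implementation in every detail.

```latex
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{w \in V} x_w \log \theta_{c,w} \Big],
\qquad
\theta_{c,w} = \frac{N_{c,w} + \lambda}{N_c + \lambda \lvert V \rvert}
```

Here $x_w$ is the value of feature $w$ in the document being classified, $N_{c,w}$ is the total value of feature $w$ over the training documents of class $c$, $N_c = \sum_{w} N_{c,w}$, $\lvert V \rvert$ is the number of features, $P(c)$ is estimated from the class frequencies in the training data, and $\lambda$ is the smoothing parameter (the code in section 4 passes `lambda = 1.0`, i.e. Laplace smoothing). In that code the feature values are TF-IDF weights rather than raw counts.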
4. Code
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.sql.Row

    // Note: this appears to target the Spark 1.x API (SQLContext, mllib vectors
    // coming out of the ml.feature transformers).
    object TestNaiveBayes {

      case class RawDataRecord(category: String, text: String)

      def main(args: Array[String]): Unit = {
        // For a cluster run, use instead: new SparkConf().setMaster("yarn-client")
        val conf = new SparkConf().setMaster("local").setAppName("reduce")
        val sc = new SparkContext(conf)

        val sqlContext = new org.apache.spark.sql.SQLContext(sc)
        import sqlContext.implicits._

        // Training data: each line is "<category>,<text>". The category must be
        // numeric (it is converted with toDouble below) and the text must be
        // words separated by whitespace.
        val srcRDD = sc.textFile("C:/Users/dell/Desktop/大数据/分类细胞词库").map { x =>
          val data = x.split(",")
          RawDataRecord(data(0), data(1))
        }
        val trainingDF = srcRDD.toDF()

        // Split the text into an array of words
        val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
        val wordsData = tokenizer.transform(trainingDF)
        println("output1:")
        wordsData.select($"category", $"text", $"words").take(1).foreach(println)

        // Compute the term frequency of each word in each document
        val hashingTF = new HashingTF().setNumFeatures(50000).setInputCol("words").setOutputCol("rawFeatures")
        val featurizedData = hashingTF.transform(wordsData)
        println("output2:")
        featurizedData.select($"category", $"words", $"rawFeatures").take(1).foreach(println)

        // Compute the TF-IDF weight of each word
        val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
        val idfModel = idf.fit(featurizedData)
        val rescaledData = idfModel.transform(featurizedData)
        println("output3:")
        rescaledData.select($"category", $"features").take(1).foreach(println)

        // Convert to the input format expected by NaiveBayes
        val trainDataRdd = rescaledData.select($"category", $"features").rdd.map {
          case Row(label: String, features: Vector) =>
            LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
        }
        println("output4:")
        // The original called take(1) without printing the result
        trainDataRdd.take(1).foreach(println)

        // Train the model
        val model = NaiveBayes.train(trainDataRdd, lambda = 1.0, modelType = "multinomial")

        // Hot-word data to classify, in the same "<category>,<text>" format
        val srcRDD1 = sc.textFile("C:/Users/dell/Desktop/大数据/热词细胞词库/热词数据1.txt").map { x =>
          val data = x.split(",")
          RawDataRecord(data(0), data(1))
        }
        val testDF = srcRDD1.toDF()

        // Apply the same feature extraction and format conversion to the hot-word data
        val testwordsData = tokenizer.transform(testDF)
        val testfeaturizedData = hashingTF.transform(testwordsData)
        val testrescaledData = idfModel.transform(testfeaturizedData)
        val testDataRdd = testrescaledData.select($"category", $"features").rdd.map {
          case Row(label: String, features: Vector) =>
            LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
        }

        // Classify the hot-word data with the trained model. The model comes from
        // the categorized cell lexicon prepared in advance (the training data above).
        val testpredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
        println("output5:")
        testpredictionAndLabel.foreach(println)

        sc.stop()
      }
    }
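The listing only prints the `(prediction, label)` pairs. A natural follow-up is to summarize them as an accuracy figure. The sketch below does this in plain Scala on an ordinary collection. `AccuracySketch` and its sample data are invented for illustration and are not part of the original code.

```scala
object AccuracySketch {
  // Fraction of (prediction, label) pairs in which the two values agree.
  // Returns 0.0 for an empty input to avoid dividing by zero.
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    if (pairs.isEmpty) 0.0
    else pairs.count { case (pred, label) => pred == label }.toDouble / pairs.size

  def main(args: Array[String]): Unit = {
    // Made-up sample pairs, standing in for collected prediction results.
    val pairs = Seq((0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (1.0, 1.0))
    println("accuracy = " + accuracy(pairs))
  }
}
```

On the RDD from the listing, the same idea could be written as `testpredictionAndLabel.filter(p => p._1 == p._2).count().toDouble / testpredictionAndLabel.count()`, assuming the labels in the test file are the true categories.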
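I found this code online several days ago and at first could not track down the source. If it infringes on anyone's rights, I will take it down.
Found it: https://blog.csdn.net/yumingzhu1/article/details/85064047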
5. JAR dependencies
You may not need all of them; work out which ones your setup requires.
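As an alternative to collecting JARs by hand, the three Spark modules the code imports from (core, sql, mllib) could be declared in an sbt build. The fragment below is only a sketch. The project name, Scala version, and Spark version are assumptions, not taken from the original post, and were picked because the code uses the Spark 1.x `SQLContext` API.

```scala
// build.sbt (sketch; name and versions are assumptions, not from the original post)
name := "test-naive-bayes"

scalaVersion := "2.10.6"

// Assumed: a Spark 1.x release, matching the SQLContext-era API used in the code
val sparkVersion = "1.6.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  "org.apache.spark" %% "spark-sql"   % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion
)
```

If your cluster runs a different Spark version, match `sparkVersion` and `scalaVersion` to it rather than to the values shown here.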
If anything is missing or unclear, leave a comment. It is late, so I will stop here for now.
Original post: https://www.cnblogs.com/zpsblog/p/10591136.html
Time: 2024-10-08 15:25:47