I've recently been working on a text-mining project that needs an N-gram model as well as the corresponding vector-based similarity matching.
The N-gram tokenization program
A reader asked me for it, so after some thought I decided to post it here. As for the jar dependencies, they are easy to find: the Lucene jars all turn up in a Baidu search.
package edu.fjnu.huanghong;

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class Ngram {

    public static void main(String[] args) {
        String s = "捡 白色 iphone6 手机 壳 透明 失主 方式 15659119418 ";
        // Remove the spaces so the tokenizer sees one continuous string
        String[] str = s.split(" ");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < str.length; i++) {
            sb.append(str[i]);
        }
        System.out.println(sb.toString());

        StringReader sr = new StringReader(sb.toString());
        // N-gram tokenizer (Lucene 4.5)
        Tokenizer tokenizer = new NGramTokenizer(Version.LUCENE_45, sr);
        testTokenizer(tokenizer);
    }

    private static void testTokenizer(Tokenizer tokenizer) {
        try {
            // Add the term attribute once, before reset(); it is reused on every iteration
            CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                System.out.print(charTermAttribute.toString() + "|");
            }
            tokenizer.end();
            tokenizer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
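As far as I know, the Lucene 4.x NGramTokenizer defaults to grams of length 1 to 2, so the loop above should print every single character and every adjacent character pair, separated by "|". If you need a different gram range, there is an overloaded constructor that takes explicit sizes; a minimal sketch (variable names are my own, same Lucene 4.5 jars assumed):

// Sketch only: restrict output to bigrams and trigrams
StringReader reader = new StringReader("捡白色iphone6手机壳透明失主方式15659119418");
Tokenizer bigramTokenizer = new NGramTokenizer(Version.LUCENE_45, reader, 2, 3); // minGram = 2, maxGram = 3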
Also, does anyone more experienced have material on q-grams? I couldn't find anything even after getting past the firewall. If you do, please message me privately; I'd be very grateful.
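For what it's worth, in the string-matching literature a "q-gram" usually just means a character n-gram of fixed length q, so the tokenizer above already produces them; string similarity is then often computed as cosine (or Jaccard) over the q-gram count vectors, which is the vector matching mentioned at the top. Below is a minimal, Lucene-free sketch of that idea; the class and method names (QGramSimilarity, qgramProfile, cosine) are mine, with q = 2 assumed.

import java.util.HashMap;
import java.util.Map;

public class QGramSimilarity {

    // Count every substring of length q (the q-gram profile of the string)
    static Map<String, Integer> qgramProfile(String s, int q) {
        Map<String, Integer> profile = new HashMap<>();
        for (int i = 0; i + q <= s.length(); i++) {
            profile.merge(s.substring(i, i + q), 1, Integer::sum);
        }
        return profile;
    }

    // Cosine similarity between two q-gram count vectors
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (int v : b.values()) normB += v * v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        String s1 = "捡白色iphone6手机壳";
        String s2 = "白色iphone6手机壳透明";
        // Closer to 1 means the two strings share more 2-grams
        System.out.println(cosine(qgramProfile(s1, 2), qgramProfile(s2, 2)));
    }
}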