Keyword Search Across Multiple Documents with Lucene: a Demo (Part 2)

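Last time, when building the index with Lucene, I used the built-in StandardAnalyzer. On Chinese text this analyzer simply splits mechanically, one character at a time, so Lucene cannot index Chinese well with it, and keyword search over Chinese does not work properly. In effect, the previous demo was only usable for English.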

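To solve this problem we can use IKAnalyzer. It is a Chinese word segmentation component built around the open-source Lucene project, and it combines dictionary-based segmentation with grammar analysis algorithms. It can segment both Chinese and English text.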
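To give a feel for what dictionary-based segmentation means, here is a minimal sketch of forward maximum matching in plain Java. It is a toy illustration under my own assumptions, not IKAnalyzer's actual algorithm: the class name `MaxMatchSketch`, the four-word dictionary, and the maximum word length of 3 are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy forward-maximum-matching segmenter (hypothetical; NOT IKAnalyzer's real algorithm).
class MaxMatchSketch {

    // At each position, take the longest dictionary word that starts there;
    // fall back to a single character when nothing in the dictionary matches.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Illustrative dictionary; a real one holds many thousands of words.
        Set<String> dictionary = new HashSet<>(Arrays.asList("明天", "就是", "国庆", "国庆节"));
        // prints 明天|就是|国庆节|了
        System.out.println(String.join("|", segment("明天就是国庆节了", dictionary, 3)));
    }
}
```

With an empty dictionary the same method degrades to one token per character, which is roughly the behaviour described above for StandardAnalyzer on Chinese text.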

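Next, let's use it to improve the search. The first step is to download the IKAnalyzer development package. I uploaded the package here (access code: 7j78). I have tested this package myself; other versions I downloaded earlier conflicted with the latest Lucene. The package contains the jar, two configuration files, and the user guide as a PDF. One configuration file sets the extension dictionary and the dictionary of meaningless words (stop words) to filter out; the other is where you write the stop words to filter directly.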
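As a rough illustration of the first configuration file: in the IKAnalyzer 2012 distribution I am familiar with, it is a standard Java XML properties file with the keys `ext_dict` and `ext_stopwords`. Treat the keys and the file names `ext.dic` and `stopword.dic` below as assumptions and check them against the file in your own package. The sketch only reads such a file with the JDK's `Properties` class; the class name `IkConfigSketch` is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Reads an IKAnalyzer-style configuration (assumed format) with the JDK only.
class IkConfigSketch {

    // Assumed layout of the configuration file; verify against your package.
    static final String SAMPLE_CONFIG =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
        + "<properties>\n"
        + "  <comment>IK Analyzer extension configuration</comment>\n"
        + "  <entry key=\"ext_dict\">ext.dic;</entry>\n"
        + "  <entry key=\"ext_stopwords\">stopword.dic;</entry>\n"
        + "</properties>\n";

    // Parses the XML properties text and returns the key/value pairs.
    static Properties load(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load(SAMPLE_CONFIG);
        System.out.println("extension dictionaries: " + props.getProperty("ext_dict"));
        System.out.println("stop-word dictionaries: " + props.getProperty("ext_stopwords"));
    }
}
```

The dictionary files themselves are plain text with one word per line, so adding a domain term or a stop word is just a matter of appending a line.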

The code below tokenizes the same Chinese string with StandardAnalyzer and with IKAnalyzer.

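import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

	// Tokenize with Lucene's built-in StandardAnalyzer
	public static void analysis() {
		Analyzer luceneAnalyzer = new StandardAnalyzer(); // the analyzer (tokenizer)
		try {
			TokenStream tokenStream = luceneAnalyzer.tokenStream("", "明天就是国庆节了");
			CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
			tokenStream.reset();
			// Iterate over the tokens
			while (tokenStream.incrementToken()) {
				System.out.print(term.toString() + "|");
			}
			tokenStream.end(); // signal end of stream before closing
			tokenStream.close();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

	// Tokenize with IKAnalyzer's segmenter
	public static void analysisByIK() throws IOException {
		String text = "明天就是国庆节了";
		// The second argument (true) turns on smart segmentation mode
		IKSegmenter ik = new IKSegmenter(new StringReader(text), true);
		Lexeme lexeme = null;
		while ((lexeme = ik.next()) != null) {
			System.out.print(lexeme.getLexemeText() + "|");
		}
	}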

The output is as follows (StandardAnalyzer first, then IKAnalyzer):

明|天|就|是|国|庆|节|了|

明天|就是|国庆节|了|

As you can see, the IKAnalyzer result is much closer to what we want.

The code below builds the document index with IKAnalyzer in place of the StandardAnalyzer used before.

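import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer;

	public static void buildIndex(String idir, String dDir) throws IOException {
		File indexDir = new File(idir); // directory where the index is stored
		File dataDir = new File(dDir); // directory of the files to index
		Analyzer luceneAnalyzer = new IKAnalyzer(); // the analyzer (tokenizer)
		File[] dataFiles = dataDir.listFiles();
		if (dataFiles == null) {
			throw new IOException("Not a readable directory: " + dataDir.getPath());
		}
		IndexWriterConfig indexConfig = new IndexWriterConfig(Version.LATEST, luceneAnalyzer);
		FSDirectory fsDirectory = null;
		IndexWriter indexWriter = null;
		try {
			fsDirectory = FSDirectory.open(indexDir); // index directory
			indexWriter = new IndexWriter(fsDirectory, indexConfig); // the object that creates the index
			long startTime = new Date().getTime();
			for (int i = 0; i < dataFiles.length; i++) {
				if (dataFiles[i].isFile() && dataFiles[i].getName().endsWith(".txt")) {
					Document document = new Document(); // represents one document
					System.out.println(dataFiles[i].getPath());
					// Note: FileReader decodes with the platform default charset
					Reader txtReader = new FileReader(dataFiles[i]);
					FieldType fieldType = new FieldType();
					fieldType.setIndexed(true); // indexed (and tokenized by default), not stored
					// A Field describes an attribute of the document; this document has two: path and contents
					document.add(new TextField("path", dataFiles[i].getCanonicalPath(), Store.YES));
					document.add(new Field("contents", txtReader, fieldType));
					indexWriter.addDocument(document);
				}
			}
			indexWriter.commit(); // commit the added documents to the index
			long endTime = new Date().getTime();
			System.out.println("It takes " + (endTime - startTime) + " milliseconds to create index for the files in directory " + dataDir.getPath());
		} catch (IOException e) {
			e.printStackTrace();
			try {
				// indexWriter is still null if opening the directory or the writer failed
				if (indexWriter != null) {
					indexWriter.rollback();
				}
			} catch (IOException e1) {
				e1.printStackTrace();
			}
		} finally {
			if (indexWriter != null) {
				indexWriter.close();
			}
		}
	}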
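
To make it clearer why the choice of analyzer matters at indexing time, here is a toy in-memory inverted index in plain Java. It is only a conceptual sketch, not how Lucene stores its index: the class `InvertedIndexSketch`, its methods, and the sample file names are hypothetical, and the token lists are typed in by hand in the shape IKAnalyzer produced above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each token to the set of documents containing it.
class InvertedIndexSketch {
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // Record every token of a document under the document's path.
    void addDocument(String path, List<String> tokens) {
        for (String token : tokens) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(path);
        }
    }

    // Exact-term lookup: only tokens that were indexed can be found.
    Set<String> search(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        // Hand-written tokens, as a word-level analyzer would emit them.
        index.addDocument("a.txt", Arrays.asList("明天", "就是", "国庆节", "了"));
        index.addDocument("b.txt", Arrays.asList("国庆节", "放假"));
        System.out.println(index.search("国庆节")); // prints [a.txt, b.txt]
        System.out.println(index.search("国"));     // prints [] (no such token was indexed)
    }
}
```

The terms stored in the index are exactly the tokens the analyzer emits, so the query text should be analyzed with the same analyzer that was used to build the index.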

Time: 2024-10-10 23:44:34
