Implementing a Custom Analyzer in Lucene (Synonym Search and Highlighting)

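Today we implement a simple analyzer, for demonstration purposes only. It does the following:

1. Splits text on spaces, hyphens, and periods;

2. Supports synonym search between hi and hello;

3. Highlights the hi/hello synonyms in search results.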
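Before looking at the Lucene classes, the splitting and synonym rules can be sketched in plain Java with no Lucene dependency. This is only an illustration of the idea, not the code used later in this post: `SplitSketch`, `Token`, and `tokenize` are hypothetical names, and the sample sentence is made up. Each token records the span of the source text it covers, and a synonym reuses the span of the word it stands for.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    /** One token: its text plus the [start, end) span it covers in the source string. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "[" + start + "," + end + ")";
        }
    }

    /** Splits on space, hyphen, and period; optionally adds the hi/hello synonym. */
    static List<Token> tokenize(String text, boolean expandSynonyms) {
        List<Token> out = new ArrayList<>();
        int begin = 0;
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            if (!atEnd) {
                char c = text.charAt(i);
                if (c != ' ' && c != '-' && c != '.') continue; // still inside a word
            }
            if (i > begin) { // skip empty spans such as two delimiters in a row
                String term = text.substring(begin, i);
                out.add(new Token(term, begin, i));
                if (expandSynonyms) {
                    // The synonym points at the same source span as the original word.
                    if (term.equals("hi")) out.add(new Token("hello", begin, i));
                    else if (term.equals("hello")) out.add(new Token("hi", begin, i));
                }
            }
            begin = i + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, chosen to exercise all three rules.
        System.out.println(tokenize("I am hi China-Chinese", true));
    }
}
```

Keeping the original span on the synonym is what later allows a highlighter to mark the word that really appears in the text.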

MyAnalyzer implementation:

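import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;

public class MyAnalyzer extends Analyzer {
	// 0 = also emit hi/hello synonyms; any other value = plain splitting only
	private final int analyzerType;

	public MyAnalyzer(int type) {
		super();
		analyzerType = type;
	}

	// The (fieldName, Reader) signature is the Lucene 4.x Analyzer API.
	@Override
	protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
		MyTokenizer tokenizer = new MyTokenizer(fieldName, reader, analyzerType);
		return new TokenStreamComponents(tokenizer);
	}
}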

MyTokenizer implementation:

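import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;

public class MyTokenizer extends Tokenizer {
	// One token: its text plus the span of the input it maps back to.
	public class WordUnit{
		WordUnit(String word, int start, int length){
			this.word = word;
			this.start = start;
			this.length = length;
		}

		String word;  // term text to emit
		int start;    // start offset in the input
		int length;   // length of the covered input span (may differ from word.length())
	}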

	private int analyzerType;  // 0 = also emit hi/hello synonyms
	private int endPosition;   // end offset of the last token returned
	private Iterator<WordUnit> it;
	private ArrayList<WordUnit> words;

	private final CharTermAttribute termAtt;
	private final OffsetAttribute offsetAtt;

	// fieldName is accepted but not used. Tokenizer(Reader) is the Lucene 4.x constructor.
	public MyTokenizer(String fieldName, Reader in, int type) {
		super(in);

		it = null;
		endPosition = 0;
		analyzerType = type;
		offsetAtt = addAttribute(OffsetAttribute.class);
		termAtt = addAttribute(CharTermAttribute.class);
		addAttribute(PayloadAttribute.class); // registered, but never populated in this demo
	}

	@Override
	public boolean incrementToken() throws IOException {
		clearAttributes();

		// First call after reset(): read the input and split it into words.
		if(it == null) {
			// Demo limitation: only what a single read() returns (at most 1024 chars) is tokenized.
			char[] inputBuf = new char[1024];
			int bufSize = input.read(inputBuf);
			if(bufSize <= 0) return false;

			int beginIndex = 0;
			int endIndex = 0;
			words = new ArrayList<WordUnit>();
			for(endIndex = 0; endIndex < bufSize; endIndex++) {
				// Keep scanning until a delimiter: hyphen, space, or period.
				if(inputBuf[endIndex] != '-' && inputBuf[endIndex] != ' ' && inputBuf[endIndex] != '.') continue;

				addWord(inputBuf, beginIndex, endIndex);
				beginIndex = endIndex + 1;
			}
			addWord(inputBuf, beginIndex, endIndex);//add the last

			if(words.isEmpty()) return false;
			it = words.iterator();
		}

		// Emit the next buffered word, if any.
		if(it != null && it.hasNext()){
			WordUnit word = it.next();
			termAtt.append(word.word);
			termAtt.setLength(word.word.length());

			// Offsets use the recorded span, so a synonym maps back to the original word.
			endPosition = word.start + word.length;
			offsetAtt.setOffset(word.start, endPosition);

			return true;
		}

		return false;
	}

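	@Override
	public void reset() throws IOException {
		super.reset();

		// Force the next incrementToken() call to read and split the input again.
		it = null;
		endPosition = 0;
	}

	@Override
	public final void end() throws IOException {
		super.end(); // let the base class do its end-of-stream handling first
		// Report the final offset: the end of the last token emitted.
		int finalOffset = correctOffset(this.endPosition);
		offsetAtt.setOffset(finalOffset, finalOffset);
	}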

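	// Adds the word in inputBuf[begin, end) and, for analyzerType 0, its hi/hello synonym.
	private void addWord(char[] inputBuf, int begin, int end){
		if(end <= begin) return; // empty span, e.g. two delimiters in a row

		String word = new String(inputBuf, begin, end - begin);
		words.add(new WordUnit(word, begin, end - begin));

		// The synonym reuses the original word's span (length 2 for "hi", 5 for "hello"),
		// so its offsets point at the text that actually appears in the document.
		// Note: no position increment of 0 is set, so the synonym takes its own position.
		if(analyzerType == 0 && word.equals("hi")) words.add(new WordUnit("hello", begin, 2));
		if(analyzerType == 0 && word.equals("hello")) words.add(new WordUnit("hi", begin, 5));
	}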
}

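Analyzer type when indexing: analyzerType=0;

Analyzer type when searching: analyzerType=1;

Analyzer type when highlighting: analyzerType=0.

With type 0 the index holds both hi and hello wherever either word occurs, so the query itself needs no synonym expansion and type 1 is enough. Highlighting analyzes the document text again, so it has to use type 0 as well; otherwise it would never see a hello token for a document that only contains hi.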

Searching for hello gives the following result:

Score doc 0 hightlight to: look <em>hello</em> on
Score doc 1 hightlight to: I am <em>hi</em> China Chinese

As the output shows, the document containing hi is also found, and hi is highlighted as well.
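Why the highlight lands on hi when the query is hello can be reproduced in miniature without Lucene. The sketch below is a plain-Java simulation under stated assumptions, not Lucene's highlighter: `HighlightSketch`, `synonymOf`, and `highlight` are hypothetical names. It applies the same splitting rule, treats a word as a hit when the word itself or (with expansion on) its synonym equals the query term, and wraps the source span of that word in `<em>` tags.

```java
final class HighlightSketch {
    /** Returns the hi/hello counterpart of a term, or null if it has none. */
    static String synonymOf(String term) {
        if (term.equals("hi")) return "hello";
        if (term.equals("hello")) return "hi";
        return null;
    }

    static boolean isDelimiter(char c) {
        return c == ' ' || c == '-' || c == '.';
    }

    /**
     * Wraps every word that matches queryTerm in <em> tags.
     * With expandSynonyms = true (the "type 0" behavior) a word also matches
     * when its synonym equals the query term; the tags still go around the
     * word as it appears in the text, because the synonym shares its span.
     */
    static String highlight(String text, String queryTerm, boolean expandSynonyms) {
        StringBuilder out = new StringBuilder();
        int begin = 0;
        for (int i = 0; i <= text.length(); i++) {
            boolean atEnd = i == text.length();
            if (!atEnd && !isDelimiter(text.charAt(i))) continue; // still inside a word
            if (i > begin) {
                String term = text.substring(begin, i);
                boolean hit = term.equals(queryTerm)
                        || (expandSynonyms && queryTerm.equals(synonymOf(term)));
                if (hit) out.append("<em>").append(term).append("</em>");
                else out.append(term);
            }
            if (!atEnd) out.append(text.charAt(i)); // copy the delimiter unchanged
            begin = i + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // With synonym expansion the word "hi" is marked for the query "hello".
        System.out.println(highlight("I am hi China Chinese", "hello", true));
        // Without expansion nothing in this sentence matches.
        System.out.println(highlight("I am hi China Chinese", "hello", false));
    }
}
```

The second call shows what would happen if highlighting used the non-expanding analyzer: the document would still be found through the index, but no word in it would be marked.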

Time: 2024-10-13 19:38:23
