The Lucene Analysis (Tokenization) Workflow

I spent this past week studying Lucene/Solr, so today I want to write up a proper summary and record the important points, so that I won't be completely lost when I come back to this later.

This article gives a brief introduction to the analysis (tokenization) workflow in Lucene and explains some of the basic principles behind it. If anything here is inaccurate, corrections from readers are greatly appreciated!

(1) The main analyzers

The common built-in analyzers are WhitespaceAnalyzer, StopAnalyzer, SimpleAnalyzer, and KeywordAnalyzer. They all share the same parent class, Analyzer, which declares an abstract method called tokenStream:

package org.apache.lucene.analysis;

import java.io.Reader;
import java.io.IOException;
import java.io.Closeable;
import java.lang.reflect.Modifier;

import org.apache.lucene.util.CloseableThreadLocal;
import org.apache.lucene.store.AlreadyClosedException;

import org.apache.lucene.document.Fieldable;

/** An Analyzer builds TokenStreams, which analyze text.  It thus represents a
 *  policy for extracting index terms from text.
 *  <p>
 *  Typical implementations first build a Tokenizer, which breaks the stream of
 *  characters from the Reader into raw Tokens.  One or more TokenFilters may
 *  then be applied to the output of the Tokenizer.
 * <p>The {@code Analyzer}-API in Lucene is based on the decorator pattern.
 * Therefore all non-abstract subclasses must be final or their {@link #tokenStream}
 * and {@link #reusableTokenStream} implementations must be final! This is checked
 * when Java assertions are enabled.
 */
public abstract class Analyzer implements Closeable {
  // ..... only the key part of the class is shown here

  /** Creates a TokenStream which tokenizes all the text in the provided
   * Reader.  Must be able to handle null field name for
   * backward compatibility.
   */
  public abstract TokenStream tokenStream(String fieldName, Reader reader);

 }

TokenStream has two direct subclasses, Tokenizer and TokenFilter. As the names suggest, a Tokenizer turns a piece of text into individual tokens (the Analyzer hands it the Reader), and its output is then run through a chain of TokenFilters until the final TokenStream is produced, as shown in the class diagram below.

The next diagram lists some of the common Tokenizers.
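To make this pipeline concrete, here is a minimal sketch of how the analyzers listed above are used (assuming a Lucene 3.x core jar matching the snippets in this post; the class name AnalyzerDemo and the field name "content" are placeholders of my own):

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {

  // Prints every token one analyzer produces for the given text.
  static void display(String text, Analyzer analyzer) throws IOException {
    TokenStream stream = analyzer.tokenStream("content", new StringReader(text));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {   // advance to the next token
      System.out.print("[" + term + "] ");
    }
    stream.end();
    stream.close();
    System.out.println("  <= " + analyzer.getClass().getSimpleName());
  }

  public static void main(String[] args) throws IOException {
    String text = "How are you, Thank You";
    display(text, new WhitespaceAnalyzer(Version.LUCENE_36));
    display(text, new SimpleAnalyzer(Version.LUCENE_36));
    display(text, new StopAnalyzer(Version.LUCENE_36));
  }
}

For this input, WhitespaceAnalyzer keeps the original casing and leaves the comma attached to "you,", SimpleAnalyzer lower-cases and splits on non-letter characters, and StopAnalyzer additionally drops stop words such as "are".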

Now let's briefly walk through the flow of SimpleAnalyzer. Here is its source code:

public final class SimpleAnalyzer extends ReusableAnalyzerBase {

  private final Version matchVersion;

  /**
   * Creates a new {@link SimpleAnalyzer}
   * @param matchVersion Lucene version to match See {@link <a href="#version">above</a>}
   */
  public SimpleAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  /**
   * Creates a new {@link SimpleAnalyzer}
   * @deprecated use {@link #SimpleAnalyzer(Version)} instead
   */
  @Deprecated  public SimpleAnalyzer() {
    this(Version.LUCENE_30);
  }
  @Override
  protected TokenStreamComponents createComponents(final String fieldName,
      final Reader reader) {
    // Overrides the parent hook that builds the TokenStream and plugs in a
    // LowerCaseTokenizer, which is why all letters end up lower-cased.
    return new TokenStreamComponents(new LowerCaseTokenizer(matchVersion, reader));
  }
}

As the class diagram above shows, these tokenizers all share a parent class called CharTokenizer. What does this class do? It is responsible for splitting the character stream into tokens. TokenStream declares a method called incrementToken:

public abstract class TokenStream extends AttributeSource implements Closeable {

  /**
   * Consumers (i.e., {@link IndexWriter}) use this method to advance the stream to
   * the next token. Implementing classes must implement this method and update
   * the appropriate {@link AttributeImpl}s with the attributes of the next
   * token.
   * <P>
   * The producer must make no assumptions about the attributes after the method
   * has been returned: the caller may arbitrarily change it. If the producer
   * needs to preserve the state for subsequent calls, it can use
   * {@link #captureState} to create a copy of the current attribute state.
   * <p>
   * This method is called for every token of a document, so an efficient
   * implementation is crucial for good performance. To avoid calls to
   * {@link #addAttribute(Class)} and {@link #getAttribute(Class)},
   * references to all {@link AttributeImpl}s that this stream uses should be
   * retrieved during instantiation.
   * <p>
   * To ensure that filters and consumers know which attributes are available,
   * the attributes must be added during instantiation. Filters and consumers
   * are not required to check for availability of attributes in
   * {@link #incrementToken()}.
   *
   * @return false for end of stream; true otherwise
   */
  public abstract boolean incrementToken() throws IOException;

}

"Consumers (i.e., IndexWriter) use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token."

The Javadoc already makes this clear: every subclass must implement this method to advance to the next token, returning a boolean that indicates whether another token is available. Below is CharTokenizer's implementation of incrementToken:

 @Override
  public final boolean incrementToken() throws IOException {
    clearAttributes();
    if(useOldAPI) // TODO remove this in LUCENE 4.0
      return incrementTokenOld();
    int length = 0;
    int start = -1; // this variable is always initialized
    char[] buffer = termAtt.buffer();
    while (true) {
      if (bufferIndex >= dataLen) {
        offset += dataLen;
        if(!charUtils.fill(ioBuffer, input)) { // read supplementary char aware with CharacterUtils
          dataLen = 0; // so next offset += dataLen won't decrement offset
          if (length > 0) {
            break;
          } else {
            finalOffset = correctOffset(offset);
            return false;
          }
        }
        dataLen = ioBuffer.getLength();
        bufferIndex = 0;
      }
      // use CharacterUtils here to support < 3.1 UTF-16 code unit behavior if the char based methods are gone
      final int c = charUtils.codePointAt(ioBuffer.getBuffer(), bufferIndex);
      bufferIndex += Character.charCount(c);

      if (isTokenChar(c)) {               // if it's a token char
        if (length == 0) {                // start of token
          assert start == -1;
          start = offset + bufferIndex - 1;
        } else if (length >= buffer.length-1) { // check if a supplementary could run out of bounds
          buffer = termAtt.resizeBuffer(2+length); // make sure a supplementary fits in the buffer
        }
        length += Character.toChars(normalize(c), buffer, length); // buffer it, normalized
        if (length >= MAX_WORD_LEN) // buffer overflow! make sure to check for >= surrogate pair could break == test
          break;
      } else if (length > 0)             // at non-Letter w/ chars
        break;                           // return 'em
    }

    termAtt.setLength(length);
    assert start != -1;
    offsetAtt.setOffset(correctOffset(start), finalOffset = correctOffset(start+length));
    return true;

  }
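In other words, this generic incrementToken() does all of the buffering and offset bookkeeping; a concrete CharTokenizer subclass only decides which code points belong to a token and how to normalize them. As a hedged sketch (the class name is hypothetical, but isTokenChar and normalize are the real extension points in this 3.x API), a LowerCaseTokenizer-style tokenizer needs nothing more than this:

import java.io.Reader;

import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.util.Version;

// Keeps only letter code points and lower-cases them, which is essentially
// what LowerCaseTokenizer itself does.
public class MyLowerCaseLetterTokenizer extends CharTokenizer {

  public MyLowerCaseLetterTokenizer(Version matchVersion, Reader in) {
    super(matchVersion, in);
  }

  @Override
  protected boolean isTokenChar(int c) {
    return Character.isLetter(c);      // token characters: letters only
  }

  @Override
  protected int normalize(int c) {
    return Character.toLowerCase(c);   // normalization applied to every kept code point
  }
}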

So in SimpleAnalyzer the stream comes straight from a single Tokenizer, with no filter applied. For comparison, look at StopAnalyzer's createComponents method and the TokenStreamComponents it builds:

 @Override
  protected TokenStreamComponents createComponents(String fieldName,
      Reader reader) {
    final Tokenizer source = new LowerCaseTokenizer(matchVersion, reader); // the stream goes through a LowerCaseTokenizer and then a StopFilter
    return new TokenStreamComponents(source, new StopFilter(matchVersion,
          source, stopwords));
  }

Only after passing through these components does the text finally become a complete TokenStream. So which filters are commonly used?

Common filters include StopFilter, LowerCaseFilter, and so on; the stream coming out of the Tokenizer is passed through these filters before it becomes the final TokenStream.
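A custom filter follows the same pattern: it wraps another TokenStream and decides inside incrementToken() which tokens to pass through, rewrite, or drop. Below is a hedged sketch of a length filter (the class name MinLengthFilter is made up; Lucene ships a similar LengthFilter) that discards tokens shorter than a minimum length:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class MinLengthFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final int minLength;

  public MinLengthFilter(TokenStream input, int minLength) {
    super(input);
    this.minLength = minLength;
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {        // pull tokens from the wrapped stream
      if (termAtt.length() >= minLength) {  // keep only tokens that are long enough
        return true;
      }
    }
    return false;                           // wrapped stream is exhausted
  }
}

It would be wired up exactly like the StopFilter above, e.g. new TokenStreamComponents(source, new MinLengthFilter(source, 2)).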

(2) How token information is stored

Three classes matter here: CharTermAttribute (stores the term text itself), OffsetAttribute (stores the term's character offsets), and PositionIncrementAttribute (stores the position increment between consecutive terms).

With these three attributes you can reconstruct exactly where each term sits in a document. For example, the sentence "how are you thank you" is really stored in Lucene as a sequence of terms, each carrying its start/end offsets ("how" spans offsets 0 to 3, and so on) and a position increment relative to the previous term.
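The following sketch prints all three attributes for that sentence (again assuming the 3.x API; the class name AttributeDemo is a placeholder), so you can see the offsets and position increments directly:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;

public class AttributeDemo {
  public static void main(String[] args) throws IOException {
    TokenStream stream = new SimpleAnalyzer(Version.LUCENE_36)
        .tokenStream("content", new StringReader("how are you thank you"));
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = stream.addAttribute(OffsetAttribute.class);
    PositionIncrementAttribute posIncr = stream.addAttribute(PositionIncrementAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
      // term text, [startOffset-endOffset], +positionIncrement
      System.out.println(term + " [" + offset.startOffset() + "-" + offset.endOffset()
          + "] +" + posIncr.getPositionIncrement());
    }
    stream.end();
    stream.close();
  }
}

For this sentence it prints how [0-3] +1, are [4-7] +1, you [8-11] +1, thank [12-17] +1, you [18-21] +1.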

All of these attributes are managed by a class called AttributeSource, which holds the attribute instances. It has a static inner class named State that stores the current state of the stream's attributes, and later on we can capture that state with the following method:

  /**
   * Captures the state of all Attributes. The return value can be passed to
   * {@link #restoreState} to restore the state of this or another AttributeSource.
   */
  public State captureState() {
    final State state = this.getCurrentState();
    return (state == null) ? null : (State) state.clone();
  }
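For example, captureState/restoreState plus PositionIncrementAttribute are enough to inject synonyms at the same position as the original token. Here is a hedged sketch of that idea (this is not Lucene's bundled synonym filter; the class name and the map-based lookup are illustrative only):

import java.io.IOException;
import java.util.Map;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class SimpleSynonymFilter extends TokenFilter {

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

  private final Map<String, String[]> synonyms;   // term -> its synonyms
  private final Stack<String> pending = new Stack<String>();
  private AttributeSource.State savedState;       // captured state of the original token

  public SimpleSynonymFilter(TokenStream input, Map<String, String[]> synonyms) {
    super(input);
    this.synonyms = synonyms;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {                  // emit a buffered synonym first
      restoreState(savedState);                // reuse the original token's offsets
      termAtt.setEmpty().append(pending.pop());
      posIncrAtt.setPositionIncrement(0);      // same position as the original token
      return true;
    }
    if (!input.incrementToken()) {
      return false;                            // underlying stream is exhausted
    }
    String[] syns = synonyms.get(termAtt.toString());
    if (syns != null) {
      for (String s : syns) {
        pending.push(s);
      }
      savedState = captureState();             // remember this token's attributes
    }
    return true;                               // emit the original token unchanged
  }
}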

As the sketch above suggests, once we have this positional information for each token we can do a lot with it: inject synonyms (a new token that reuses the original token's offsets and occupies the same position, i.e. has a position increment of 0), strip sensitive words, and so on. That wraps up this first summary.

If you repost this article, please credit http://blog.csdn.net/a837199685/article
