solr的schema官方介绍

SchemaXml - Solr Wiki

Search:

Solr Wiki

Login

SchemaXml


  • Immutable Page
  • Comments
  • Info
  • Attachments
  • More Actions:
    Raw Text
    Print View
    Render as Docbook
    Delete Cache
    ------------------------
    Check Spelling
    Like Pages
    Local Site Map
    ------------------------
    Rename Page
    Copy Page
    Delete Page
    ------------------------
    My Pages
    Subscribe User
    ------------------------
    Remove Spam
    Revert to this revision
    Package Pages
    Sync Pages
    ------------------------
    Load
    Save
    SlideShow

The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields.

Analysis for Multiterm queries can be separately specified, see: Multiterm Query Analysis, which handles automatically lowercasing wildcard queries under most circumstances. Solr3.6 Solr4.0

A sample Solr schema.xml with detailed comments can be found in the Source Repository.

Contents

  1. Data Types
  2. Fields
    1. Recommended fields
    2. Common field options
    3. Dynamic fields
    4. Indexing same data in multiple fields
    5. Expert field options
  3. Miscellaneous Settings
    1. The Unique Key Field
    2. The Default Search Field
    3. Default query parser operator
    4. Copy Fields
    5. Similarity
    6. Poly Field Types
    7. Schema version attribute in the root node
    8. TODO

Data Types

The <types> section allows you to define a list of <fieldtype> declarations you wish to use in your schema, along with the underlying Solr class that should be used for that type, as well as the default options you want for fields that use that type.

Any subclass of FieldType may be used as a field type class, using either its full package name, or the "solr" alias if it is in the default Solr package. For common numeric types (integer, float, etc...) there are multiple implementations provided depending on your needs, please see SolrPlugins for information on how to ensure that your own custom Field Types can be loaded into Solr.

  • Common options that field types can have are...

    • sortMissingLast=true|false
    • sortMissingFirst=true|false
    • indexed=true|false
    • stored=true|false
    • multiValued=true|false
    • omitNorms=true|false
    • omitTermFreqAndPositions=true|false Solr1.4
    • omitPositions=true|false Solr3.4
    • positionIncrementGap=N
    • autoGeneratePhraseQueries=true|false (in schema version 1.4 and later this now defaults to false)
    • postingsFormat=<name of a postings format> Solr4.0, only works if you use a codec factory that is schema-aware such as SchemaCodecFactory. Please note that the postings formats used in a fieldType definition need to be in any of Solr lib directories. (For example, some useful (but unsupported) postings formats are available in the lucene-codecs JAR.). For detailed instructions on how to configure SimpleTextCodec, see: SimpleTextCodec Example

TextFields can also support Analyzers with highly configurable Tokenizers and Token Filters.

Field types that store text (TextField, StrField) support compression of stored contents:

  • compressed=true|false
  • compressThreshold=<integer>

compression support was removed in 1.4.1. There are (untested) patches for 3.x in https://issues.apache.org/jira/browse/SOLR-752.

compressThreshold is the minimum length required for text compression to be invoked. This applies only if compressed=true; a common pattern is to set compressThreshold on the field type definition, and turn compression on and off in the individual field definitions.

Fields

The <fields> section is where you list the individual <field> declarations you wish to use in your documents. Each <field> has a name that you will use to reference it when adding documents or executing searches, and an associated type which identifies the name of the fieldtype you wish to use for this field. There are various field options that apply to a field. These can be set in the field type declarations, and can also be overridden at an individual field‘s declaration. Field names should consist of alphanumeric or underscore characters only and not start with a digit. This is not currently strictly enforced, but other field names will not have first class support from all components and back compatibility is not guaranteed. Names with both leading and trailing underscores (e.g. _version_) are reserved.

Recommended fields

While these fields aren‘t strictly mandatory (Solr will run if you remove them fully), Bad Things happen in some situations if they aren‘t defined. We recommend that you leave these fields alone. If you don‘t use them, there‘s no appreciable penalty.

  • id - Almost all Solr installations have this field defined as the <uniqueKey> (see below).
  • _version_ Solr4.0 - This field is used for optimistic locking in SolrCloud and it enables Real Time Get. If you remove it you must also remove the transaction logging from solrconfig.xml, see Real Time Get.

Common field options

Common options that fields can have are...

  • default

    • The default value for this field if none is provided while adding documents
  • indexed=true|false
    • True if this field should be "indexed". If (and only if) a field is indexed, then it is searchable, sortable, and facetable.
  • stored=true|false
    • True if the value of the field should be retrievable during a search, or if you‘re using highlighting or MoreLikeThis.
  • compressed=true|false
    • True if this field should be stored using gzip compression. (This will only apply if the field type is compressible; among the standard field types, only TextField and StrField are.)
  • compressThreshold=<integer>
  • multiValued=true|false
    • True if this field may contain multiple values per document, i.e. if it can appear multiple times in a document
  • omitNorms=true|false
    • This is arguably an advanced option.
    • Set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms.
  • termVectors=false|true <?> Solr 1.1
    • If set, include full term vector info.
    • If enabled, often also used with termPositions="true" and termOffsets="true".
    • To use interactively, requires TermVectorComponent
    • Corresponds to TV button in Luke, and V field attribute.
  • omitTermFreqAndPositions=true|false Solr1.4
    • If set, omit term freq, positions and payloads from postings for this field. This can be a performance boost for fields that don‘t require that information and reduces storage space required for the index. Queries that rely on position that are issued on a field with this option fail with an exception. Prior to Solr4.0 the queries would silently fail to find documents.
  • omitPositions=true|false Solr3.4
    • If set, omits positions, but keeps term frequencies

See also FieldOptionsByUseCase, which discusses how these options should be set in various circumstances. See SolrPerformanceFactors for how different options can affect Solr performance.

Dynamic fields

One of the powerful features of Lucene is that you don‘t have to pre-define every field when you first create your index. Even though Solr provides strong datatyping for fields, it still preserves that flexibility using "Dynamic Fields". Using <dynamicField> declarations, you can create field rules that Solr will use to understand what datatype should be used whenever it is given a field name that is not explicitly defined, but matches a prefix or suffix used in a dynamicField.

For example the following dynamic field declaration tells Solr that whenever it sees a field name ending in "_i" which is not an explicitly defined field, then it should dynamically create an integer field with that name...

    <dynamicField name="*_i"  type="integer"  indexed="true"  stored="true"/>

The glob-like pattern in the name attribute must have a "*" only at the start or the end. Longer patterns will be matched first. if equal size patterns both match, the first appearing in the schema will be used.

Indexing same data in multiple fields

Note that, with textual data, it will often make sense to take what‘s logically speaking a single field (e.g. product name) and index it into several different Solr fields, each with different field options and/or analyzers.

As an example, if I had a field with a list of authors, such as:

  • Schildt, Herbert; Wolpert, Lewis; Davies, P.

I might want to index the same data differently in three different fields (perhaps using the Solr copyField directive):

  • For searching: Tokenized, case-folded, punctuation-stripped:

    • schildt / herbert / wolpert / lewis / davies / p
  • For sorting: Untokenized, case-folded, punctuation-stripped:
    • schildt herbert wolpert lewis davies p
  • For faceting: Primary author only, using a solr.StringField:
    • Schildt, Herbert

(See also SolrFacetingOverview.)

Expert field options

The storage of Lucene term vectors can be triggered using the following field options:

  • termVectors=true|false
  • termPositions=true|false
  • termOffsets=true|false

These options can be used to accelerate highlighting and other ancillary functionality, but impose a substantial cost in terms of index size. They are not necessary for typical uses of Solr (phrase queries, etc., do not require these settings to be present).

Miscellaneous Settings

In addition to the <types> and <fields> sections of the schema, there are several other declarations that can appear in your schema.

The Unique Key Field

The <uniqueKey> declaration can be used to inform Solr that there is a field in your index which should be unique for all documents. If a document is added that contains the same value for this field as an existing document, the old document will be deleted.

It is not mandatory for a schema to have a uniqueKey field, but an overwhelming majority of them do. It shouldn‘t matter whether you rename this to something else (and change the <uniqueKey> value), but occasionally it has in the past. We recommend that you just leave this definition alone.

Note that if you have enabled the QueryElevationComponent in solrconfig.xml it requires the schema to have a uniqueKey of type StrField. It cannot be, for example, an int field.

The Default Search Field

The <defaultSearchField> is used by Solr when parsing queries to identify which field name should be searched in queries where an explicit field name has not been used. It is preferable to not use or rely on this setting; instead the request handler or query LocalParams for a search should specify the default field(s) to search on. This setting here can be omitted and it is being considered for deprecation.

Default query parser operator

The default operator used by Solr‘s query parser (SolrQueryParser) can be configured with <solrQueryParser defaultOperator="AND|OR"/>. The default operator is "OR" if unspecified. It is preferable to not use or rely on this setting; instead the request handler or query LocalParams should specify the default operator. This setting here can be omitted and it is being considered for deprecation.

Copy Fields

Any number of <copyField> declarations can be included in your schema, to instruct Solr that you want it to duplicate any data it sees in the "source" field of documents that are added to the index, in the "dest" field of that document. You are responsible for ensuring that the datatypes of the fields are compatible. The original text is sent from the "source" field to the "dest" field, before any configured analyzers for the originating or destination field are invoked.

This is provided as a convenient way to ensure that data is put into several fields, without needing to include the data in the update command multiple times. The copy is done at the stream source level and no copy feeds into another copy. The maxChars property may be used in a copyField declaration. This simply limits the number of characters copied. For example:

 <copyField source="body" dest="teaser" maxChars="300"/>

A common requirement is to copy or merge all input fields into a single solr field. This can be done as follows:-

 <copyField source="*" dest="text"/>

You can also automatically generate new field names by including an asterisk in both the source and destination fields. For example, if you have the following copyField directive:

 <copyField source="*_t" dest="*_t_facet" />

and then submit a field called author_t, the field‘s value will also be copied to another field called author_t_facet, where the word "author" was matched by the original asterisk in the source attribute, and then that pattern‘s match text was used to generate the destination field name, via the asterisk in the destination attribute *_t_facet, which serves as a field name template.

See also Copying Fields in the new Apache Solr Reference Guide

Similarity

A (global) <similarity> declaration can be used to specify a custom Similarity implementation that you want Solr to use when dealing with your index. A Similarity can be specified either by referring directly to the name of a class with a no-arg constructor...

<similarity class="org.apache.lucene.search.similarities.DefaultSimilarity"/>

...or by referencing a SimilarityFactory implementation, which may take optional init params....

<similarity class="solr.DFRSimilarityFactory">
  <str name="basicModel">P</str>
  <str name="afterEffect">L</str>
  <str name="normalization">H2</str>
  <float name="c">7</float>
</similarity>

Begining with Solr4.0, Similarity factories such as SchemaSimilarityFactory can also support specifying specific Similarity implementations on individual field types...

<types>
  <fieldType name="text_dfr" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <similarity class="solr.DFRSimilarityFactory">
      <str name="basicModel">I(F)</str>
      <str name="afterEffect">B</str>
      <str name="normalization">H2</str>
    </similarity>
  </fieldType>
  <fieldType name="text_ib" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
    <similarity class="solr.IBSimilarityFactory">
      <str name="distribution">SPL</str>
      <str name="lambda">DF</str>
      <str name="normalization">H2</str>
    </similarity>
  </fieldType>
  ...
</types>
<similarity class="solr.SchemaSimilarityFactory"/>

If no (global) <similarity> is configured in the schema.xml file, an implicit instance of DefaultSimilarityFactory is used.

Poly Field Types

Some FieldTypes can be "poly" field types. A Poly FieldType is one that can potentially create multiple Fields per "declared" field. Some examples include the LatLonType and PointType, both which use multiple indexed fields internally to represent what the user sees as a single value (e.g. "35.9,-79.0"). Another example is CurrencyField which indexes the value and currency symbol separately (e.g. "10,USD").

See the example schema, SpatialSearch and CurrencyField for more info on using these field types.

Schema version attribute in the root node

For the up-to-date documentation, see example example schema shipped with Solr

<schema name="example" version="1.5">
  <!-- attribute "name" is the name of this schema and is only used for display purposes.
       version="x.y" is Solr‘s version number for the schema syntax and
       semantics.  It should not normally be changed by applications.

       1.0: multiValued attribute did not exist, all fields are multiValued
            by nature
       1.1: multiValued attribute introduced, false by default
       1.2: omitTermFreqAndPositions attribute introduced, true by default
            except for text fields.
       1.3: removed optional field compress feature
       1.4: autoGeneratePhraseQueries attribute introduced to drive QueryParser
            behavior when a single string produces multiple tokens.  Defaults
            to off for version >= 1.4
       1.5: omitNorms defaults to true for primitive field types
            (int, float, boolean, string...)
     -->

See also Upgrade / Migrate Solr 3.x to Solr 4

TODO

  • Perhaps make a DTD for the schema.
  • Talk about omitNorms and positionIncrementGap wrt text fields
  • check whether defaultSearchField is also used by the DisMaxRequestHandler and not only by the StandardRequestHandler

SchemaXml (last edited 2014-08-25 23:11:51 by MarkBennett)

  • Immutable Page
  • Comments
  • Info
  • Attachments
  • More Actions:
    Raw Text
    Print View
    Render as Docbook
    Delete Cache
    ------------------------
    Check Spelling
    Like Pages
    Local Site Map
    ------------------------
    Rename Page
    Copy Page
    Delete Page
    ------------------------
    My Pages
    Subscribe User
    ------------------------
    Remove Spam
    Revert to this revision
    Package Pages
    Sync Pages
    ------------------------
    Load
    Save
    SlideShow
时间: 2024-08-15 14:36:08

solr的schema官方介绍的相关文章

solr的schema.xml配置介绍

schema.xml配置介绍如下: 常见的元素有以下几种: <field name="weight" type="float" indexed="true" stored="true"/> <dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/&

Solr:Schema设计

Solr将数据以结构化的方式存入系统中,存储的过程中可以对数据建立索引,这个结构的定义就是通过schema.xml来配置的. <?xml version="1.0" encoding="UTF-8" ?> <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file

solr的schema.xml配置属性解释

schema.xml做什么? SOLR加载数据,创建索引和数据时,核心数据结构的配置文件是schema.xml,该配置文件主要用于配置数据源,字段类型定义,搜索类型定义等.schema.xml的配置直接影响搜索结果的准确性与效率. <types></types>节点 types节点主要用于搜索类型的定义,这里给出常用类型的定义. 1 <fieldType name="string" class="solr.StrField" sortM

GitHub官方介绍

官方链接地址 http://guides.github.com/activities/hello-world/ Hello World 项目在计算机编程界是一项历史悠久的传统.当你开始学习一些新的东西时,这个项目是一项简单的练习.让我们开始用GitHub开始吧! 你可以学到怎样去做: 创造并使用一个储存库 开始并管理一个新的分支 对一个文件进行改动并且把他们推送到GitHub作为提交 打开并合并一个提取请求 什么是GitHub? GitHub是一个版本控制和协作的代码管理平台.它可以让你和他人在

solr配置-Schema.xml

可参考配置:http://wiki.apache.org/solr/SchemaXml(基本上文档上面讲的已经很详细了) 先来看一下Schema.xml都有什么配置 1,uniqueKey 2,n多name不一样的fieldType 3,各种field :field,dynamicField,copyField 4,默认被注释掉的defaultSearchField,solrQueryParser,Similarity 下面来看一下具体都什么意思: 1,uniqueKey:文档的唯一标识.唯一键

Solr中schema.xml的解释

接Solr-4.10.2与Tomcat整合.schema.xml位于D:\solr\data\solr\collection1\conf\中.1.fieldType节点    name: FieldType的名称    class: 指向org.apache.solr.analysis包里面对应的class名称,用来定义这个类型的行为    omitNorms: 字段检索时被省略相关的规范    positionIncrementGap:定义在同一个文档中此类型数据的空白间隔,避免短语匹配错误 

3 Solr配置文件 schema.xml

1 添加自己的分词器(mmseg4j) 意思是textCommplex 这个类型,用的是 com.chenlb.mmseg4j.solr.MMSegTokenizerFactory 这个分词器,词库是用到的solr.home目录下面的dic目录, 但是mmseg4j.jar 1.9 把词库包进去了,想要用外面的,需要把里面的删除掉, <filter class="solr.LowerCaseFilterFactory"/>  下面可选择性的添加一些自己的过滤器 <fi

solr配置schema.xml学习

solr创建索引.添加数据的关键是配置schema.xml文件,该文件中主要是完成配置数据源.索引字段.数据类型等定义.同时,该文件的配置直接影响到solr搜索的效率和准确性. 一.搜索类型FileType name:指的是FileType的名字 class:指向org.apache.solr.analysis包里面对应的class名称,用来定义这个类型的行为 <types> <fieldType name="string" class="solr.StrF

solr的管控台介绍

一.添加---通过管控台添加数据到solr索引库 1.打开solrCore目录中的conf下的schema.xml文件,查看Solr索引库提供的内置的字段 和字段类型,查看标签field和fieldtype schema.xml作用: 声明solr的字段以及字段类型 field标签:声明自定义的字段 fieldtype标签:声明自定义的字段类型 <!-- indexed=true表示solr会为此字段添加索引 --> <field name="id" type=&qu