Understanding Solr Nodes, Cores, Clusters and Leaders, Shards and Indexing Data

https://cwiki.apache.org/confluence/display/solr/Nodes%2C+Cores%2C+Clusters+and+Leaders

Nodes and Cores

In SolrCloud, a node is a Java Virtual Machine instance running Solr, commonly called a server. Each Solr core can also be considered a node. Any node can contain both an instance of Solr and various kinds of data.

A Solr core is basically an index of the text and fields found in documents. A single Solr instance can contain multiple "cores", which are separate from each other based on local criteria. It might be that they are going to provide different search interfaces to users (customers in the US and customers in Canada, for example), or they have security concerns (some users cannot have access to some documents), or the documents are really different and just won't mix well in the same index (a shoe database and a DVD database).

When you start a new core in SolrCloud mode, it registers itself with ZooKeeper. This involves creating an Ephemeral node that will go away if the Solr instance goes down, as well as registering information about the core and how to contact it (such as the base Solr URL, core name, etc). Smart clients and nodes in the cluster can use this information to determine who they need to talk to in order to fulfill a request.

New Solr cores may also be created and associated with a collection via CoreAdmin. Additional cloud-related parameters are discussed in the Parameter Reference page. Terms used for the CREATE action are:

  • collection: the name of the collection to which this core belongs. Default is the name of the core.
  • shard: the shard id this core represents. (Optional: normally you want to be auto assigned a shard id.)
  • collection.<param>=<value>: causes a property of <param>=<value> to be set if a new collection is being created. For example, use collection.configName=<configname> to point to the config for a new collection.

For example:

curl 'http://localhost:8983/solr/admin/cores?action=CREATE&name=mycore&collection=my_collection&shard=shard2'

Clusters

A cluster is a set of Solr nodes managed by ZooKeeper as a single unit. When you have a cluster, you can always make requests to the cluster, and if the request is acknowledged, you can be sure that it will be managed as a unit and be durable, i.e., you won't lose data. Updates can be seen right after they are made and the cluster can be expanded or contracted.

Creating a Cluster

A cluster is created as soon as you have more than one Solr instance registered with ZooKeeper. The section Getting Started with SolrCloud reviews how to set up a simple cluster.

Resizing a Cluster

Clusters contain a settable number of shards. You set the number of shards for a new cluster by passing a system property, numShards, when you start up Solr. The numShards parameter must be passed on the first startup of any Solr node, and is used to auto-assign which shard each instance should be part of. Once you have started up more Solr nodes than numShards, the nodes will create replicas for each shard, distributing them evenly across the nodes, as long as they all belong to the same collection.
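
For example, here is a minimal sketch of passing numShards at first startup, assuming the classic Jetty start.jar layout from the SolrCloud getting-started example (the paths, ports, and config name below are assumptions, not part of the original text):

cd example
java -DzkRun -DnumShards=2 -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf -jar start.jar

# later nodes only point at the same ZooKeeper; numShards is not passed again
cd ../example2
java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar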

To add more cores to your collection, simply start the new core. You can do this at any time and the new core will sync its data with the current replicas in the shard before becoming active.

You can also avoid numShards and manually assign a core a shard ID if you choose.

The number of shards determines how the data in your index is broken up, so you cannot change the number of shards of the index after initially setting up the cluster.

However, you do have the option of breaking your index into multiple shards to start with, even if you are only using a single machine. You can then expand to multiple machines later. To do that, follow these steps:

  1. Set up your collection by hosting multiple cores on a single physical machine (or group of machines). Each of these cores will be the leader for its shard.
  2. When you're ready, you can migrate shards onto new machines by starting up a new replica for a given shard on each new machine (see the sketch after this list).
  3. Remove the shard from the original machine. ZooKeeper will promote the replica to the leader for that shard.
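
A minimal sketch of steps 2 and 3 using the CoreAdmin API shown earlier; the host names and core names are hypothetical:

# step 2: start a new replica for shard1 on the new machine
curl 'http://newhost:8983/solr/admin/cores?action=CREATE&name=my_collection_shard1_replica2&collection=my_collection&shard=shard1'

# step 3: once the new replica is active, unload the old core on the original machine
curl 'http://oldhost:8983/solr/admin/cores?action=UNLOAD&core=my_collection_shard1_replica1'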

Leaders and Replicas

The concept of a leader is similar to that of master when thinking of traditional Solr replication. The leader is responsible for making sure the replicas are up to date with the same information stored in the leader.

However, with SolrCloud, you don't simply have one master and one or more "slaves"; instead, you likely have distributed your search and index traffic to multiple machines. If you have bootstrapped Solr with numShards=2, for example, your indexes are split across both shards. In this case, both shards are considered leaders. If you start more Solr nodes after the initial two, these will be automatically assigned as replicas for the leaders.

Replicas are assigned to shards in the order they are started the first time they join the cluster. This is done in a round-robin manner, unless the new node is manually assigned to a shard with the shardId parameter during startup. This parameter is used as a system property, as in -DshardId=1, the value of which is the ID number of the shard the new node should be attached to.

On subsequent restarts, each node joins the same shard that it was assigned to the first time the node was started (whether that assignment happened manually or automatically). A node that was previously a replica, however, may become the leader if the previously assigned leader is not available.

Consider this example:

  • Node A is started with the bootstrap parameters, pointing to a stand-alone ZooKeeper, with the numShards parameter set to 2.
  • Node B is started and pointed to the stand-alone ZooKeeper.

Nodes A and B each host one shard, fulfilling the 2 shard slots we defined when we started Node A. If we look in the Solr Admin UI, we'll see that both nodes are considered leaders (indicated with a solid black circle).

  • Node C is started and pointed to the stand-alone ZooKeeper.

Node C will automatically become a replica of Node A because we didn't specify any other shard for it to belong to, and it cannot become a new shard because we only defined two shards and those have both been taken.

  • Node D is started and pointed to the stand-alone ZooKeeper.

Node D will automatically become a replica of Node B, for the same reasons why Node C is a replica of Node A.

Upon restart, suppose that Node C starts before Node A. What happens? Node C will become the leader, while Node A becomes a replica of Node C.

Shards and Indexing Data in SolrCloud

When your data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each shard is a portion of the logical index, or core, and it's the set of all nodes containing that section of the index.

A shard is a way of splitting a core over a number of "servers", or nodes. For example, you might have a shard for data that represents each state, or different categories that are likely to be searched independently, but are often combined.

Before SolrCloud, Solr supported Distributed Search, which allowed one query to be executed across multiple shards, so the query was executed against the entire Solr index and no documents would be missed from the search results. So splitting the core across shards is not exclusively a SolrCloud concept. There were, however, several problems with the distributed approach that necessitated improvement with SolrCloud:

  1. Splitting of the core into shards was somewhat manual.
  2. There was no support for distributed indexing, which meant that you needed to explicitly send documents to a specific shard; Solr couldn't figure out on its own which shards to send documents to.
  3. There was no load balancing or failover, so if you got a high number of queries, you needed to figure out where to send them, and if one shard died it was just gone.

SolrCloud fixes all those problems. There is support for distributing both the index process and the queries automatically, and ZooKeeper provides failover and load balancing. Additionally, every shard can also have multiple replicas for additional robustness.

In SolrCloud there are no masters or slaves. Instead, there are leaders and replicas. Leaders are automatically elected, initially on a first-come-first-served basis, and then based on the ZooKeeper process described at http://zookeeper.apache.org/doc/trunk/recipes.html#sc_leaderElection.

If a leader goes down, one of its replicas is automatically elected as the new leader. As each node is started, it's assigned to the shard with the fewest replicas. When there's a tie, it's assigned to the shard with the lowest shard ID.

When a document is sent to a machine for indexing, the system first determines if the machine is a replica or a leader.

  • If the machine is a replica, the document is forwarded to the leader for processing.
  • If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document to the leader for that shard, indexes the document for this shard, and forwards the index notation to itself and any replicas (see the example below).
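
As a hedged illustration of this flow, an update can be sent to any node in the cluster and SolrCloud routes it to the correct shard leader; the collection name and document below are assumptions, and visibility of the update depends on your (soft) commit settings:

curl 'http://localhost:8983/solr/my_collection/update' -H 'Content-Type: application/json' -d '[{"id":"12345","name":"example document"}]'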

Document Routing

Solr offers the ability to specify the router implementation used by a collection by specifying the router.name parameter when creating your collection. If you use the "compositeId" router, you can send documents with a prefix in the document ID which will be used to calculate the hash Solr uses to determine the shard a document is sent to for indexing. The prefix can be anything you'd like it to be (it doesn't have to be the shard name, for example), but it must be consistent so Solr behaves consistently. For example, if you wanted to co-locate documents for a customer, you could use the customer name or ID as the prefix. If your customer is "IBM", for example, with a document with the ID "12345", you would insert the prefix into the document id field: "IBM!12345". The exclamation mark ('!') is critical here, as it distinguishes the prefix used to determine which shard to direct the document to.

Then at query time, you include the prefix(es) in your query with the _route_ parameter (e.g., q=solr&_route_=IBM!) to direct queries to specific shards. In some situations, this may improve query performance because it overcomes network latency when querying all the shards.
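
A hedged sketch of indexing and then querying with a compositeId route prefix; the collection name and field values are assumptions:

# index a document whose ID carries the routing prefix
curl 'http://localhost:8983/solr/my_collection/update' -H 'Content-Type: application/json' -d '[{"id":"IBM!12345","customer":"IBM"}]'

# query only the shard(s) that the "IBM!" prefix hashes to
curl 'http://localhost:8983/solr/my_collection/select?q=solr&_route_=IBM!'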

The _route_ parameter replaces shard.keys, which has been deprecated and will be removed in a future Solr release.

The compositeId router supports prefixes containing up to 2 levels of routing. For example, a prefix routing first by region and then by customer: "USA!IBM!12345".

Another use case could be if the customer "IBM" has a lot of documents and you want to spread them across multiple shards. The syntax for such a use case would be: "shard_key/num!document_id", where /num is the number of bits from the shard key to use in the composite hash.

So "IBM/3!12345" will take 3 bits from the shard key and 29 bits from the unique doc id, spreading the tenant over 1/8th of the shards in the collection. Likewise, if the num value was 2, it would spread the documents across 1/4th of the shards. At query time, you include the prefix(es) along with the number of bits in your query with the _route_ parameter (e.g., q=solr&_route_=IBM/3!) to direct queries to specific shards.

If you do not want to influence how documents are stored, you don't need to specify a prefix in your document ID.

If you created the collection and defined the "implicit" router at the time of creation, you can additionally define a router.field parameter to use a field from each document to identify a shard where the document belongs. If the field specified is missing in the document, however, the document will be rejected. You could also use the _route_ parameter to name a specific shard.
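
As a hedged example, a collection using the implicit router together with router.field might be created like this; the collection, shard, and field names are assumptions:

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=my_collection&router.name=implicit&shards=shard-east,shard-west&router.field=region'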

Shard Splitting

When you create a collection in SolrCloud, you decide on the initial number of shards to be used. But it can be difficult to know in advance the number of shards that you need, particularly when organizational requirements can change at a moment's notice, and the cost of finding out later that you chose wrong can be high, involving creating new cores and re-indexing all of your data.

The ability to split shards is in the Collections API. It currently allows splitting a shard into two pieces. The existing shard is left as-is, so the split action effectively makes two copies of the data as new shards. You can delete the old shard at a later time when you're ready.

More details on how to use shard splitting are in the section on the Collections API.
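
A hedged sketch of such a split request through the Collections API; the collection and shard names are assumptions:

curl 'http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=my_collection&shard=shard1'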

Ignoring Commits from Client Applications in SolrCloud

In most cases, when running in SolrCloud mode, indexing client applications should not send explicit commit requests. Rather, you should configure auto commits with openSearcher=false and auto soft-commits to make recent updates visible in search requests. This ensures that auto commits occur on a regular schedule in the cluster. To enforce a policy where client applications should not send explicit commits, you should update all client applications that index data into SolrCloud. However, that is not always feasible, so Solr provides the IgnoreCommitOptimizeUpdateProcessorFactory, which allows you to ignore explicit commits and/or optimize requests from client applications without having to refactor your client application code. To activate this request processor you'll need to add the following to your solrconfig.xml:

<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">200</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

As shown in the example above, the processor will return 200 to the client but will ignore the commit/optimize request. Notice that you need to wire in the implicit processors needed by SolrCloud as well, since this custom chain is taking the place of the default chain.

In the following example, the processor will raise an exception with a 403 code with a customized error message:

<updateRequestProcessorChain name="ignore-commit-from-client" default="true">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <int name="statusCode">403</int>
    <str name="responseMessage">Thou shall not issue a commit!</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.DistributedUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

Lastly, you can also configure it to just ignore optimize and let commits pass through by doing:

<updateRequestProcessorChain name="ignore-optimize-only-from-client-403">
  <processor class="solr.IgnoreCommitOptimizeUpdateProcessorFactory">
    <str name="responseMessage">Thou shall not issue an optimize, but commits are OK!</str>
    <bool name="ignoreOptimizeOnly">true</bool>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
