Commits and NRT Search in SolrCloud

Last modified: October 20, 2017

1. Overview

Solr is one of the most popular Lucene-based search solutions. It’s fast, distributed, robust, flexible and has an active developer community behind it. SolrCloud is the new, distributed version of Solr.

One of its key features is near real-time (NRT) search: documents become available for search shortly after they are indexed.

2. Indexing in SolrCloud

A collection in Solr is made up of multiple shards, and each shard has one or more replicas. One of the replicas of a shard is selected as the leader for that shard when a collection is created. Indexing a document then proceeds as follows:

  • When a client tries to index a document, the document is first assigned a shard based on the hash of the id of the document
  • The client gets the URL of the leader of that shard from ZooKeeper, and finally, the index request is made to that URL
  • The shard leader indexes the document locally before sending it to replicas
  • Once the leader receives an acknowledgment from all active and recovering replicas, it returns confirmation to the indexing client application
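
The first step above, picking a shard from the hash of the document id, can be sketched in a few lines. This is only an illustration: the class name, the shard names, and the use of String.hashCode() with a modulo are assumptions made for the sketch. Solr's own router uses a different hash function and assigns each shard a range of hash values.

```java
import java.util.Arrays;
import java.util.List;

public class ShardRouterSketch {

    // Maps a document id to one of the given shards.
    // Hypothetical simplification: hashCode() modulo the shard count.
    public static String shardFor(String docId, List<String> shards) {
        // floorMod keeps the index non-negative even for negative hash codes
        int index = Math.floorMod(docId.hashCode(), shards.size());
        return shards.get(index);
    }

    public static void main(String[] args) {
        // Shard names are made up for this example
        List<String> shards = Arrays.asList("shard1", "shard2", "shard3");
        for (String id : Arrays.asList("123abc", "456def", "789ghi")) {
            System.out.println(id + " -> " + shardFor(id, shards));
        }
    }
}
```

The property that matters is determinism: the same id always lands on the same shard, so later updates and lookups for that document go to the same place.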

When we index a document in Solr, it doesn’t go to the index directly. It’s first written to what is called a tlog (transaction log). Solr uses the transaction log to ensure that documents are not lost before they are committed, in case of a system crash.

If the system crashes before the documents in the transaction log are committed, i.e., persisted to disk, the transaction log is replayed when the system comes back up, so no documents are lost.

Every index/update request is logged to the transaction log, which continues to grow until we issue a hard commit.
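
The role of the transaction log can be pictured with a small in-memory model. This is a toy sketch, not Solr's implementation: the class and method names are invented, the "disk" is just a map, and clearing the log on commit stands in for Solr opening a new log file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TransactionLogSketch {
    // Stands in for the durable transaction log (survives a crash)
    private final List<String[]> tlog = new ArrayList<>();
    // Stands in for uncommitted index state held in memory (lost on a crash)
    private final Map<String, String> inMemory = new LinkedHashMap<>();
    // Stands in for committed index data on disk (survives a crash)
    private final Map<String, String> onDisk = new LinkedHashMap<>();

    // Every update is logged first, then applied in memory
    public void add(String id, String body) {
        tlog.add(new String[] {id, body});
        inMemory.put(id, body);
    }

    // A hard commit persists pending documents and starts a fresh log
    public void hardCommit() {
        onDisk.putAll(inMemory);
        inMemory.clear();
        tlog.clear();
    }

    // A crash wipes everything that was only in memory
    public void crash() {
        inMemory.clear();
    }

    // On restart, the log is replayed to rebuild the uncommitted state
    public void replay() {
        for (String[] entry : tlog) {
            inMemory.put(entry[0], entry[1]);
        }
    }

    public boolean contains(String id) {
        return onDisk.containsKey(id) || inMemory.containsKey(id);
    }

    public int tlogSize() {
        return tlog.size();
    }

    public static void main(String[] args) {
        TransactionLogSketch core = new TransactionLogSketch();
        core.add("123abc", "To kill a mockingbird");
        core.crash();
        core.replay();
        System.out.println("document recovered: " + core.contains("123abc"));
        core.hardCommit();
        System.out.println("tlog entries after hard commit: " + core.tlogSize());
    }
}
```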

3. Commits in SolrCloud

A commit operation means finalizing a change and persisting that change on disk. SolrCloud provides two kinds of commit operations: a commit (hard commit) and a soft commit.

3.1. Commit (Hard Commit)

A commit or hard commit is one in which Solr flushes all uncommitted documents in a transaction log to disk. The active transaction log is processed, and then a new transaction log file is opened.

It also refreshes a component called a searcher so that the newly committed documents become available for searching. A searcher can be considered as a read-only view of all committed documents in the index.

The commit operation can be triggered explicitly by the client by calling the commit API:

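import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;
// ZooKeeper ensemble address, including the /solr chroot
String zkHostString = "zkServer1:2181,zkServer2:2181,zkServer3:2181/solr";
SolrClient solr = new CloudSolrClient.Builder()
  .withZkHost(zkHostString)
  .build();
SolrInputDocument doc1 = new SolrInputDocument();
doc1.addField("id", "123abc");
doc1.addField("date", "14/10/2017");
doc1.addField("book", "To kill a mockingbird");
doc1.addField("author", "Harper Lee");
// A cloud client needs a target collection; "myCollection" is assumed here
solr.add("myCollection", doc1);
// Explicit hard commit issued by the client
solr.commit("myCollection");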

Alternatively, it can be automated as autoCommit by specifying it in the solrconfig.xml file, see section 3.3.

3.2. SoftCommit

Softcommit was introduced in Solr 4, primarily to support the NRT feature of SolrCloud. It’s a mechanism for making documents searchable in near real-time by skipping the costly aspects of hard commits.

During a softcommit, the transaction log is not truncated; it continues to grow. However, a new searcher is opened, which makes the documents indexed since the last softcommit visible for searching. Also, some of the top-level caches in Solr are invalidated, so it’s not a completely free operation.

When we specify the maxTime for softcommit as 1000, it means that the document will be available in queries no later than 1 second from the time it got indexed.

This feature gives SolrCloud its near real-time search capability, as new documents can be made searchable even without a hard commit. Softcommit is usually configured as autoSoftCommit in the solrconfig.xml file (see section 3.3), although a client can also request one explicitly by sending a commit with softCommit=true.

3.3. Autocommit and Autosoftcommit

The solrconfig.xml file is one of the most important configuration files in SolrCloud. It is generated at the time of collection creation. To enable autoCommit or autoSoftCommit, we need to update the following sections in the file:

<autoCommit>
  <maxDocs>10000</maxDocs>
  <maxTime>30000</maxTime>
  <openSearcher>true</openSearcher>
</autoCommit>

<autoSoftCommit>
  <maxTime>6000</maxTime>
  <maxDocs>1000</maxDocs>
</autoSoftCommit>

maxTime: The number of milliseconds since the earliest uncommitted update after which the next commit/softcommit should happen.

maxDocs: The number of updates that have occurred since the last commit and after which the next commit/softcommit should happen.

openSearcher: This property tells Solr whether to open a new searcher after a commit operation. If it’s true, the old searcher is closed after the commit and a new searcher is opened, making the committed documents visible for searching. If it’s false, the commit still persists the documents, but they don’t become visible for searching until a new searcher is opened, for example by a softcommit.
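
To make the interplay of maxTime and maxDocs concrete, here is a minimal sketch of the trigger rule described above: a commit is due once either threshold is reached. The class and method names are hypothetical and this is not how Solr implements its commit tracking; the sketch only models the decision.

```java
public class AutoCommitPolicySketch {
    private final long maxTimeMs;
    private final int maxDocs;
    private long earliestPendingAtMs = -1; // time of the earliest uncommitted update
    private int pendingDocs = 0;           // updates since the last commit

    public AutoCommitPolicySketch(long maxTimeMs, int maxDocs) {
        this.maxTimeMs = maxTimeMs;
        this.maxDocs = maxDocs;
    }

    // Called for every index/update request
    public void onUpdate(long nowMs) {
        if (pendingDocs == 0) {
            earliestPendingAtMs = nowMs;
        }
        pendingDocs++;
    }

    // A commit is due when either threshold has been reached
    public boolean commitDue(long nowMs) {
        if (pendingDocs == 0) {
            return false;
        }
        boolean tooManyDocs = pendingDocs >= maxDocs;
        boolean tooOld = nowMs - earliestPendingAtMs >= maxTimeMs;
        return tooManyDocs || tooOld;
    }

    // Called once the commit has happened
    public void onCommit() {
        pendingDocs = 0;
        earliestPendingAtMs = -1;
    }

    public static void main(String[] args) {
        // Same values as the autoCommit block above
        AutoCommitPolicySketch policy = new AutoCommitPolicySketch(30000, 10000);
        policy.onUpdate(0);
        System.out.println("due after 10 s: " + policy.commitDue(10000));
        System.out.println("due after 30 s: " + policy.commitDue(30000));
    }
}
```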

4. Near Real-Time Search

Near Real-Time Searching is achieved in Solr using a combination of commit and softcommit. As mentioned before, when a document is added to Solr, it won’t be visible in search results until it’s committed to the index.

Normal commits are costly, which is why softcommits are useful. But, as softcommit doesn’t persist the documents, we do need to set the autocommit maxTime interval (or maxDocs) to a reasonable value, depending upon the load we are expecting.

4.1. Real-Time Gets

There is another feature provided by Solr which is in fact real-time: the get API. The get API can return a document that is not even soft committed yet.

It searches directly in the transaction log if the document is not found in the index. So we can fire a get API call immediately after the index call returns, and we’ll still be able to retrieve the document.
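
The difference between a regular search and a real-time get can be modeled with two maps: one for updates that are logged but not yet visible, and one for what the current searcher sees. This is a toy sketch with invented names, not Solr's lookup code. It checks the pending updates first so that the newest version of a document wins.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class RealTimeGetSketch {
    // Updates that are logged but not yet visible to the searcher
    private final Map<String, String> pending = new HashMap<>();
    // Documents visible to the current searcher
    private final Map<String, String> visible = new HashMap<>();

    public void index(String id, String doc) {
        pending.put(id, doc);
    }

    // A soft commit opens a new searcher that sees the pending documents
    public void softCommit() {
        visible.putAll(pending);
        pending.clear();
    }

    // A regular search only sees what the current searcher sees
    public Optional<String> search(String id) {
        return Optional.ofNullable(visible.get(id));
    }

    // A real-time get also consults the not-yet-visible updates
    public Optional<String> realTimeGet(String id) {
        String doc = pending.get(id);
        if (doc == null) {
            doc = visible.get(id);
        }
        return Optional.ofNullable(doc);
    }

    public static void main(String[] args) {
        RealTimeGetSketch core = new RealTimeGetSketch();
        core.index("1234", "baeldung");
        System.out.println("search before soft commit: " + core.search("1234").isPresent());
        System.out.println("get before soft commit: " + core.realTimeGet("1234").isPresent());
        core.softCommit();
        System.out.println("search after soft commit: " + core.search("1234").isPresent());
    }
}
```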

However, like all too-good things, there is a catch here. We need to pass the id of the document in the get API call. Of course, we can provide other filter queries along with the id, but without id, the call doesn’t work:

http://localhost:8985/solr/myCollection/get?id=1234&fq=name:baeldung
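
A request URL like the one above can be assembled with the standard library alone. The helper below is a hypothetical sketch (the class and method names are made up). It enforces the rule that the id is mandatory and treats the filter query as optional. Query values are URL-encoded, so the colon in the filter appears as %3A.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class RealTimeGetUrlSketch {

    // Builds a real-time get URL; id is required, filterQuery may be null
    public static String buildGetUrl(String baseUrl, String collection,
                                     String id, String filterQuery) {
        if (id == null || id.isEmpty()) {
            throw new IllegalArgumentException("the get API needs a document id");
        }
        StringBuilder url = new StringBuilder(baseUrl)
            .append('/').append(collection)
            .append("/get?id=").append(encode(id));
        if (filterQuery != null && !filterQuery.isEmpty()) {
            url.append("&fq=").append(encode(filterQuery));
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildGetUrl(
            "http://localhost:8985/solr", "myCollection", "1234", "name:baeldung"));
    }
}
```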

5. Conclusion

Solr provides quite a bit of flexibility to us regarding tweaking the NRT capability. To get the best performance out of the server, we need to experiment with the values of commits and softcommits, based upon our use case and expected load.

We shouldn’t keep our commit interval too long, or else our transaction log will grow to a considerable size. At the same time, we shouldn’t execute our softcommits too frequently, since each one opens a new searcher and invalidates caches.

It is also advisable to do proper performance testing of our system before we go to production. We should check whether documents become searchable within our desired time interval.