Full-text Search with Solr

Last modified: March 26, 2017

1. Overview

In this article, we’ll explore a fundamental concept in the Apache Solr search engine – full-text search.

Apache Solr is an open-source search platform designed to handle millions of documents. We’ll go through its core capabilities with examples using its Java client library, SolrJ.

2. Maven Configuration

Since Solr is open source, we can simply download the binary and start the server separately from our application.

To communicate with the server, we’ll define the Maven dependency for the SolrJ client:

<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-solrj</artifactId>
    <version>6.4.2</version>
</dependency>

The latest version of this dependency is available on Maven Central.

3. Indexing Data

To index and search data, we need to create a core; we’ll create one named item to index our data.

Before we can search, we need to index data on the server so that it becomes searchable.

There are many different ways we can index data. We can use data import handlers to import data directly from relational databases, upload data with Solr Cell using Apache Tika or upload XML/XSLT, JSON and CSV data using index handlers.

3.1. Indexing Solr Document

We can index data into a core by creating a SolrInputDocument. First, we populate the document with our data, and then we call SolrJ’s API to index it:

SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", id);
doc.addField("description", description);
doc.addField("category", category);
doc.addField("price", price);
solrClient.add(doc);
solrClient.commit();

Note that id should be unique for each item. Indexing a document whose id matches an already indexed document will update that document.

3.2. Indexing Beans

SolrJ provides APIs for indexing Java beans. To index a bean, we need to annotate it with the @Field annotations:

public class Item {

    @Field
    private String id;

    @Field
    private String description;

    @Field
    private String category;

    @Field
    private float price;
}

Once we have the bean, indexing is straightforward:

solrClient.addBean(item); 
solrClient.commit();

4. Solr Queries

Searching is the most powerful capability of Solr. Once we have the documents indexed in our repository, we can search for keywords, phrases, date ranges, etc. The results are sorted by relevance (score).

4.1. Basic Queries

The server exposes an API for search operations. We can call either the /select or the /query request handler.

Let’s do a simple search:

SolrQuery query = new SolrQuery();
query.setQuery("brand1");
query.setStart(0);
query.setRows(10);

QueryResponse response = solrClient.query(query);
List<Item> items = response.getBeans(Item.class);

SolrJ will internally use the main query parameter q in its request to the server. When start and rows are not specified, they default to 0 and 10, so the first 10 records are returned.

The search query above will look for any documents that contain the complete word “brand1” in any of their indexed fields. Note that simple searches are not case sensitive when the field type lowercases terms during analysis, as Solr’s default text field types do.

Let’s look at another example. We want to search for any word containing “rand” that starts with any number of characters and ends with exactly one character. We can use the wildcard characters * and ? in our query:

query.setQuery("*rand?");
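
To make the wildcard semantics concrete, here is a minimal sketch in plain Java, with no Solr involved, that translates such a pattern into a regular expression. The class and method names are hypothetical, and Solr itself matches wildcards against indexed terms rather than running a regex like this one:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of SolrJ: mimics * and ? wildcard matching on a single term.
class WildcardSketch {

    // Converts a wildcard pattern into an equivalent regular expression.
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                regex.append(".*");      // zero or more characters
            } else if (c == '?') {
                regex.append('.');       // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String wildcard, String term) {
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("*rand?", "brand1")); // true
        System.out.println(matches("*rand?", "brand"));  // false: nothing follows "rand"
    }
}
```

In this sketch, “brand1” matches because * absorbs the leading “b” and ? absorbs the trailing “1”, while “brand” does not match because ? requires exactly one character after “rand”.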

Solr queries also support boolean operators like in SQL:

query.setQuery("brand1 AND (Washing OR Refrigerator)");

The word operators must be in all caps; the operators supported by the query parser are AND, OR, NOT, + and -.

What’s more, if we want to search on specific fields instead of all indexed fields, we can specify these in the query:

query.setQuery("description:Brand* AND category:*Washing*");

4.2. Phrase Queries

Up to this point, our code looked for keywords in the indexed fields. We can also do phrase searches on the indexed fields:

query.setQuery("Washing Machine");

When we have a phrase like “Washing Machine”, Solr’s standard query parser by default parses it as “Washing OR Machine”. To search for the whole phrase, we just need to wrap the expression in double quotes:

query.setQuery("\"Washing Machine\"");

We can use proximity search to find words within a specific distance of each other. If we want to find documents in which the two words are at most two words apart, we can use the following query:

query.setQuery("\"Washing equipment\"~2");

4.3. Range Queries

Range queries return documents whose field values fall within a specific range.

Let’s say we want to find items whose price is between 100 and 300:

query.setQuery("price:[100 TO 300]");

The query above will find all the items whose price is between 100 and 300, inclusive. We can use “{” and “}” to exclude an end point; the following query excludes 100 but includes 300:

query.setQuery("price:{100 TO 300]");
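
The bracket semantics can be illustrated with a small plain-Java sketch. The RangeSketch class is hypothetical and only models how inclusive and exclusive bounds behave; it is not how Solr evaluates range queries internally:

```java
// Hypothetical helper, not part of SolrJ: models the bracket semantics of a numeric range query.
class RangeSketch {
    final double lower;
    final double upper;
    final boolean includeLower;
    final boolean includeUpper;

    // Parses expressions such as "[100 TO 300]" or "{100 TO 300]".
    RangeSketch(String expr) {
        String trimmed = expr.trim();
        includeLower = trimmed.charAt(0) == '[';                     // '[' includes, '{' excludes
        includeUpper = trimmed.charAt(trimmed.length() - 1) == ']';  // ']' includes, '}' excludes
        String[] bounds = trimmed.substring(1, trimmed.length() - 1).split("\\s+TO\\s+");
        lower = Double.parseDouble(bounds[0]);
        upper = Double.parseDouble(bounds[1]);
    }

    boolean contains(double value) {
        boolean aboveLower = includeLower ? value >= lower : value > lower;
        boolean belowUpper = includeUpper ? value <= upper : value < upper;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        System.out.println(new RangeSketch("[100 TO 300]").contains(100)); // true
        System.out.println(new RangeSketch("{100 TO 300]").contains(100)); // false
        System.out.println(new RangeSketch("{100 TO 300]").contains(300)); // true
    }
}
```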

4.4. Filter Queries

Filter queries restrict the set of documents that a search can return. A filter query does not influence the score:

SolrQuery query = new SolrQuery();
query.setQuery("price:[100 TO 300]");
query.addFilterQuery("description:Brand1", "category:\"Home Appliances\"");

Generally, filter queries contain commonly used restrictions. Since they’re often reusable, Solr caches them to make searches more efficient.
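
Values that contain whitespace need to be quoted inside a field clause, because the standard query parser applies the field name only to the term that immediately follows the colon. A minimal sketch of a helper that builds such a clause is shown here; the FilterQuerySketch class and its fieldFilter method are hypothetical names, not part of SolrJ:

```java
// Hypothetical helper, not part of SolrJ: builds a field:"value" clause for a filter query.
class FilterQuerySketch {

    static String fieldFilter(String field, String value) {
        // Escape backslashes and double quotes, then wrap the value in quotes
        // so that a multi-word value is treated as a single phrase.
        String escaped = value.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        System.out.println(fieldFilter("category", "Home Appliances"));
        // prints category:"Home Appliances"
    }
}
```

The resulting string could then be passed to query.addFilterQuery. For escaping other special query characters, SolrJ also provides ClientUtils.escapeQueryChars.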

5. Faceted Search

Faceting arranges search results into groups and returns a count for each group. We can facet on fields, queries, or ranges.

5.1. Field Faceting

For example, suppose we want the aggregated counts of categories in the search result. We can add the category field to our query:

query.addFacetField("category");

QueryResponse response = solrClient.query(query);
List<Count> facetResults = response.getFacetField("category").getValues();

The facetResults list will contain the count for each category in the results.
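
As a rough mental model of what a field facet returns, the following plain-Java sketch groups the category values of some matching documents and counts them. It is a toy in-memory illustration with assumed sample data and hypothetical names; Solr computes these counts on the server from its index:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Toy in-memory model, not SolrJ: counts how many documents fall into each category.
class FieldFacetSketch {

    static Map<String, Long> facetCounts(List<String> categories) {
        // Group identical values and count them; TreeMap keeps the keys sorted.
        return categories.stream()
            .collect(Collectors.groupingBy(c -> c, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Assumed sample data: the category values of four matching documents.
        List<String> categories = Arrays.asList(
            "Home Appliances", "Home Appliances", "Home Appliances", "Washing equipments");
        System.out.println(facetCounts(categories));
        // prints {Home Appliances=3, Washing equipments=1}
    }
}
```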

5.2. Query Faceting

Query faceting is very useful when we want to bring back counts of subqueries:

query.addFacetQuery("Washing OR Refrigerator");
query.addFacetQuery("Brand2");

QueryResponse response = solrClient.query(query);
Map<String,Integer> facetQueryMap = response.getFacetQuery();

As a result, the facetQueryMap will have counts of facet queries.

5.3. Range Faceting

Range faceting is used to get the range counts in the search results. The following query will return the counts of price ranges between 100 and 275, in gaps of 25:

query.addNumericRangeFacet("price", 100, 275, 25);

QueryResponse response = solrClient.query(query);
List<RangeFacet.Count> rangeFacets = response.getFacetRanges().get(0).getCounts();
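
To see which buckets a start of 100, an end of 275 and a gap of 25 produce, here is a toy plain-Java sketch. It assumes each bucket includes its lower bound and excludes its upper bound, which is Solr’s default behaviour for range facets; the class name and the sample prices are assumptions for illustration only:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy in-memory model, not SolrJ: buckets values the way a numeric range facet does.
class RangeFacetSketch {

    static Map<Integer, Integer> rangeCounts(double[] values, int start, int end, int gap) {
        Map<Integer, Integer> counts = new LinkedHashMap<>();
        for (int lower = start; lower < end; lower += gap) {
            counts.put(lower, 0);            // one bucket per gap, keyed by its lower bound
        }
        for (double value : values) {
            if (value < start || value >= end) {
                continue;                    // outside the faceted range
            }
            int lower = start + (int) ((value - start) / gap) * gap;
            counts.merge(lower, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] prices = {100, 300, 200, 250};   // assumed sample prices
        System.out.println(rangeCounts(prices, 100, 275, 25));
        // prints {100=1, 125=0, 150=0, 175=0, 200=1, 225=0, 250=1}
    }
}
```

The price 300 is not counted because it lies outside the faceted range.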

Apart from numeric ranges, Solr also supports date ranges, interval faceting, and pivot faceting.

6. Hit Highlighting

We may want the keywords in our search query to be highlighted in the results. This will be very helpful to get a better picture of the results. Let’s index some documents and define keywords to be highlighted:

itemSearchService.index("hm0001", "Brand1 Washing Machine", "Home Appliances", 100f);
itemSearchService.index("hm0002", "Brand1 Refrigerator", "Home Appliances", 300f);
itemSearchService.index("hm0003", "Brand2 Ceiling Fan", "Home Appliances", 200f);
itemSearchService.index("hm0004", "Brand2 Dishwasher", "Washing equipments", 250f);

SolrQuery query = new SolrQuery();
query.setQuery("Appliances");
query.setHighlight(true);
query.addHighlightField("category");
QueryResponse response = solrClient.query(query);

Map<String, Map<String, List<String>>> hitHighlightedMap = response.getHighlighting();
Map<String, List<String>> highlightedFieldMap = hitHighlightedMap.get("hm0001");
List<String> highlightedList = highlightedFieldMap.get("category");
String highLightedText = highlightedList.get(0);

We’ll get the highLightedText as “Home <em>Appliances</em>”. Notice that the search keyword Appliances is wrapped in <em> tags. The default highlighting tag used by Solr is <em>, but we can change this by setting the pre and post tags:

query.setHighlightSimplePre("<strong>");
query.setHighlightSimplePost("</strong>");
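
The effect of the pre and post tags can be pictured with a small plain-Java sketch that wraps each occurrence of a keyword in a pair of tags. This is a toy illustration with hypothetical names, not Solr’s highlighter, which works on analyzed terms and can return fragments rather than whole field values:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration, not Solr's highlighter: wraps each occurrence of a keyword in pre/post tags.
class HighlightSketch {

    static String highlight(String text, String keyword, String pre, String post) {
        // Match the keyword as a whole word, ignoring case.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b",
            Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // quoteReplacement keeps '$' and '\' from being interpreted as group references.
            matcher.appendReplacement(result,
                Matcher.quoteReplacement(pre + matcher.group() + post));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Home Appliances", "Appliances", "<strong>", "</strong>"));
        // prints Home <strong>Appliances</strong>
    }
}
```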

7. Search Suggestions

One of the important features that Solr supports is suggestions. If the keywords in the query contain spelling mistakes, or if we want to offer autocompletion for a search keyword, we can use the suggestion feature.

7.1. Spell Checking

The standard search handler does not include a spell checking component; it has to be configured manually. There are three ways to do it. You can find the configuration details in the official wiki page. In our example, we’ll use IndexBasedSpellChecker, which uses indexed data for keyword spell checking.

Let’s search for a keyword with a spelling mistake:

query.setQuery("hme");
query.set("spellcheck", "on");
QueryResponse response = solrClient.query(query);

SpellCheckResponse spellCheckResponse = response.getSpellCheckResponse();
Suggestion suggestion = spellCheckResponse.getSuggestions().get(0);
List<String> alternatives = suggestion.getAlternatives();
String alternative = alternatives.get(0);

The expected alternative for our keyword “hme” is “home”, since our index contains the term “home”. Note that spellcheck has to be activated before executing the search.
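
The idea of suggesting a correction from indexed data can be sketched in plain Java by picking the indexed term with the smallest edit distance to the misspelled keyword. This is only a toy illustration with assumed index terms and hypothetical names; it does not reproduce how IndexBasedSpellChecker selects and ranks its candidates:

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration, not Solr's spell checker: suggests the indexed term closest to the input.
class SpellCheckSketch {

    // Classic dynamic-programming Levenshtein edit distance.
    static int distance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                current[j] = Math.min(substitution, Math.min(previous[j], current[j - 1]) + 1);
            }
            previous = current;
        }
        return previous[b.length()];
    }

    // Returns the term with the smallest edit distance to the lowercased input.
    static String suggest(String input, List<String> indexedTerms) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String term : indexedTerms) {
            int d = distance(input.toLowerCase(), term);
            if (d < bestDistance) {
                bestDistance = d;
                best = term;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed index terms, lowercased as an analyzer might store them.
        List<String> terms = Arrays.asList("home", "appliances", "washing", "machine");
        System.out.println(suggest("hme", terms)); // prints home
    }
}
```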

7.2. Auto Suggesting Terms

We may want to get the suggestions of incomplete keywords to assist with the search. Solr’s suggest component has to be configured manually. You can find the configuration details in its official wiki page.

We have configured a request handler named /suggest to handle suggestions. Let’s get suggestions for the keyword “Hom”:

SolrQuery query = new SolrQuery();
query.setRequestHandler("/suggest");
query.set("suggest", "true");
query.set("suggest.build", "true");
query.set("suggest.dictionary", "mySuggester");
query.set("suggest.q", "Hom");
QueryResponse response = solrClient.query(query);
        
SuggesterResponse suggesterResponse = response.getSuggesterResponse();
Map<String,List<String>> suggestedTerms = suggesterResponse.getSuggestedTerms();
List<String> suggestions = suggestedTerms.get("mySuggester");

The suggestions list should contain all words and phrases that the suggester proposes for “Hom”. Note that we have configured a suggester named mySuggester in our configuration.
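
Conceptually, a simple suggester behaves like a prefix lookup over a dictionary built from indexed values. The plain-Java sketch below illustrates that idea with an assumed dictionary and hypothetical names; the actual matching behaviour depends on the lookup implementation configured for the suggester:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration, not Solr's suggester: returns the entries that start with the typed prefix.
class SuggestSketch {

    static List<String> suggest(String prefix, List<String> dictionary) {
        String needle = prefix.toLowerCase();
        List<String> matches = new ArrayList<>();
        for (String entry : dictionary) {
            if (entry.toLowerCase().startsWith(needle)) {
                matches.add(entry);   // case-insensitive prefix match
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Assumed dictionary built from indexed field values.
        List<String> dictionary = Arrays.asList("Home Appliances", "Washing equipments", "Homeware");
        System.out.println(suggest("Hom", dictionary));
        // prints [Home Appliances, Homeware]
    }
}
```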

8. Conclusion

This article was a quick introduction to Solr’s search capabilities and features.

We touched on many features, but these are of course just scratching the surface of what we can do with an advanced and mature search server such as Solr.

As always, the examples used here are available over on GitHub.