Elasticsearch Queries with Spring Data

Last modified: March 18, 2016

1. Introduction

In a previous article, we demonstrated how to configure and use Spring Data Elasticsearch in a project. In this article, we examine several query types offered by Elasticsearch and discuss field analyzers and their impact on search results.

2. Analyzers

By default, all stored string fields are processed by an analyzer. An analyzer consists of exactly one tokenizer and zero or more token filters, and may be preceded by one or more character filters.

The default (standard) analyzer splits the string at common word separators (such as spaces or punctuation) and lowercases every token. Note that it does not drop common English stop words unless it is configured with a stop-word list.

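To make the effect of analysis more concrete, here is a minimal sketch in plain Java of the two steps described above: splitting at separators and lowercasing. It is only a simplified model, not Elasticsearch's implementation; the class name SimpleAnalyzer and its analyze method are hypothetical, and stop-word handling, character filters and language-specific rules are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Simplified model of analysis (assumption: not how Elasticsearch is implemented).
class SimpleAnalyzer {

    // Splits on any run of characters that are not letters or digits,
    // then lowercases each remaining token.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [spring, data, elasticsearch]
        System.out.println(analyze("Spring Data Elasticsearch"));
    }
}
```

The tokens produced this way are what an analyzed field is matched against, which is why the case and order of the original text no longer matter for such a field.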

Elasticsearch can also be configured to regard a field as analyzed and not-analyzed at the same time.

For example, in an Article class, suppose we store the title field as a standard analyzed field. The same field with the suffix verbatim will be stored as a not-analyzed field:

@MultiField(
  mainField = @Field(type = Text, fielddata = true),
  otherFields = {
      @InnerField(suffix = "verbatim", type = Keyword)
  }
)
private String title;

Here, we apply the @MultiField annotation to tell Spring Data that we would like this field to be indexed in several ways. The main field will use the name title and will be analyzed according to the rules described above.

But we also provide a second annotation, @InnerField, which describes an additional indexing of the title field. We use FieldType.Keyword to indicate that no analyzer should be applied for this additional indexing, and that the value should be stored in a sub-field with the suffix verbatim (addressed as title.verbatim in queries).

2.1. Analyzed Fields

Let’s look at an example. Suppose an article with the title “Spring Data Elasticsearch” is added to our index. The default analyzer will break up the string at the space characters and produce lowercase tokens: “spring”, “data”, and “elasticsearch”.

Now we may use any combination of these terms to match a document:

NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title", "elasticsearch data"))
  .build();

2.2. Non-analyzed Fields

A non-analyzed field is not tokenized, so it can only be matched as a whole when using match or term queries:

NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title.verbatim", "Second Article About Elasticsearch"))
  .build();

With a match query on this field, we can only search by the full title, and the comparison is case-sensitive.

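The difference between the two kinds of field can be sketched in plain Java. This is a simplified model under stated assumptions, not the Elasticsearch API: the class FieldMatchSketch and its methods are hypothetical, the analyzed field is modeled as a set of lowercase tokens split on whitespace, and the verbatim field as the untouched string.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Simplified model of analyzed vs. not-analyzed matching (assumption, not the real engine).
class FieldMatchSketch {

    // Analyzed view of a text: lowercase tokens split on whitespace.
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+")));
    }

    // Models a match query with the default "or" operator on the analyzed "title" field:
    // the query text is analyzed too, and any shared token is enough.
    static boolean matchesAnalyzed(String title, String query) {
        Set<String> titleTokens = tokens(title);
        for (String term : tokens(query)) {
            if (titleTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    // Models a query on the not-analyzed "title.verbatim" field:
    // the whole value must be equal, including case.
    static boolean matchesVerbatim(String title, String query) {
        return title.equals(query);
    }

    public static void main(String[] args) {
        String title = "Second Article About Elasticsearch";
        System.out.println(matchesAnalyzed(title, "elasticsearch article"));   // prints true
        System.out.println(matchesVerbatim(title, "second article about elasticsearch")); // prints false
    }
}
```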

3. Match Query

A match query accepts text, numbers and dates.

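There are three types of “match” query (in recent Elasticsearch versions the latter two are exposed as the separate match_phrase and match_phrase_prefix queries):

  • boolean
  • phrase
  • phrase_prefix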

In this section, we will explore the boolean match query.

3.1. Matching With Boolean Operators

boolean is the default type of a match query. You can specify which boolean operator to use (or is the default):

NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title","Search engines").operator(Operator.AND))
  .build();
SearchHits<Article> articles = elasticsearchTemplate()
  .search(searchQuery, Article.class, IndexCoordinates.of("blog"));

Because both terms from the title are combined with the and operator, this query returns the article titled “Search engines”. But what happens if we search with the default (or) operator and only one of the terms matches?

NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title", "Engines Solutions"))
  .build();
SearchHits<Article> articles = elasticsearchTemplate()
  .search(searchQuery, Article.class, IndexCoordinates.of("blog"));
assertEquals(1, articles.getTotalHits());
assertEquals("Search engines", articles.getSearchHit(0).getContent().getTitle());

The “Search engines” article is still matched, but it will have a lower score because not all of the terms matched.

The total score of each resulting document is the sum of the scores of its matching terms.

As a result, a document containing one rare query term can rank higher than a document containing several common ones.

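The behaviour of the two operators and the effect of rare terms can be illustrated with a toy model in plain Java. This is a sketch under stated assumptions, not Elasticsearch's scoring code: the class MatchScoringSketch and its methods are hypothetical, each term's score is reduced to a BM25-style inverse document frequency, and term frequency and field-length normalization are ignored.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of operator handling and score summation (assumption, heavily simplified).
class MatchScoringSketch {

    // "or": at least one query term must be present; "and": all of them must be.
    static boolean matches(List<String> queryTerms, Set<String> docTerms, boolean requireAll) {
        long hits = queryTerms.stream().filter(docTerms::contains).count();
        return requireAll ? hits == queryTerms.size() : hits > 0;
    }

    // BM25-style inverse document frequency: the fewer documents contain a term,
    // the larger its weight.
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // Toy document score: the sum of the weights of the query terms found in the document.
    static double score(List<String> queryTerms, Set<String> docTerms,
                        Map<String, Long> docFreqs, long docCount) {
        return queryTerms.stream()
            .filter(docTerms::contains)
            .mapToDouble(t -> idf(docCount, docFreqs.getOrDefault(t, 0L)))
            .sum();
    }

    public static void main(String[] args) {
        Set<String> doc = new HashSet<>(Arrays.asList("search", "engines"));
        List<String> query = Arrays.asList("engines", "solutions");
        System.out.println(matches(query, doc, false)); // prints true  (or)
        System.out.println(matches(query, doc, true));  // prints false (and)

        Map<String, Long> docFreqs = new HashMap<>();
        docFreqs.put("engines", 40L);   // hypothetical document frequencies
        docFreqs.put("solutions", 10L);
        System.out.println(score(query, doc, docFreqs, 1000));
    }
}
```

With these weights, a single term that occurs in 1 of 1,000 documents contributes more than two terms that each occur in roughly half of them, which is the ranking effect described above.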

3.2. Fuzziness

When the user makes a typo in a word, it is still possible to match it with a search by specifying a fuzziness parameter, which allows inexact matching.

For string fields, fuzziness means the edit distance: the number of one-character changes that need to be made to one string to make it the same as another string.

NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchQuery("title", "spring date elasticsearch")
    .operator(Operator.AND)
    .fuzziness(Fuzziness.ONE)
    .prefixLength(3))
  .build();

The prefix_length parameter is used to improve performance. In this case, we require that the first three characters should match exactly, which reduces the number of possible combinations.

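The interplay of edit distance and prefix length can be sketched in plain Java. This is a simplified model, not the engine's implementation: it uses the classic Levenshtein distance, the class FuzzyMatchSketch and its methods are hypothetical, and terms shorter than the prefix length are assumed to match only when they are identical.

```java
// Simplified model of fuzzy term matching with a required exact prefix (assumption).
class FuzzyMatchSketch {

    // Classic Levenshtein distance: insertions, deletions and substitutions each cost 1.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // The first prefixLength characters must be identical; only then is the
    // (more expensive) edit distance computed and compared with maxEdits.
    static boolean fuzzyMatches(String queryTerm, String indexedTerm, int maxEdits, int prefixLength) {
        if (queryTerm.length() < prefixLength || indexedTerm.length() < prefixLength) {
            return queryTerm.equals(indexedTerm);
        }
        if (!queryTerm.regionMatches(0, indexedTerm, 0, prefixLength)) {
            return false;
        }
        return editDistance(queryTerm, indexedTerm) <= maxEdits;
    }

    public static void main(String[] args) {
        System.out.println(editDistance("date", "data"));       // prints 1
        System.out.println(fuzzyMatches("date", "data", 1, 3)); // prints true
        System.out.println(fuzzyMatches("cata", "data", 1, 3)); // prints false
    }
}
```

In this model the typo “date” still reaches the indexed term “data” because the first three characters agree and only one character differs, while “cata” is rejected by the prefix check before any distance is computed.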

5. Phrase Search

Phrase search is stricter, although you can relax it with the slop parameter. This parameter tells the phrase query how far apart terms are allowed to be while still considering the document a match.

In other words, it represents the number of times you need to move a term in order to make the query and document match:

NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(matchPhraseQuery("title", "spring elasticsearch").slop(1))
  .build();

Here the query matches the document with the title “Spring Data Elasticsearch” because we set the slop to one: the term “data” sits between the two query terms, so a single move is enough to line them up.

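One way to picture slop is to compare each term's position in the document with its position in the phrase. The following plain-Java sketch computes the smallest slop that would let a phrase match; it is a simplified model, not the engine's algorithm, the class SlopSketch and its method are hypothetical, and it assumes that every term occurs at most once in the document.

```java
import java.util.Arrays;
import java.util.List;

// Simplified model of phrase slop (assumption: each term occurs at most once).
class SlopSketch {

    // For every phrase term, compute how far its document position is shifted
    // relative to its phrase position. The spread of those shifts is the number
    // of moves needed. Returns -1 if a phrase term is missing from the document.
    static int requiredSlop(List<String> docTokens, List<String> phraseTokens) {
        if (phraseTokens.isEmpty()) {
            return -1;
        }
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int queryPos = 0; queryPos < phraseTokens.size(); queryPos++) {
            int docPos = docTokens.indexOf(phraseTokens.get(queryPos));
            if (docPos < 0) {
                return -1;
            }
            int shift = docPos - queryPos;
            min = Math.min(min, shift);
            max = Math.max(max, shift);
        }
        return max - min;
    }

    public static void main(String[] args) {
        List<String> title = Arrays.asList("spring", "data", "elasticsearch");
        System.out.println(requiredSlop(title, Arrays.asList("spring", "elasticsearch"))); // prints 1
        System.out.println(requiredSlop(title, Arrays.asList("spring", "data")));          // prints 0
    }
}
```

In this model an exact phrase needs a slop of zero, one intervening word needs a slop of one, and swapping two adjacent words needs a slop of two.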

6. Multi Match Query

When you want to search in multiple fields, you can use QueryBuilders#multiMatchQuery() and specify all the fields to match:

NativeSearchQuery searchQuery = new NativeSearchQueryBuilder()
  .withQuery(multiMatchQuery("tutorial")
    .field("title")
    .field("tags")
    .type(MultiMatchQueryBuilder.Type.BEST_FIELDS))
  .build();

Here we search the title and tags fields for a match.

Notice that here we use the “best fields” scoring strategy. It takes the maximum score among the matched fields as the document score.

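The “best fields” idea reduces to taking a maximum, as this small plain-Java sketch shows. The class BestFieldsSketch, its method and the per-field scores are hypothetical, and the sketch assumes no tie breaker is configured.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of "best fields" scoring without a tie breaker (assumption).
class BestFieldsSketch {

    // The document score is the highest score among the fields that matched;
    // a document with no matching field scores 0.
    static double bestFields(Map<String, Double> fieldScores) {
        return fieldScores.values().stream()
            .mapToDouble(Double::doubleValue)
            .max()
            .orElse(0.0);
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new HashMap<>();
        scores.put("title", 1.2); // hypothetical score of the match in "title"
        scores.put("tags", 0.7);  // hypothetical score of the match in "tags"
        System.out.println(bestFields(scores)); // prints 1.2
    }
}
```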

7. Aggregations

In our Article class we have also defined a tags field, which is non-analyzed. We could easily create a tag cloud by using an aggregation.

Remember that, because the field is non-analyzed, the tags will not be tokenized:

// Bucket the documents by tag and order the buckets by document count, descending.
TermsAggregationBuilder aggregation = AggregationBuilders.terms("top_tags")
  .field("tags")
  .order(BucketOrder.count(false));
SearchSourceBuilder builder = new SearchSourceBuilder().aggregation(aggregation);

SearchRequest searchRequest = new SearchRequest().indices("blog").source(builder);
SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);

// Read the result through the Terms interface rather than a concrete implementation class.
Map<String, Aggregation> results = response.getAggregations().asMap();
Terms topTags = (Terms) results.get("top_tags");

List<String> keys = topTags.getBuckets()
  .stream()
  .map(b -> b.getKeyAsString())
  .collect(toList());
assertEquals(asList("elasticsearch", "spring data", "search engines", "tutorial"), keys);
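
What the terms aggregation computes can be mimicked in plain Java, which may help when reasoning about the expected bucket order. This is only a sketch of the idea, not the Elasticsearch API: the class TagCloudSketch and its method are hypothetical, each article is represented as a list of whole, untokenized tags, and ties are broken alphabetically as an assumption to keep the output deterministic.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Simplified model of a terms aggregation over a not-analyzed "tags" field (assumption).
class TagCloudSketch {

    // Counts in how many articles each tag appears, then orders the buckets
    // by count (descending) and by tag name for equal counts.
    static Map<String, Long> topTags(List<List<String>> tagsPerArticle) {
        Map<String, Long> counts = tagsPerArticle.stream()
            .flatMap(tags -> tags.stream().distinct())
            .collect(Collectors.groupingBy(tag -> tag, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.<String, Long>comparingByKey()))
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        List<List<String>> articles = Arrays.asList(
            Arrays.asList("elasticsearch", "spring data"),
            Arrays.asList("elasticsearch", "tutorial"),
            Arrays.asList("elasticsearch", "spring data"));
        System.out.println(topTags(articles).keySet());
    }
}
```

Because the tags are kept whole, “spring data” stays a single bucket instead of being split into “spring” and “data”.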

8. Summary

In this article, we discussed the difference between analyzed and non-analyzed fields, and how this distinction affects search.

We also learned about several types of queries provided by Elasticsearch, such as the match query (with its boolean operators and fuzziness), the phrase match query, and the multi-match query, and saw how a terms aggregation can produce a tag cloud.

Elasticsearch provides many other types of queries, such as geo queries, script queries and compound queries. You can read about them in the Elasticsearch documentation and explore the Spring Data Elasticsearch API in order to use these queries in your code.

You can find a project containing the examples used in this article in the GitHub repository.
