Quick Intro to Full-Text Search with ElasticSearch

Last modified: February 10, 2017

1. Overview

Full-text search performs linguistic searches against documents. A query can contain one or more words or phrases, and the search returns the documents that match the search condition.

ElasticSearch is a search engine based on Apache Lucene, a free and open-source information retrieval software library. It provides a distributed, full-text search engine with an HTTP web interface and schema-free JSON documents.

This article examines the ElasticSearch REST API and demonstrates basic operations using HTTP requests only.

2. Setup

In order to install ElasticSearch on your machine, please refer to the official setup guide.

The RESTful API runs on port 9200. Let's test whether it is running properly with the following curl command:

curl -XGET 'http://localhost:9200/'

If you see the following response, the instance is running properly:

{
  "name": "NaIlQWU",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "enkBkWqqQrS0vp_NXmjQMQ",
  "version": {
    "number": "5.1.2",
    "build_hash": "c8c4c16",
    "build_date": "2017-01-11T20:18:39.146Z",
    "build_snapshot": false,
    "lucene_version": "6.3.0"
  },
  "tagline": "You Know, for Search"
}

3. Indexing Documents

ElasticSearch is document oriented. It stores and indexes documents. Indexing creates or updates documents. After indexing, you can search, sort, and filter complete documents—not rows of columnar data. This is a fundamentally different way of thinking about data and is one of the reasons ElasticSearch can perform a complex full-text search.

Documents are represented as JSON objects. JSON serialization is supported by most programming languages and has become the standard format used by the NoSQL movement. It is simple, concise, and easy to read.

We are going to use the following random entries to perform our full-text search:

{
  "title": "He went",
  "random_text": "He went such dare good fact. The small own seven saved man age."
}

{
  "title": "He oppose",
  "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
}

{
  "title": "Repulsive questions",
  "random_text": "Repulsive questions contented him few extensive supported."
}

{
  "title": "Old education",
  "random_text": "Old education him departure any arranging one prevailed."
}

Before we can index a document, we need to decide where to store it. In the 5.x release used here, a cluster can have multiple indexes, and each index can contain multiple types. Each type holds multiple documents, and each document has multiple fields.

We are going to store our documents using the following scheme:

text: The index name.
article: The type name.
id: The ID of this particular example text-entry.

To add a document we are going to run the following command:

curl -XPUT 'localhost:9200/text/article/1?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "title": "He went",
  "random_text": 
    "He went such dare good fact. The small own seven saved man age."
}'

Here we are using id=1; we can add the other entries with the same command and an incremented id.
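
Repeating the command by hand for every entry gets tedious, so here is a minimal Python sketch of the same PUT requests, using only the standard library. It assumes the local node and the text/article scheme described above; the helper names build_index_request and index_all are hypothetical and not part of any ElasticSearch client.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, as in the curl examples

# The four sample entries from this section, in id order (1 to 4).
DOCUMENTS = [
    {"title": "He went",
     "random_text": "He went such dare good fact. The small own seven saved man age."},
    {"title": "He oppose",
     "random_text": "He oppose at thrown desire of no. Announcing impression "
                    "unaffected day his are unreserved indulgence."},
    {"title": "Repulsive questions",
     "random_text": "Repulsive questions contented him few extensive supported."},
    {"title": "Old education",
     "random_text": "Old education him departure any arranging one prevailed."},
]


def build_index_request(doc_id, document):
    """Build (but do not send) the PUT request that indexes one document."""
    url = f"{BASE_URL}/text/article/{doc_id}?pretty"
    return urllib.request.Request(
        url,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def index_all(documents):
    """Send one PUT per document with ids 1, 2, 3, ... (needs a running node)."""
    for doc_id, document in enumerate(documents, start=1):
        request = build_index_request(doc_id, document)
        with urllib.request.urlopen(request) as response:
            print(response.status, response.read().decode("utf-8"))
```

Calling index_all(DOCUMENTS) against a running instance should index the four entries under ids 1 through 4, which is the numbering the search results later in this article rely on.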

4. Retrieving Documents

After adding all our documents, we can check how many documents the cluster holds with the following command:

curl -XGET 'http://localhost:9200/_count?pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  }
}'

Also, we can get a document using its id with the following command:

curl -XGET 'localhost:9200/text/article/1?pretty'

We should get the following response from ElasticSearch:

{
  "_index": "text",
  "_type": "article",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "title": "He went",
    "random_text": 
      "He went such dare good fact. The small own seven saved man age."
  }
}

As we can see, this response corresponds to the entry we added with id 1.

5. Querying Documents

OK, let’s perform a full-text search with the following command:

curl -XGET 'localhost:9200/text/article/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "random_text": "him departure"
    }
  }
}'

And we get the following result:

{
  "took": 32,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.4513469,
    "hits": [
      {
        "_index": "text",
        "_type": "article",
        "_id": "4",
        "_score": 1.4513469,
        "_source": {
          "title": "Old education",
          "random_text": "Old education him departure any arranging one prevailed."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "3",
        "_score": 0.28582606,
        "_source": {
          "title": "Repulsive questions",
          "random_text": "Repulsive questions contented him few extensive supported."
        }
      }
    ]
  }
}

As we can see, we searched for “him departure” and got two results with different scores. The first result is the obvious one: its text contains both search terms, and it has a score of 1.4513469.

The second result is retrieved because the target document contains the word “him”.

By default, ElasticSearch sorts matching results by their relevance score, that is, by how well each document matches the query. Note that the score of the second result is small relative to the first hit, indicating lower relevance.
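
To make the ordering easier to see, here is a small Python sketch that pulls the id, title, and score out of a search response. The helper name summarize_hits is hypothetical, and sample_response is a trimmed copy of the response shown above rather than live output.

```python
def summarize_hits(response):
    """Return (id, title, score) for each hit, in the order the response lists them."""
    return [
        (hit["_id"], hit["_source"]["title"], hit["_score"])
        for hit in response["hits"]["hits"]
    ]


# Trimmed copy of the search response shown above (only the fields we read).
sample_response = {
    "hits": {
        "total": 2,
        "max_score": 1.4513469,
        "hits": [
            {"_id": "4", "_score": 1.4513469,
             "_source": {"title": "Old education"}},
            {"_id": "3", "_score": 0.28582606,
             "_source": {"title": "Repulsive questions"}},
        ],
    }
}

for doc_id, title, score in summarize_hits(sample_response):
    print(f"{doc_id}  {score:.4f}  {title}")
```

The hits are read in the order they arrive, so the list reflects the relevance ranking ElasticSearch already applied; the sketch does not re-sort anything.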

6. Fuzzy Search

Fuzzy matching treats two words that are “fuzzily” similar as if they were the same word. First, we need to define what we mean by fuzziness.

Elasticsearch supports a maximum edit distance of 2, specified with the fuzziness parameter. The fuzziness parameter can also be set to AUTO, which results in the following maximum edit distances:

  • 0 for strings of one or two characters
  • 1 for strings of three, four, or five characters
  • 2 for strings of more than five characters
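
The AUTO rule above can be written as a tiny Python function. This is only an illustration of the thresholds listed here, not Elasticsearch's own code, and the name auto_max_edits is hypothetical.

```python
def auto_max_edits(term):
    """Maximum edit distance that AUTO allows for a term, per the list above."""
    length = len(term)
    if length <= 2:
        return 0  # one or two characters: must match exactly
    if length <= 5:
        return 1  # three to five characters: one edit allowed
    return 2      # more than five characters: two edits allowed


for term in ("he", "him", "departure"):
    print(term, auto_max_edits(term))
```

For the search terms used in this article, "him" falls in the one-edit bucket and "departure" in the two-edit bucket.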

You may find that an edit distance of 2 returns results that don’t appear to be related.

You may get better results, and better performance, with a maximum fuzziness of 1. Distance here refers to the Levenshtein distance, a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
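
As a concrete illustration of that definition, here is a short Python sketch of the classic dynamic-programming computation of the plain Levenshtein distance. It is meant to show how edits are counted, not to reproduce Elasticsearch's internal implementation.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


print(levenshtein("him", "his"))  # prints 1
print(levenshtein("him", "he"))   # prints 2
```

With a maximum distance of 2, a short term such as "him" is therefore within reach of words like "his" and "he", which is why a high fuzziness on short terms tends to pull in loosely related documents.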

OK, let’s perform our search with fuzziness:

curl -XGET 'localhost:9200/text/article/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "random_text": {
        "query": "him departure",
        "fuzziness": "2"
      }
    }
  }
}'

And here’s the result:

{
  "took": 88,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.5834423,
    "hits": [
      {
        "_index": "text",
        "_type": "article",
        "_id": "4",
        "_score": 1.4513469,
        "_source": {
          "title": "Old education",
          "random_text": "Old education him departure any arranging one prevailed."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "2",
        "_score": 0.41093433,
        "_source": {
          "title": "He oppose",
          "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "Repulsive questions",
          "random_text": "Repulsive questions contented him few extensive supported."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "1",
        "_score": 0.0,
        "_source": {
          "title": "He went",
          "random_text": "He went such dare good fact. The small own seven saved man age."
        }
      }
    ]
  }
}

As we can see, fuzziness gives us more results.

We need to use fuzziness carefully because it tends to retrieve results that look unrelated.
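
One way to keep fuzziness deliberate is to build the query body in code and make the fuzziness an explicit, optional argument. The following Python sketch only assembles the JSON body used in the curl commands above; the helper name build_match_query is hypothetical.

```python
import json


def build_match_query(field, text, fuzziness=None):
    """Return the body of a match query, optionally with a fuzziness setting."""
    if fuzziness is None:
        # Plain match query, as in the "Querying Documents" section.
        return {"query": {"match": {field: text}}}
    # Fuzzy variant, as in the "Fuzzy Search" section.
    return {
        "query": {
            "match": {field: {"query": text, "fuzziness": str(fuzziness)}}
        }
    }


plain = build_match_query("random_text", "him departure")
capped = build_match_query("random_text", "him departure", fuzziness=1)
print(json.dumps(capped, indent=2))
```

Passing fuzziness=1 instead of 2 follows the suggestion made earlier in this section; whether it actually improves the results depends on the data.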

7. Conclusion

In this quick tutorial, we focused on indexing documents and querying Elasticsearch for full-text search, directly via its REST API.

Of course, client APIs are available for multiple programming languages when we need them, but the REST API is still quite convenient and language-agnostic.