Introduction to Apache Lucene

Last modified: December 16, 2017

1. Overview

Apache Lucene is a full-text search engine library which can be used from various programming languages.

In this article, we’ll try to understand the core concepts of the library and create a simple application.

2. Maven Setup

To get started, let’s first add the necessary dependencies:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.1.0</version>
</dependency>

The latest version can be found here.

Also, for parsing our search queries, we’ll need:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.1.0</version>
</dependency>

Check for the latest version here.

3. Core Concepts

3.1. Indexing

Simply put, Lucene uses an “inverted index” of the data – instead of mapping pages to keywords, it maps keywords to pages, just like a glossary at the end of any book.

This allows for faster search responses, as it searches through an index, instead of searching through text directly.

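To make the idea concrete, here is a toy inverted index in plain Java (the class and method names are ours, not Lucene’s): each term is mapped to the set of document ids that contain it, so a lookup is a single map access instead of a scan over every document.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class InvertedIndexSketch {

    // term -> ids of the documents containing that term
    static Map<String, Set<Integer>> buildIndex(String[] docs) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String term : docs[id].toLowerCase().split("\\W+")) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] docs = {"Herbal tea is good", "Discussing drinking tea"};
        Map<String, Set<Integer>> index = buildIndex(docs);
        // a query is now one map lookup instead of scanning all documents
        System.out.println(index.get("tea")); // [0, 1]
    }
}
```

Lucene’s real index adds analysis, scoring, and persistence on top, but the underlying mapping is the same shape.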

3.2. Documents

Here, a document is a collection of fields, and each field has a value associated with it.

Indices are typically made up of one or more documents, and search results are sets of best-matching documents.

It isn’t always a plain text document; it could also be a database table or a collection.

3.3. Fields

Documents can have field data, where a field is typically a key holding a data value:

title: Goodness of Tea
body: Discussing goodness of drinking herbal tea...

Notice that here title and body are fields and could be searched for together or individually.

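In Lucene code, these two fields could be built like this (a minimal sketch using the sample values above; Field.Store.YES keeps the original value retrievable from search results):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

public class FieldSketch {

    static Document buildSampleDocument() {
        Document doc = new Document();
        // each field is a key (name) holding a data value
        doc.add(new TextField("title", "Goodness of Tea", Field.Store.YES));
        doc.add(new TextField("body", "Discussing goodness of drinking herbal tea...", Field.Store.YES));
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(buildSampleDocument().get("title")); // Goodness of Tea
    }
}
```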

3.4. Analysis

Analysis is the process of converting the given text into smaller, more precise units for the sake of easy searching.

The text goes through various operations of extracting keywords, removing common words and punctuations, changing words to lower case, etc.

For this purpose, there are multiple built-in analyzers:

  1. StandardAnalyzer – analyzes the text based on basic grammar, removes stop words like “a”, “an”, etc., and converts tokens to lowercase
  2. SimpleAnalyzer – breaks the text at non-letter characters and converts tokens to lowercase
  3. WhitespaceAnalyzer – breaks the text at whitespace

There are more analyzers available for us to use and customize as well.

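As a sketch of this behavior, we can run a StandardAnalyzer over a sample sentence and collect its tokens (the tokenize helper is ours; with the default stop-word set of Lucene 7, a word like “A” should be dropped and everything lowercased):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerSketch {

    // runs the analyzer over the text and collects the emitted tokens
    static List<String> tokenize(String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        StandardAnalyzer analyzer = new StandardAnalyzer();
        try (TokenStream stream = analyzer.tokenStream("body", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.end();
        }
        analyzer.close();
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("A Quick brown FOX"));
    }
}
```

Swapping in a WhitespaceAnalyzer instead would keep the stop word and the original casing, splitting on spaces only.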

3.5. Searching

Once an index is built, we can search that index using a Query and an IndexSearcher. The search result is typically a result set, containing the retrieved data.

Note that an IndexWriter is responsible for creating the index and an IndexSearcher for searching the index.

3.6. Query Syntax

Lucene provides a very dynamic and easy-to-write query syntax.

To search a free text, we’d just use a text String as the query.

To search a text in a particular field, we’d use:

fieldName:text

e.g.: title:tea

Range searches:

timestamp:[1509909322 TO 1572981321]

We can also search using wildcards:

dri?nk

would search for a single character in place of the wildcard “?”

d*k

searches for words starting with “d” and ending in “k”, with multiple characters in between.

uni*

will find words starting with “uni”.

We may also combine these queries to create more complex ones, and include logical operators like AND, NOT, OR:

title: "Tea in breakfast" AND "coffee"

More about query syntax here.

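Using the lucene-queryparser dependency added earlier, these query strings can be turned into Query objects. A small sketch (the parseToString helper is ours, for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;

public class QuerySyntaxSketch {

    // parses a query string against a default field and shows Lucene's view of it
    static String parseToString(String queryString) throws ParseException {
        QueryParser parser = new QueryParser("body", new StandardAnalyzer());
        return parser.parse(queryString).toString();
    }

    public static void main(String[] args) throws ParseException {
        System.out.println(parseToString("title:tea"));                            // field search
        System.out.println(parseToString("dri?nk"));                               // single-character wildcard
        System.out.println(parseToString("uni*"));                                 // prefix wildcard
        System.out.println(parseToString("timestamp:[1509909322 TO 1572981321]")); // range search
        System.out.println(parseToString("title:\"Tea in breakfast\" AND coffee"));
    }
}
```

The first constructor argument of QueryParser is the default field, used whenever a query term carries no field prefix.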

4. A Simple Application

Let’s create a simple application, and index some documents.

First, we’ll create an in-memory index, and add some documents to it:

...
Directory memoryIndex = new RAMDirectory();
StandardAnalyzer analyzer = new StandardAnalyzer();
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(memoryIndex, indexWriterConfig);
Document document = new Document();

document.add(new TextField("title", title, Field.Store.YES));
document.add(new TextField("body", body, Field.Store.YES));

writer.addDocument(document);
writer.close();

Here, we create a document with TextField fields and add them to the index using the IndexWriter. The third argument in the TextField constructor indicates whether the value of the field should also be stored.

Analyzers are used to split the data or text into chunks and then filter out the stop words from them. Stop words are words like ‘a’, ‘am’, ‘is’, etc., and they depend entirely on the given language.

Next, let’s create a search query and search the index for the added document:

public List<Document> searchIndex(String inField, String queryString) {
    Query query = new QueryParser(inField, analyzer)
      .parse(queryString);

    IndexReader indexReader = DirectoryReader.open(memoryIndex);
    IndexSearcher searcher = new IndexSearcher(indexReader);
    TopDocs topDocs = searcher.search(query, 10);
    List<Document> documents = new ArrayList<>();
    for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        documents.add(searcher.doc(scoreDoc.doc));
    }

    return documents;
}

In the search() method, the second argument indicates how many of the top search results should be returned.

Now let’s test it:

@Test
public void givenSearchQueryWhenFetchedDocumentThenCorrect() {
    InMemoryLuceneIndex inMemoryLuceneIndex 
      = new InMemoryLuceneIndex(new RAMDirectory(), new StandardAnalyzer());
    inMemoryLuceneIndex.indexDocument("Hello world", "Some hello world");
    
    List<Document> documents 
      = inMemoryLuceneIndex.searchIndex("body", "world");
    
    assertEquals(
      "Hello world", 
      documents.get(0).get("title"));
}

Here, we add a simple document to the index, with two fields, ‘title’ and ‘body’, and then try to search for it using a search query.

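The tests in this article rely on an InMemoryLuceneIndex helper class that bundles the indexing and searching snippets shown above; its full implementation isn’t shown here. One possible sketch, with the overloads the later tests call, could look like this (the exception handling and the extra SortedDocValuesField, needed for sorting on the title field later, are our own choices):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.SortedDocValuesField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

public class InMemoryLuceneIndex {

    private final Directory memoryIndex;
    private final Analyzer analyzer;

    public InMemoryLuceneIndex(Directory memoryIndex, Analyzer analyzer) {
        this.memoryIndex = memoryIndex;
        this.analyzer = analyzer;
    }

    public void indexDocument(String title, String body) {
        try (IndexWriter writer = new IndexWriter(memoryIndex, new IndexWriterConfig(analyzer))) {
            Document document = new Document();
            document.add(new TextField("title", title, Field.Store.YES));
            document.add(new TextField("body", body, Field.Store.YES));
            // doc values make the title field sortable (see the sorting section)
            document.add(new SortedDocValuesField("title", new BytesRef(title)));
            writer.addDocument(document);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public List<Document> searchIndex(String inField, String queryString) {
        try {
            Query query = new QueryParser(inField, analyzer).parse(queryString);
            return searchIndex(query);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public List<Document> searchIndex(Query query) {
        try (IndexReader reader = DirectoryReader.open(memoryIndex)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            return toDocuments(searcher, searcher.search(query, 10));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public List<Document> searchIndex(Query query, Sort sort) {
        try (IndexReader reader = DirectoryReader.open(memoryIndex)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            return toDocuments(searcher, searcher.search(query, 10, sort));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void deleteDocument(Term term) {
        try (IndexWriter writer = new IndexWriter(memoryIndex, new IndexWriterConfig(analyzer))) {
            writer.deleteDocuments(term);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    private static List<Document> toDocuments(IndexSearcher searcher, TopDocs topDocs) throws IOException {
        List<Document> documents = new ArrayList<>();
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            documents.add(searcher.doc(scoreDoc.doc));
        }
        return documents;
    }
}
```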

6. Lucene Queries

As we are now comfortable with the basics of indexing and searching, let us dig a little deeper.

In earlier sections, we’ve seen the basic query syntax, and how to convert that into a Query instance using the QueryParser.

Lucene provides various concrete implementations as well:

6.1. TermQuery

A Term is a basic unit for search, containing the field name together with the text to be searched for.

TermQuery is the simplest of all queries consisting of a single term:

@Test
public void givenTermQueryWhenFetchedDocumentThenCorrect() {
    InMemoryLuceneIndex inMemoryLuceneIndex 
      = new InMemoryLuceneIndex(new RAMDirectory(), new StandardAnalyzer());
    inMemoryLuceneIndex.indexDocument("activity", "running in track");
    inMemoryLuceneIndex.indexDocument("activity", "Cars are running on road");

    Term term = new Term("body", "running");
    Query query = new TermQuery(term);

    List<Document> documents = inMemoryLuceneIndex.searchIndex(query);
    assertEquals(2, documents.size());
}

6.2. PrefixQuery

To search for documents containing a word that starts with a given prefix:

@Test
public void givenPrefixQueryWhenFetchedDocumentThenCorrect() {
    InMemoryLuceneIndex inMemoryLuceneIndex 
      = new InMemoryLuceneIndex(new RAMDirectory(), new StandardAnalyzer());
    inMemoryLuceneIndex.indexDocument("article", "Lucene introduction");
    inMemoryLuceneIndex.indexDocument("article", "Introduction to Lucene");

    Term term = new Term("body", "intro");
    Query query = new PrefixQuery(term);

    List<Document> documents = inMemoryLuceneIndex.searchIndex(query);
    assertEquals(2, documents.size());
}

6.3. WildcardQuery

As the name suggests, we can use wildcards “*” or “?” for searching:

// ...
Term term = new Term("body", "intro*");
Query query = new WildcardQuery(term);
// ...

6.4. PhraseQuery

It’s used to search for a sequence of words in a document:

// ...
inMemoryLuceneIndex.indexDocument(
  "quotes", 
  "A rose by any other name would smell as sweet.");

Query query = new PhraseQuery(
  1, "body", new BytesRef("smell"), new BytesRef("sweet"));

List<Document> documents = inMemoryLuceneIndex.searchIndex(query);
// ...

Notice that the first argument of the PhraseQuery constructor is called slop, which is the distance, in number of words, between the terms to be matched.

6.5. FuzzyQuery

We can use this when searching for something similar, but not necessarily identical:

// ...
inMemoryLuceneIndex.indexDocument("article", "Halloween Festival");
inMemoryLuceneIndex.indexDocument("decoration", "Decorations for Halloween");

Term term = new Term("body", "hallowen");
Query query = new FuzzyQuery(term);

List<Document> documents = inMemoryLuceneIndex.searchIndex(query);
// ...

We tried searching for the text “Halloween”, but with the misspelling “hallowen”.

6.6. BooleanQuery

Sometimes we might need to execute complex searches, combining two or more different types of queries:

// ...
inMemoryLuceneIndex.indexDocument("Destination", "Las Vegas singapore car");
inMemoryLuceneIndex.indexDocument("Commutes in singapore", "Bus Car Bikes");

Term term1 = new Term("body", "singapore");
Term term2 = new Term("body", "car");

TermQuery query1 = new TermQuery(term1);
TermQuery query2 = new TermQuery(term2);

BooleanQuery booleanQuery 
  = new BooleanQuery.Builder()
    .add(query1, BooleanClause.Occur.MUST)
    .add(query2, BooleanClause.Occur.MUST)
    .build();
// ...

7. Sorting Search Results

We may also sort the search result documents based on certain fields:

@Test
public void givenSortFieldWhenSortedThenCorrect() {
    InMemoryLuceneIndex inMemoryLuceneIndex 
      = new InMemoryLuceneIndex(new RAMDirectory(), new StandardAnalyzer());
    inMemoryLuceneIndex.indexDocument("Ganges", "River in India");
    inMemoryLuceneIndex.indexDocument("Mekong", "This river flows in south Asia");
    inMemoryLuceneIndex.indexDocument("Amazon", "Rain forest river");
    inMemoryLuceneIndex.indexDocument("Rhine", "Belongs to Europe");
    inMemoryLuceneIndex.indexDocument("Nile", "Longest River");

    Term term = new Term("body", "river");
    Query query = new WildcardQuery(term);

    SortField sortField 
      = new SortField("title", SortField.Type.STRING_VAL, false);
    Sort sortByTitle = new Sort(sortField);

    List<Document> documents 
      = inMemoryLuceneIndex.searchIndex(query, sortByTitle);
    assertEquals(4, documents.size());
    assertEquals("Amazon", documents.get(0).getField("title").stringValue());
}

We sorted the fetched documents by the title field, which holds the names of the rivers. The boolean argument to the SortField constructor is for reversing the sort order.

8. Remove Documents from Index

Let’s try to remove some documents from the index based on a given Term:

// ...
IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(memoryIndex, indexWriterConfig);
writer.deleteDocuments(term);
// ...

We’ll test this:

@Test
public void whenDocumentDeletedThenCorrect() {
    InMemoryLuceneIndex inMemoryLuceneIndex 
      = new InMemoryLuceneIndex(new RAMDirectory(), new StandardAnalyzer());
    inMemoryLuceneIndex.indexDocument("Ganges", "River in India");
    inMemoryLuceneIndex.indexDocument("Mekong", "This river flows in south Asia");

    Term term = new Term("title", "ganges");
    inMemoryLuceneIndex.deleteDocument(term);

    Query query = new TermQuery(term);

    List<Document> documents = inMemoryLuceneIndex.searchIndex(query);
    assertEquals(0, documents.size());
}

9. Conclusion

This article was a quick introduction to getting started with Apache Lucene. Also, we executed various queries and sorted the retrieved documents.

As always, the code for the examples can be found over on GitHub.
