Guide to Lucene Analyzers

Last modified: August 27, 2018

1. Overview

Lucene Analyzers are used to analyze text while indexing and searching documents.

We mentioned analyzers briefly in our introductory tutorial.

In this tutorial, we’ll discuss commonly used analyzers, how to construct our own custom analyzer, and how to assign different analyzers to different document fields.

2. Maven Dependencies

First, we need to add these dependencies to our pom.xml:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>7.4.0</version>
</dependency>

The latest Lucene version can be found here.

3. Lucene Analyzer

Lucene Analyzers split the text into tokens.

Analyzers mainly consist of tokenizers and filters. Different analyzers consist of different combinations of tokenizers and filters.

To demonstrate the difference between commonly used analyzers, we’ll use the following method:

public List<String> analyze(String text, Analyzer analyzer) throws IOException {
    List<String> result = new ArrayList<String>();
    TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME, text);
    CharTermAttribute attr = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        result.add(attr.toString());
    }
    // release the stream so the analyzer can produce a new one for the next call
    tokenStream.end();
    tokenStream.close();
    return result;
}

This method converts a given text into a list of tokens using the given analyzer.

4. Common Lucene Analyzers 

Now, let’s have a look at some commonly used Lucene analyzers.

4.1. StandardAnalyzer

We’ll start with the StandardAnalyzer which is the most commonly used analyzer:

private static final String SAMPLE_TEXT
  = "This is baeldung.com Lucene Analyzers test";

@Test
public void whenUseStandardAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new StandardAnalyzer());

    assertThat(result,
      contains("baeldung.com", "lucene", "analyzers", "test"));
}

Note that the StandardAnalyzer can recognize URLs and emails.

Also, it removes stop words and lowercases the generated tokens.

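If we need different behavior, the stop word set can be customized. Here’s a rough sketch of ours (not from the original test class), assuming the Lucene 7.4 StandardAnalyzer(CharArraySet) constructor, where an empty set disables stop word removal:

@Test
public void whenUseStandardAnalyzerWithEmptyStopWords_thenStopWordsKept() throws IOException {
    // CharArraySet.EMPTY_SET means no stop words are filtered out
    Analyzer analyzer = new StandardAnalyzer(CharArraySet.EMPTY_SET);
    List<String> result = analyze(SAMPLE_TEXT, analyzer);

    assertThat(result,
      contains("this", "is", "baeldung.com", "lucene", "analyzers", "test"));
}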

4.2. StopAnalyzer

The StopAnalyzer consists of LetterTokenizer, LowerCaseFilter, and StopFilter:

@Test
public void whenUseStopAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new StopAnalyzer());

    assertThat(result, 
      contains("baeldung", "com", "lucene", "analyzers", "test"));
}

In this example, the LetterTokenizer splits text by non-letter characters, while the StopFilter removes stop words from the token list.

However, unlike the StandardAnalyzer, StopAnalyzer isn’t able to recognize URLs.

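To make the tokenizer-plus-filters composition concrete, here’s a rough equivalent wired up by hand (a sketch of ours, assuming the Lucene 7.4 classes named above) that should tokenize the same way as the StopAnalyzer:

Analyzer handRolledStopAnalyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        // LetterTokenizer splits on non-letter characters,
        // so "baeldung.com" becomes "baeldung" and "com"
        Tokenizer source = new LetterTokenizer();
        TokenStream stream = new LowerCaseFilter(source);
        // drop English stop words such as "this" and "is"
        stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return new TokenStreamComponents(source, stream);
    }
};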

4.3. SimpleAnalyzer

The SimpleAnalyzer consists of a LetterTokenizer and a LowerCaseFilter:

@Test
public void whenUseSimpleAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new SimpleAnalyzer());

    assertThat(result, 
      contains("this", "is", "baeldung", "com", "lucene", "analyzers", "test"));
}

Here, the SimpleAnalyzer didn’t remove the stop words, and it doesn’t recognize URLs either.

4.4. WhitespaceAnalyzer

The WhitespaceAnalyzer uses only a WhitespaceTokenizer which splits text by whitespace characters:

@Test
public void whenUseWhiteSpaceAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new WhitespaceAnalyzer());

    assertThat(result, 
      contains("This", "is", "baeldung.com", "Lucene", "Analyzers", "test"));
}

4.5. KeywordAnalyzer

The KeywordAnalyzer tokenizes input into a single token:

@Test
public void whenUseKeywordAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new KeywordAnalyzer());

    assertThat(result, contains("This is baeldung.com Lucene Analyzers test"));
}

The KeywordAnalyzer is useful for fields like IDs and zip codes.

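For instance, a zipcode-style value (made up here purely for illustration) passes through with its case and punctuation untouched, which is exactly what we want for exact-match lookups:

@Test
public void whenUseKeywordAnalyzerOnIdLikeValue_thenValueKeptIntact() throws IOException {
    // the whole value survives as a single, unmodified token
    List<String> result = analyze("NY-10001", new KeywordAnalyzer());

    assertThat(result, contains("NY-10001"));
}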

4.6. Language Analyzers

There are also special analyzers for different languages like EnglishAnalyzer, FrenchAnalyzer, and SpanishAnalyzer:

@Test
public void whenUseEnglishAnalyzer_thenAnalyzed() throws IOException {
    List<String> result = analyze(SAMPLE_TEXT, new EnglishAnalyzer());

    assertThat(result, contains("baeldung.com", "lucen", "analyz", "test"));
}

Here, we’re using the EnglishAnalyzer which consists of StandardTokenizer, StandardFilter, EnglishPossessiveFilter, LowerCaseFilter, StopFilter, and PorterStemFilter.

5. Custom Analyzer 

Next, let’s see how to build our custom analyzer. We’ll build the same custom analyzer in two different ways.

In the first example, we’ll use the CustomAnalyzer builder to construct our analyzer from predefined tokenizers and filters:

@Test
public void whenUseCustomAnalyzerBuilder_thenAnalyzed() throws IOException {
    Analyzer analyzer = CustomAnalyzer.builder()
      .withTokenizer("standard")
      .addTokenFilter("lowercase")
      .addTokenFilter("stop")
      .addTokenFilter("porterstem")
      .addTokenFilter("capitalization")
      .build();
    List<String> result = analyze(SAMPLE_TEXT, analyzer);

    assertThat(result, contains("Baeldung.com", "Lucen", "Analyz", "Test"));
}

Our analyzer is very similar to EnglishAnalyzer, but it capitalizes the tokens instead.

In the second example, we’ll build the same analyzer by extending the Analyzer abstract class and overriding the createComponents() method:

public class MyCustomAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer src = new StandardTokenizer();
        TokenStream result = new StandardFilter(src);
        result = new LowerCaseFilter(result);
        result = new StopFilter(result, StandardAnalyzer.STOP_WORDS_SET);
        result = new PorterStemFilter(result);
        result = new CapitalizationFilter(result);
        return new TokenStreamComponents(src, result);
    }
}

We can also create our custom tokenizer or filter and add it to our custom analyzer if needed.

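As a rough illustration (the class name and the length rule here are ours, not part of Lucene), a hand-written filter only needs to extend TokenFilter and override incrementToken(); it could then be chained into createComponents() like any built-in filter:

public class MinLengthTokenFilter extends TokenFilter {

    private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
    private final int minLength;

    public MinLengthTokenFilter(TokenStream input, int minLength) {
        super(input);
        this.minLength = minLength;
    }

    @Override
    public boolean incrementToken() throws IOException {
        // pull tokens from the wrapped stream, keeping only those long enough
        while (input.incrementToken()) {
            if (termAttr.length() >= minLength) {
                return true;
            }
        }
        return false;
    }
}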

Now, let’s see our custom analyzer in action – we’ll use InMemoryLuceneIndex in this example:

@Test
public void givenTermQuery_whenUseCustomAnalyzer_thenCorrect() {
    InMemoryLuceneIndex luceneIndex = new InMemoryLuceneIndex(
      new RAMDirectory(), new MyCustomAnalyzer());
    luceneIndex.indexDocument("introduction", "introduction to lucene");
    luceneIndex.indexDocument("analyzers", "guide to lucene analyzers");
    Query query = new TermQuery(new Term("body", "Introduct"));

    List<Document> documents = luceneIndex.searchIndex(query);
    assertEquals(1, documents.size());
}

6. PerFieldAnalyzerWrapper

Finally, we can assign different analyzers to different fields using PerFieldAnalyzerWrapper.

First, we need to define our analyzerMap to map each analyzer to a specific field:

Map<String,Analyzer> analyzerMap = new HashMap<>();
analyzerMap.put("title", new MyCustomAnalyzer());
analyzerMap.put("body", new EnglishAnalyzer());

We mapped the “title” to our custom analyzer and the “body” to the EnglishAnalyzer.

Next, let’s create our PerFieldAnalyzerWrapper by providing the analyzerMap and a default Analyzer:

PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
  new StandardAnalyzer(), analyzerMap);

Now, let’s test it:

@Test
public void givenTermQuery_whenUsePerFieldAnalyzerWrapper_thenCorrect() {
    InMemoryLuceneIndex luceneIndex = new InMemoryLuceneIndex(new RAMDirectory(), wrapper);
    luceneIndex.indexDocument("introduction", "introduction to lucene");
    luceneIndex.indexDocument("analyzers", "guide to lucene analyzers");
    
    Query query = new TermQuery(new Term("body", "introduct"));
    List<Document> documents = luceneIndex.searchIndex(query);
    assertEquals(1, documents.size());
    
    query = new TermQuery(new Term("title", "Introduct"));
    documents = luceneIndex.searchIndex(query);
    assertEquals(1, documents.size());
}

7. Conclusion

We discussed popular Lucene Analyzers, how to build a custom analyzer, and how to use a different analyzer per field.

The full source code can be found on GitHub.
