Intro to Apache OpenNLP

Last modified: April 25, 2018

1. Overview

Apache OpenNLP is an open source Natural Language Processing Java library.

It features an API for use cases like Named Entity Recognition, Sentence Detection, POS tagging and Tokenization.

In this tutorial, we’ll have a look at how to use this API for different use cases.

2. Maven Setup

First, we need to add the main dependency to our pom.xml:

<dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>1.8.4</version>
</dependency>

The latest stable version can be found over on Maven Central.

Some use cases need trained models. You can download pre-defined models here and detailed information about these models here.

3. Sentence Detection

Let’s start with understanding what a sentence is.

Sentence detection is about identifying the start and the end of a sentence, which usually depends on the language at hand. This is also called “Sentence Boundary Disambiguation” (SBD). 

In some cases, sentence detection is quite challenging because of the ambiguous nature of the period character. A period usually denotes the end of a sentence but can also appear in an email address, an abbreviation, a decimal, and a lot of other places.

As with most NLP tasks, sentence detection requires a trained model as input, which we expect to reside in the /resources folder.

To implement sentence detection, we load the model and pass it into an instance of SentenceDetectorME. Then, we simply pass a text into the sentDetect() method to split it at the sentence boundaries:

@Test
public void givenEnglishModel_whenDetect_thenSentencesAreDetected() 
  throws Exception {

    String paragraph = "This is a statement. This is another statement." 
      + "Now is an abstract word for time, "
      + "that is always flying. And my email address is google@gmail.com.";

    InputStream is = getClass().getResourceAsStream("/models/en-sent.bin");
    SentenceModel model = new SentenceModel(is);

    SentenceDetectorME sdetector = new SentenceDetectorME(model);

    String[] sentences = sdetector.sentDetect(paragraph);
    assertThat(sentences).contains(
      "This is a statement.",
      "This is another statement.",
      "Now is an abstract word for time, that is always flying.",
      "And my email address is google@gmail.com.");
}

Note: the suffix “ME” is used in many class names in Apache OpenNLP and represents an algorithm that is based on “Maximum Entropy”.

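If we need the positions of the detected sentences rather than the sentence strings themselves, SentenceDetectorME also provides sentPosDetect(). A minimal sketch, reusing the detector and paragraph from the test above:

Span[] spans = sdetector.sentPosDetect(paragraph);
// Each Span holds the start (inclusive) and end (exclusive) character
// offsets of one sentence, e.g. [0..20) for "This is a statement."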

4. Tokenizing

Now that we can divide a corpus of text into sentences, we can start analyzing a sentence in more detail.

The goal of tokenization is to divide a sentence into smaller parts called tokens. Usually, these tokens are words, numbers or punctuation marks.

There’re three types of tokenizers available in OpenNLP.

4.1. Using TokenizerME

In this case, we first need to load the model. We can download the model file from here, put it in the /resources folder and load it from there.

Next, we’ll create an instance of TokenizerME using the loaded model, and use the tokenize() method to perform tokenization on any String:

@Test
public void givenEnglishModel_whenTokenize_thenTokensAreDetected() 
  throws Exception {
 
    InputStream inputStream = getClass()
      .getResourceAsStream("/models/en-token.bin");
    TokenizerModel model = new TokenizerModel(inputStream);
    TokenizerME tokenizer = new TokenizerME(model);
    String[] tokens = tokenizer.tokenize("Baeldung is a Spring Resource.");
 
    assertThat(tokens).contains(
      "Baeldung", "is", "a", "Spring", "Resource", ".");
}

As we can see, the tokenizer has identified all words and the period character as separate tokens. This tokenizer can be used with a custom trained model as well.

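Training such a custom model could look roughly like the sketch below. Here, en-custom.train is a hypothetical file in OpenNLP’s tokenizer training format: one sentence per line, with <SPLIT> markers at token boundaries that aren’t separated by whitespace:

InputStreamFactory in = new MarkableFileInputStreamFactory(
  new File("en-custom.train")); // hypothetical training file
ObjectStream<String> lines = new PlainTextByLineStream(in, StandardCharsets.UTF_8);
ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

// Train a maximum entropy tokenizer model with the default parameters
TokenizerFactory factory = new TokenizerFactory("en", null, false, null);
TokenizerModel customModel = TokenizerME.train(
  samples, factory, TrainingParameters.defaultParams());
TokenizerME customTokenizer = new TokenizerME(customModel);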

4.2. WhitespaceTokenizer

As the name suggests, this tokenizer simply splits the sentence into tokens using whitespace characters as delimiters:

@Test
public void givenWhitespaceTokenizer_whenTokenize_thenTokensAreDetected() 
  throws Exception {
 
    WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize("Baeldung is a Spring Resource.");
 
    assertThat(tokens)
      .contains("Baeldung", "is", "a", "Spring", "Resource.");
}

We can see that the sentence has been split by whitespace, and hence we get “Resource.” (with the period character at the end) as a single token, instead of two different tokens for the word “Resource” and the period character.

4.3. SimpleTokenizer

This tokenizer is a little more sophisticated than WhitespaceTokenizer and splits the sentence into words, numbers, and punctuation marks. It’s the default behavior and doesn’t require any model:

@Test
public void givenSimpleTokenizer_whenTokenize_thenTokensAreDetected() 
  throws Exception {
 
    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer
      .tokenize("Baeldung is a Spring Resource.");
 
    assertThat(tokens)
      .contains("Baeldung", "is", "a", "Spring", "Resource", ".");
}

5. Named Entity Recognition

Now that we have understood tokenization, let’s take a look at a first use case that is based on successful tokenization: named entity recognition (NER).

The goal of NER is to find named entities like people, locations, organizations and other named things in a given text.

OpenNLP uses pre-defined models for person names, date and time, locations, and organizations. We need to load the model using TokenNameFinderModel and pass it into an instance of NameFinderME. Then we can use the find() method to find named entities in a given text:

@Test
public void 
  givenEnglishPersonModel_whenNER_thenPersonsAreDetected() 
  throws Exception {

    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer
      .tokenize("John is 26 years old. His best friend's "  
        + "name is Leonard. He has a sister named Penny.");

    InputStream inputStreamNameFinder = getClass()
      .getResourceAsStream("/models/en-ner-person.bin");
    TokenNameFinderModel model = new TokenNameFinderModel(
      inputStreamNameFinder);
    NameFinderME nameFinderME = new NameFinderME(model);
    List<Span> spans = Arrays.asList(nameFinderME.find(tokens));

    assertThat(spans.toString())
      .isEqualTo("[[0..1) person, [13..14) person, [20..21) person]");
}

As we can see in the assertion, the result is a list of Span objects containing the start and end indices of the tokens which compose named entities in the text.

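If we want the matched names themselves rather than their indices, we can resolve the spans against the token array with Span.spansToStrings():

// Yields the covered tokens, here "John", "Leonard" and "Penny"
String[] names = Span.spansToStrings(spans.toArray(new Span[0]), tokens);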

6. Part-of-Speech Tagging

Another use case that needs a list of tokens as input is part-of-speech tagging.

A part-of-speech (POS) identifies the type of a word. OpenNLP uses the following tags for the different parts-of-speech:

  • NN – noun, singular or mass
  • DT – determiner
  • VB – verb, base form
  • VBD – verb, past tense
  • VBN – verb, past participle
  • VBZ – verb, third person singular present
  • IN – preposition or subordinating conjunction
  • NNP – proper noun, singular
  • TO – the word “to”
  • JJ – adjective

These are the same tags as defined in the Penn Treebank. For a complete list, please refer to this list.

Similar to the NER example, we load the appropriate model and then use POSTaggerME and its method tag() on a set of tokens to tag the sentence:

@Test
public void givenPOSModel_whenPOSTagging_thenPOSAreDetected() 
  throws Exception {
 
    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize("John has a sister named Penny.");

    InputStream inputStreamPOSTagger = getClass()
      .getResourceAsStream("/models/en-pos-maxent.bin");
    POSModel posModel = new POSModel(inputStreamPOSTagger);
    POSTaggerME posTagger = new POSTaggerME(posModel);
    String[] tags = posTagger.tag(tokens);
 
    assertThat(tags).contains("NNP", "VBZ", "DT", "NN", "VBN", "NNP", ".");
}

The tag() method maps the tokens into a list of POS tags. The result in the example is:

  1. “John” – NNP (proper noun)
  2. “has” – VBZ (verb)
  3. “a” – DT (determiner)
  4. “sister” – NN (noun)
  5. “named” – VBN (verb, past participle)
  6. “Penny” – NNP (proper noun)
  7. “.” – period
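
Since POSTaggerME is a statistical tagger, each tag also comes with a probability. Right after calling tag(), we can retrieve these via probs(), aligned with the tags array:

// Probabilities for the tags of the most recently tagged sentence
double[] probs = posTagger.probs();
for (int i = 0; i < tags.length; i++) {
    System.out.println(tokens[i] + "/" + tags[i] + " -> " + probs[i]);
}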

7. Lemmatization

Now that we have the part-of-speech information of the tokens in a sentence, we can analyze the text even further.

Lemmatization is the process of mapping a word form that can have a tense, gender, mood or other information to the base form of the word – also called its “lemma”.

A lemmatizer takes a token and its part-of-speech tag as input and returns the word’s lemma. Hence, before lemmatization, the sentence should be passed through a tokenizer and POS tagger.

Apache OpenNLP provides two types of lemmatization:

  • Statistical – needs a lemmatizer model built using training data for finding the lemma of a given word
  • Dictionary-based – requires a dictionary which contains all valid combinations of a word, POS tags, and the corresponding lemma

For statistical lemmatization, we need to train a model, whereas for the dictionary lemmatization we just need a dictionary file like this one.

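For the statistical variant, we’d load a LemmatizerModel and use LemmatizerME instead; the call pattern is otherwise identical to the dictionary-based approach shown below. A minimal sketch, assuming a hypothetical trained model file en-lemmatizer.bin (OpenNLP doesn’t ship a pre-trained English one):

InputStream modelIn = getClass()
  .getResourceAsStream("/models/en-lemmatizer.bin"); // hypothetical model
LemmatizerModel lemmaModel = new LemmatizerModel(modelIn);
LemmatizerME statisticalLemmatizer = new LemmatizerME(lemmaModel);

// Takes the same tokens and POS tags as input as the dictionary variant
String[] lemmas = statisticalLemmatizer.lemmatize(tokens, tags);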

Let’s look at a code example using a dictionary file:

@Test
public void givenEnglishDictionary_whenLemmatize_thenLemmasAreDetected() 
  throws Exception {

    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize("John has a sister named Penny.");

    InputStream inputStreamPOSTagger = getClass()
      .getResourceAsStream("/models/en-pos-maxent.bin");
    POSModel posModel = new POSModel(inputStreamPOSTagger);
    POSTaggerME posTagger = new POSTaggerME(posModel);
    String[] tags = posTagger.tag(tokens);
    InputStream dictLemmatizer = getClass()
      .getResourceAsStream("/models/en-lemmatizer.dict");
    DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(
      dictLemmatizer);
    String[] lemmas = lemmatizer.lemmatize(tokens, tags);

    assertThat(lemmas)
      .contains("O", "have", "a", "sister", "name", "O", "O");
}

As we can see, we get the lemma for every token. “O” indicates that the lemma could not be determined as the word is a proper noun. So, we don’t have a lemma for “John” and “Penny”.

But we have identified the lemmas for the other words of the sentence:

  • has – have
  • a – a
  • sister – sister
  • named – name

8. Chunking

Part-of-speech information is also essential in chunking – dividing sentences into grammatically meaningful word groups like noun groups or verb groups.

Similar to before, we tokenize a sentence and use part-of-speech tagging on the tokens before calling the chunk() method:

@Test
public void 
  givenChunkerModel_whenChunk_thenChunksAreDetected() 
  throws Exception {

    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize(
      "He reckons the current account deficit "
      + "will narrow to only 8 billion.");

    InputStream inputStreamPOSTagger = getClass()
      .getResourceAsStream("/models/en-pos-maxent.bin");
    POSModel posModel = new POSModel(inputStreamPOSTagger);
    POSTaggerME posTagger = new POSTaggerME(posModel);
    String[] tags = posTagger.tag(tokens);

    InputStream inputStreamChunker = getClass()
      .getResourceAsStream("/models/en-chunker.bin");
    ChunkerModel chunkerModel
     = new ChunkerModel(inputStreamChunker);
    ChunkerME chunker = new ChunkerME(chunkerModel);
    String[] chunks = chunker.chunk(tokens, tags);
    assertThat(chunks).contains(
      "B-NP", "B-VP", "B-NP", "I-NP", 
      "I-NP", "I-NP", "B-VP", "I-VP", 
      "B-PP", "B-NP", "I-NP", "I-NP", "O");
}

As we can see, we get an output for each token from the chunker. “B” represents the start of a chunk, “I” represents the continuation of the chunk and “O” represents no chunk.

Parsing the output from our example, we get 6 chunks:

  1. “He” – noun phrase
  2. “reckons” – verb phrase
  3. “the current account deficit” – noun phrase
  4. “will narrow” – verb phrase
  5. “to” – preposition phrase
  6. “only 8 billion” – noun phrase
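
If the raw BIO tags are inconvenient, ChunkerME can also group them for us: chunkAsSpans() returns one Span per chunk over the token array, and we can resolve those spans to strings the same way as in the NER example:

// One Span per chunk, carrying the chunk type (e.g. NP, VP, PP)
Span[] chunkSpans = chunker.chunkAsSpans(tokens, tags);
// Yields e.g. "He", "reckons", "the current account deficit", ...
String[] phrases = Span.spansToStrings(chunkSpans, tokens);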

9. Language Detection

In addition to the use cases already discussed, OpenNLP also provides a language detection API that allows us to identify the language of a certain text.

For language detection, we need a training data file. Such a file contains lines with sentences in a certain language. Each line is tagged with the correct language to provide input to the machine learning algorithms.

A sample training data file for language detection can be downloaded here.

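Each line of such a file consists of an ISO 639-3 language code, a tab character and a sample sentence in that language, roughly like this (illustrative lines, not actual entries from the file):

pob	estava em uma marcenaria na Rua Bruno
fra	le chat dort sur le canapé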

We can load the training data file into a LanguageDetectorSampleStream, define some training data parameters, create a model and then use the model to detect the language of a text:

@Test
public void 
  givenLanguageDictionary_whenLanguageDetect_thenLanguageIsDetected() 
  throws FileNotFoundException, IOException {
 
    InputStreamFactory dataIn
     = new MarkableFileInputStreamFactory(
       new File("src/main/resources/models/DoccatSample.txt"));
    ObjectStream<String> lineStream = new PlainTextByLineStream(dataIn, "UTF-8");
    LanguageDetectorSampleStream sampleStream
     = new LanguageDetectorSampleStream(lineStream);
    TrainingParameters params = new TrainingParameters();
    params.put(TrainingParameters.ITERATIONS_PARAM, "100");
    params.put(TrainingParameters.CUTOFF_PARAM, "5");
    params.put("DataIndexer", "TwoPass");
    params.put(TrainingParameters.ALGORITHM_PARAM, "NAIVEBAYES");

    LanguageDetectorModel model = LanguageDetectorME
      .train(sampleStream, params, new LanguageDetectorFactory());

    LanguageDetector ld = new LanguageDetectorME(model);
    Language[] languages = ld
      .predictLanguages("estava em uma marcenaria na Rua Bruno");
    assertThat(Arrays.asList(languages))
      .extracting("lang", "confidence")
      .contains(
        tuple("pob", 0.9999999950605625),
        tuple("ita", 4.939427661577956E-9), 
        tuple("spa", 9.665954064665144E-15),
        tuple("fra", 8.250349924885834E-25)));
}

The result is a list of the most probable languages along with a confidence score.

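When only the single best guess matters, LanguageDetector also offers predictLanguage(), which returns just the top-ranked Language:

// Returns only the most probable language, here "pob"
Language best = ld.predictLanguage("estava em uma marcenaria na Rua Bruno");
System.out.println(best.getLang() + ": " + best.getConfidence());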

And, with rich models, we can achieve very high accuracy with this type of detection.

10. Conclusion

We explored the interesting capabilities of OpenNLP here, focusing on features for performing NLP tasks like lemmatization, POS tagging, tokenization, sentence detection, language detection and more.

As always, the complete implementation of all above can be found over on GitHub.
