Removing Stopwords from a String in Java

Last modified: May 19, 2019

1. Overview

In this tutorial, we’ll discuss different ways to remove stopwords from a String in Java. This is a useful operation in cases where we want to remove unwanted or disallowed words from a text, such as comments or reviews added by users of an online site.

We’ll use a simple loop, Collection.removeAll() and regular expressions.

Finally, we’ll compare their performance using the Java Microbenchmark Harness.

2. Loading Stopwords

First, we’ll load our stopwords from a text file.

Here we have the file english_stopwords.txt, which contains a list of words we consider stopwords, such as I, he, she, and the.

We’ll load the stopwords into a List of String using Files.readAllLines():

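// Shared by all tests; populated once before the test class runs
private static List<String> stopwords;

@BeforeClass
public static void loadStopwords() throws IOException {
    stopwords = Files.readAllLines(Paths.get("english_stopwords.txt"));
}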
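
The later examples lowercase the input before comparing it against the list, so the entries in the file need to be lowercase and free of stray whitespace for the lookups to match. As a minimal sketch, assuming one stopword per line, a loader could normalize the entries while reading them. The StopwordLoader class and its load() method are hypothetical names, and a temporary file stands in for english_stopwords.txt:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class StopwordLoader {

    // Hypothetical helper: reads one stopword per line and normalizes each entry
    public static List<String> load(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8).stream()
          .map(String::trim)                           // drop surrounding whitespace
          .filter(line -> !line.isEmpty())             // skip blank lines
          .map(line -> line.toLowerCase(Locale.ROOT))  // match the lowercased input
          .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for english_stopwords.txt
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, Arrays.asList("I", " he ", "", "she", "the"));
        System.out.println(load(file)); // prints [i, he, she, the]
        Files.delete(file);
    }
}
```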

3. Removing Stopwords Manually

For our first solution, we’ll remove stopwords manually by iterating over each word and checking if it’s a stopword:

@Test
public void whenRemoveStopwordsManually_thenSuccess() {
    String original = "The quick brown fox jumps over the lazy dog";
    String target = "quick brown fox jumps lazy dog";
    String[] allWords = original.toLowerCase().split(" ");

    // Keep only the words that are not in the stopword list
    StringBuilder builder = new StringBuilder();
    for (String word : allWords) {
        if (!stopwords.contains(word)) {
            builder.append(word);
            builder.append(' ');
        }
    }

    // trim() drops the trailing space left by the last append
    String result = builder.toString().trim();
    assertEquals(target, result);
}
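
For readers who want to run the loop outside a test class, here is a self-contained sketch of the same idea. It makes a few assumptions that are not in the article: the stopwords are a small hard-coded HashSet rather than the contents of english_stopwords.txt, the input is split on runs of whitespace ("\\s+") so that repeated spaces do not produce empty tokens, and the class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ManualStopwordRemoval {

    // Illustrative stopword set; the article loads its list from a file instead
    static final Set<String> STOPWORDS =
      new HashSet<>(Arrays.asList("i", "he", "she", "the", "over"));

    public static String removeStopwords(String text) {
        List<String> kept = new ArrayList<>();
        // trim() first so leading whitespace does not yield an empty first token
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (!word.isEmpty() && !STOPWORDS.contains(word)) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Note the double space after "quick"
        System.out.println(removeStopwords("The quick  brown fox jumps over the lazy dog"));
        // prints: quick brown fox jumps lazy dog
    }
}
```

Collecting the surviving words into a list and joining them at the end avoids the trailing space that the StringBuilder version has to trim away.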

4. Using Collection.removeAll()

Next, instead of iterating over each word in our String, we can use Collection.removeAll() to remove all stopwords at once:

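@Test
public void whenRemoveStopwordsUsingRemoveAll_thenSuccess() {
    String original = "The quick brown fox jumps over the lazy dog";
    String target = "quick brown fox jumps lazy dog";

    // Collect the words into a resizable list so elements can be removed
    ArrayList<String> allWords =
      Stream.of(original.toLowerCase().split(" "))
            .collect(Collectors.toCollection(ArrayList<String>::new));
    allWords.removeAll(stopwords);

    String result = allWords.stream().collect(Collectors.joining(" "));
    assertEquals(target, result);
}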

In this example, after splitting our String into an array of words, we transform the array into an ArrayList so that we can apply the removeAll() method.
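
The copy into an ArrayList matters because the list returned by Arrays.asList() is a fixed-size view of the array and does not support removal. The following sketch contrasts the two; the class and method names are illustrative, and the stopword list is hard-coded for brevity:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RemoveAllDemo {

    public static String removeStopwords(String text, List<String> stopwords) {
        // ArrayList is resizable, so removeAll() can drop elements
        List<String> words = new ArrayList<>(Arrays.asList(text.toLowerCase().split(" ")));
        words.removeAll(stopwords);
        return String.join(" ", words);
    }

    public static boolean fixedSizeListRejectsRemoval(String text, List<String> stopwords) {
        // Arrays.asList() returns a fixed-size list backed by the array
        List<String> fixed = Arrays.asList(text.toLowerCase().split(" "));
        try {
            fixed.removeAll(stopwords);
            return false;
        } catch (UnsupportedOperationException e) {
            // Thrown as soon as removeAll() tries to remove a matching element
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> stopwords = Arrays.asList("the", "over");
        String text = "The quick brown fox jumps over the lazy dog";
        System.out.println(removeStopwords(text, stopwords));
        System.out.println(fixedSizeListRejectsRemoval(text, stopwords));
    }
}
```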

在这个例子中,在将我们的String分割成一个单词数组后,我们将把它转化为ArrayList,以便能够应用removeAll()方法。

5. Using Regular Expressions

Finally, we can create a regular expression from our stopwords list, then use it to replace stopwords in our String:

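@Test
public void whenRemoveStopwordsUsingRegex_thenSuccess() {
    String original = "The quick brown fox jumps over the lazy dog";
    String target = "quick brown fox jumps lazy dog";

    // Build one alternation that matches any stopword as a whole word
    String stopwordsRegex = stopwords.stream()
      .collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));

    String result = original.toLowerCase().replaceAll(stopwordsRegex, "");
    assertEquals(target, result);
}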

Written as a Java string literal, the resulting stopwordsRegex has the form “\\b(he|she|the|…)\\b\\s?”. In this regex, “\b” denotes a word boundary, which prevents replacing “he” inside “heat”, for example, while “\s?” matches zero or one whitespace character, so that the extra space left behind by a removed stopword is deleted as well.
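
To see the word boundary at work, here is a small standalone sketch. It varies the article's approach in three ways that should be read as assumptions rather than as the original method: each stopword is passed through Pattern.quote() so that any regex metacharacter in an entry is treated literally, the group is non-capturing ("(?:…)"), and the pattern is compiled with CASE_INSENSITIVE instead of lowercasing the input, which preserves the casing of the words that remain. The class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RegexStopwordRemoval {

    public static Pattern buildPattern(List<String> stopwords) {
        String regex = stopwords.stream()
          .map(Pattern::quote) // treat every entry as a literal string
          .collect(Collectors.joining("|", "\\b(?:", ")\\b\\s?"));
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    public static String removeStopwords(String text, Pattern pattern) {
        // trim() removes the space left behind when the last word is a stopword
        return pattern.matcher(text).replaceAll("").trim();
    }

    public static void main(String[] args) {
        Pattern pattern = buildPattern(Arrays.asList("he", "she", "the"));
        // "heat" starts with "he" but is not followed by a word boundary
        System.out.println(removeStopwords("He felt the heat", pattern));
        // prints: felt heat
    }
}
```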

6. Performance Comparison

Now, let’s see which method has the best performance.

First, let’s set up our benchmark. We’ll use a rather large text file, shakespeare-hamlet.txt, as the source of our String:

@Setup
public void setup() throws IOException {
    data = new String(Files.readAllBytes(Paths.get("shakespeare-hamlet.txt")));
    data = data.toLowerCase();
    stopwords = Files.readAllLines(Paths.get("english_stopwords.txt"));
    stopwordsRegex = stopwords.stream().collect(Collectors.joining("|", "\\b(", ")\\b\\s?"));
}

Then we’ll have our benchmark methods, starting with removeManually():

@Benchmark
public String removeManually() {
    String[] allWords = data.split(" ");
    StringBuilder builder = new StringBuilder();
    for(String word : allWords) {
        if(!stopwords.contains(word)) {
            builder.append(word);
            builder.append(' ');
        }
    }
    return builder.toString().trim();
}

Next, we have the removeAll() benchmark:

@Benchmark
public String removeAll() {
    ArrayList<String> allWords = 
      Stream.of(data.split(" "))
            .collect(Collectors.toCollection(ArrayList<String>::new));
    allWords.removeAll(stopwords);
    return allWords.stream().collect(Collectors.joining(" "));
}

Finally, we’ll add the benchmark for replaceRegex():

@Benchmark
public String replaceRegex() {
    return data.replaceAll(stopwordsRegex, "");
}

And here’s the result of our benchmark:

Benchmark                           Mode  Cnt   Score    Error  Units
removeAll                           avgt   60   7.782 ±  0.076  ms/op
removeManually                      avgt   60   8.186 ±  0.348  ms/op
replaceRegex                        avgt   60  42.035 ±  1.098  ms/op

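In this run, Collection.removeAll() had the fastest execution time at 7.782 ms/op, with the manual loop close behind at 8.186 ms/op, while the regular expression approach was by far the slowest at 42.035 ms/op, roughly five times slower than either of the other two.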
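
One detail worth keeping in mind when reading the regex figure: String.replaceAll(regex, replacement) is specified to behave like Pattern.compile(regex).matcher(str).replaceAll(replacement), so the benchmark method compiles the expression on every invocation. The sketch below shows a precompiled Pattern producing the same output. It uses a shortened, hard-coded stopword list and illustrative names, and it makes no claim about how much, if at all, precompiling would change the numbers above, since that was not measured here.

```java
import java.util.regex.Pattern;

public class PrecompiledPatternDemo {

    // Shortened, hard-coded regex for illustration only
    static final String REGEX = "\\b(he|she|the)\\b\\s?";

    // Compiled once and reused; Pattern instances are immutable and thread-safe
    static final Pattern STOPWORDS = Pattern.compile(REGEX);

    public static String viaString(String text) {
        // Compiles REGEX internally on each call
        return text.replaceAll(REGEX, "");
    }

    public static String viaPattern(String text) {
        // Reuses the already compiled pattern
        return STOPWORDS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "she saw the heat";
        System.out.println(viaString(text));  // prints: saw heat
        System.out.println(viaPattern(text)); // prints: saw heat
    }
}
```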

7. Conclusion

In this quick article, we learned different methods to remove stopwords from a String in Java. We also benchmarked them to see which method has the best performance.

The full source code for the examples is available over on GitHub.
