1. Introduction
In this quick tutorial, we’ll see how to check whether a string contains multiple words.
2. Our Example
Let’s suppose we have the string:
String inputString = "hello there, Baeldung";
Our task is to check whether inputString contains both the “hello” and “Baeldung” words.
So, let’s put our keywords into an array:
String[] words = {"hello", "Baeldung"};
Moreover, the order of the words isn’t important, and the matches should be case-sensitive.
3. Using String.contains()
To start, we’ll show how to use the String.contains() method to achieve our goal.
Let’s loop over the keywords array and check whether each item occurs in inputString:
public static boolean containsWords(String inputString, String[] items) {
    for (String item : items) {
        // A single missing keyword is enough to fail the check
        if (!inputString.contains(item)) {
            return false;
        }
    }
    return true;
}
The contains() method returns true if inputString contains the given item. As soon as one of the keywords is missing from the string, we can stop looping and return false immediately.
Although we have to write the loop ourselves, this solution is fast for simple use cases.
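To see how this check behaves, here’s a small self-contained sketch that wraps the same logic in a hypothetical ContainsWordsDemo class (the class name is ours, for illustration only) and runs it on a few inputs. Note that contains() matches substrings, so a keyword embedded in a longer word also counts, and that the comparison is case-sensitive:

```java
final class ContainsWordsDemo {

    // Same check as above: every keyword must occur somewhere in the input
    static boolean containsWords(String inputString, String[] items) {
        for (String item : items) {
            if (!inputString.contains(item)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] words = {"hello", "Baeldung"};

        // Both keywords are present; their order doesn't matter
        System.out.println(containsWords("hello there, Baeldung", words)); // true
        System.out.println(containsWords("Baeldung says hello", words));   // true

        // Substring matches count as well
        System.out.println(containsWords("helloBaeldung", words));         // true

        // Matching is case-sensitive, so "Hello" doesn't satisfy "hello"
        System.out.println(containsWords("Hello there, Baeldung", words)); // false
    }
}
```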
4. Using String.indexOf()
Similar to the solution that uses String.contains(), we can check the indices of the keywords with the String.indexOf() method. For that, we need a method that accepts the inputString and the array of keywords:
public static boolean containsWordsIndexOf(String inputString, String[] words) {
    for (String word : words) {
        // indexOf() returns -1 when the word doesn't occur in the input
        if (inputString.indexOf(word) == -1) {
            return false;
        }
    }
    return true;
}
The indexOf() method returns the index of the first occurrence of the word in inputString. If the word doesn’t occur in the text, it returns -1.
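Because indexOf() returns a position rather than a boolean, the same loop can also report where each keyword first occurs. The following is a minimal sketch of that idea; the class name KeywordPositions and its methods are our own illustrative names, not part of the original example:

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class KeywordPositions {

    // Maps each keyword to the index of its first occurrence, or -1 if it is absent
    static Map<String, Integer> firstPositions(String inputString, String[] words) {
        Map<String, Integer> positions = new LinkedHashMap<>();
        for (String word : words) {
            positions.put(word, inputString.indexOf(word));
        }
        return positions;
    }

    // All keywords are present when no recorded position is -1
    static boolean containsAllWords(String inputString, String[] words) {
        return !firstPositions(inputString, words).containsValue(-1);
    }

    public static void main(String[] args) {
        String inputString = "hello there, Baeldung";

        System.out.println(firstPositions(inputString, new String[] {"hello", "Baeldung"}));
        // prints {hello=0, Baeldung=13}

        System.out.println(firstPositions(inputString, new String[] {"hello", "world"}));
        // prints {hello=0, world=-1}
    }
}
```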
5. Using Regular Expressions
Now, let’s use a regular expression to match our words. For that, we’ll use the Pattern class.
First, let’s define the string expression. As we need to match two keywords, we’ll build our regex rule with two lookaheads:
Pattern pattern = Pattern.compile("(?=.*hello)(?=.*Baeldung)");
And for the general case:
StringBuilder regexp = new StringBuilder();
for (String word : words) {
    // Pattern.quote() makes regex metacharacters in a keyword match literally
    regexp.append("(?=.*").append(Pattern.quote(word)).append(")");
}
After that, we’ll use the matcher() method to find() the occurrences:
public static boolean containsWordsPatternMatch(String inputString, String[] words) {
    StringBuilder regexp = new StringBuilder();
    for (String word : words) {
        // One lookahead per keyword; quoting keeps the keyword literal
        regexp.append("(?=.*").append(Pattern.quote(word)).append(")");
    }
    Pattern pattern = Pattern.compile(regexp.toString());
    return pattern.matcher(inputString).find();
}
However, regular expressions have a performance cost. Each lookahead scans the input on its own, so if we have many words to look up, the performance of this solution might not be optimal.
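Part of that cost comes from compiling the pattern, which the method above does on every call. When the same keywords are checked against many inputs, one option is to compile the pattern once and reuse it. The sketch below shows one possible arrangement; the KeywordPattern class and its containsAll() method are hypothetical names, not part of the original example. It also wraps each keyword in Pattern.quote() so that regex metacharacters are treated literally, and passes Pattern.DOTALL so that .* can cross line breaks:

```java
import java.util.regex.Pattern;

final class KeywordPattern {

    private final Pattern pattern;

    // Builds one lookahead per keyword and compiles the pattern a single time
    KeywordPattern(String... words) {
        StringBuilder regexp = new StringBuilder();
        for (String word : words) {
            regexp.append("(?=.*").append(Pattern.quote(word)).append(")");
        }
        this.pattern = Pattern.compile(regexp.toString(), Pattern.DOTALL);
    }

    // Reuses the compiled pattern for every input
    boolean containsAll(String inputString) {
        return pattern.matcher(inputString).find();
    }

    public static void main(String[] args) {
        KeywordPattern matcher = new KeywordPattern("hello", "Baeldung");
        System.out.println(matcher.containsAll("hello there, Baeldung")); // true
        System.out.println(matcher.containsAll("hello there"));           // false

        // "c++" contains regex metacharacters; quoting makes it a literal keyword
        KeywordPattern quoted = new KeywordPattern("c++", "java");
        System.out.println(quoted.containsAll("java and c++"));           // true
        System.out.println(quoted.containsAll("java and c"));             // false
    }
}
```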
6. Using Java 8 and List
And finally, we can use Java 8’s Stream API. But first, let’s do some minor transformations with our initial data:
List<String> inputStringList = Arrays.asList(inputString.split(" "));
List<String> wordsList = Arrays.asList(words);
Now, it’s time to use the Stream API:
public static boolean containsWordsJava8(String inputString, String[] words) {
    List<String> inputStringList = Arrays.asList(inputString.split(" "));
    List<String> wordsList = Arrays.asList(words);
    return wordsList.stream().allMatch(inputStringList::contains);
}
The operation pipeline above will return true if the input string contains all of our keywords.
Alternatively, we can simply use the containsAll() method of the Collections framework to achieve the desired result:
public static boolean containsWordsArray(String inputString, String[] words) {
    List<String> inputStringList = Arrays.asList(inputString.split(" "));
    List<String> wordsList = Arrays.asList(words);
    return inputStringList.containsAll(wordsList);
}
However, both of these approaches work for whole words only. They find a keyword only if it’s separated from the rest of the text by single spaces, so a keyword followed directly by punctuation, such as “there,”, won’t match.
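To make the limitation concrete, the following sketch compares splitting on a single space with splitting on runs of non-word characters. The class name WholeWordsDemo, the method containsWholeWords(), and the \W+ delimiter are our own choices for illustration; \W+ treats punctuation as a separator, which may or may not fit a given use case:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class WholeWordsDemo {

    // Splits the input on the given regex delimiter and checks that every keyword is a token
    static boolean containsWholeWords(String inputString, String[] words, String delimiterRegex) {
        Set<String> tokens = new HashSet<>(Arrays.asList(inputString.split(delimiterRegex)));
        return tokens.containsAll(Arrays.asList(words));
    }

    public static void main(String[] args) {
        String inputString = "hello there, Baeldung";
        String[] words = {"hello", "there"};

        // Splitting on a space yields the token "there," (with the comma), so "there" isn't found
        System.out.println(containsWholeWords(inputString, words, " "));    // false

        // Splitting on runs of non-word characters yields "hello", "there", "Baeldung"
        System.out.println(containsWholeWords(inputString, words, "\\W+")); // true
    }
}
```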
7. Using the Aho-Corasick Algorithm
Simply put, the Aho-Corasick algorithm is for text searching with multiple keywords. After the keywords are preprocessed into an automaton, it finds all of them in a single pass over the text, so the search time is linear in the text length plus the number of matches, no matter how many keywords we’re searching for.
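To make the idea concrete, here’s a minimal from-scratch sketch of the technique. It’s an illustration only, not the implementation of any library: the class SimpleAhoCorasick and its methods are hypothetical names, it matches substrings rather than whole words, and it favors brevity over performance. The keywords are first inserted into a trie, then failure links are computed breadth-first, and finally the text is scanned in a single pass:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

final class SimpleAhoCorasick {

    // One automaton state: outgoing edges, failure link, and keywords ending here
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> output = new ArrayList<>();
        Node fail;
    }

    private final Node root = new Node();
    private final Set<String> keywords;

    SimpleAhoCorasick(String... keywords) {
        this.keywords = new HashSet<>(Arrays.asList(keywords));
        // Phase 1: insert every keyword into a trie
        for (String keyword : keywords) {
            Node node = root;
            for (char c : keyword.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.output.add(keyword);
        }
        // Phase 2: compute failure links level by level (breadth-first)
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> edge : current.next.entrySet()) {
                Node child = edge.getValue();
                Node fallback = current.fail;
                while (fallback != root && !fallback.next.containsKey(edge.getKey())) {
                    fallback = fallback.fail;
                }
                child.fail = fallback.next.getOrDefault(edge.getKey(), root);
                // Inherit keywords that end at the failure state (they are suffixes)
                child.output.addAll(child.fail.output);
                queue.add(child);
            }
        }
    }

    // Returns the distinct keywords that occur in the text, scanning it once
    Set<String> findKeywords(String text) {
        Set<String> found = new HashSet<>();
        Node node = root;
        for (char c : text.toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            found.addAll(node.output);
        }
        return found;
    }

    // True when every keyword given to the constructor occurs in the text
    boolean containsAll(String text) {
        return findKeywords(text).containsAll(keywords);
    }

    public static void main(String[] args) {
        SimpleAhoCorasick matcher = new SimpleAhoCorasick("hello", "Baeldung");
        System.out.println(matcher.containsAll("hello there, Baeldung")); // true
        System.out.println(matcher.containsAll("hello there"));           // false
    }
}
```

In practice, a maintained library is usually the better choice, since it handles options such as whole-word matching that this sketch leaves out.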
Let’s include the Aho-Corasick algorithm dependency in our pom.xml:
<dependency>
    <groupId>org.ahocorasick</groupId>
    <artifactId>ahocorasick</artifactId>
    <version>0.4.0</version>
</dependency>
First, let’s build a Trie from the words array of keywords, using the library’s builder:
Trie trie = Trie.builder().onlyWholeWords().addKeywords(words).build();
After that, let’s call the parseText() method with the inputString in which we want to find the keywords, and save the results in the emits collection:
Collection<Emit> emits = trie.parseText(inputString);
And finally, if we print our results:
emits.forEach(System.out::println);
For each keyword, we’ll see the start position of the keyword in the text, the ending position, and the keyword itself:
0:4=hello
13:20=Baeldung
Finally, let’s see the complete implementation:
public static boolean containsWordsAhoCorasick(String inputString, String[] words) {
    Trie trie = Trie.builder().onlyWholeWords().addKeywords(words).build();
    Collection<Emit> emits = trie.parseText(inputString);
    emits.forEach(System.out::println);

    // Collect the distinct keywords that were matched
    Set<String> foundKeywords = new HashSet<>();
    for (Emit emit : emits) {
        foundKeywords.add(emit.getKeyword());
    }
    return foundKeywords.containsAll(Arrays.asList(words));
}
In this example, we’re looking for whole words only. So, if we also want to match keywords embedded in longer text, such as “helloBaeldung”, we should simply remove the onlyWholeWords() call from the Trie builder chain.
In addition, keep in mind that the emits collection may contain several entries for the same keyword, as a keyword can occur more than once in the text, so any per-keyword check should ignore the duplicates.
8. Conclusion
In this article, we learned how to find multiple keywords inside a string. We showed examples using the core JDK as well as the Aho-Corasick library.
As usual, the complete code for this article is available over on GitHub.