1. Overview
Searching for a pattern of characters, or a word, in a larger text string is a task that comes up in many fields. In bioinformatics, for example, we may need to find a DNA snippet in a chromosome.
In the media, editors may need to locate a particular phrase in a voluminous text. Data surveillance tools detect scams or spam by looking for suspicious words embedded in data.
In any context, the search is so well-known and daunting a chore that it is popularly called the “Needle in a Haystack Problem”. In this tutorial, we’ll demonstrate a simple algorithm that uses the indexOf(String str, int fromIndex) method of the Java String class to find all occurrences of a word within a string.
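Since everything that follows rests on indexOf(String str, int fromIndex), a quick sketch of how that method behaves may help. The class name IndexOfBasics, the helper find() and the sample text are our own, chosen only for illustration:

```java
class IndexOfBasics {

    // Returns the index of the first match at or after fromIndex, or -1 if there is none.
    static int find(String text, String word, int fromIndex) {
        return text.indexOf(word, fromIndex);
    }

    public static void main(String[] args) {
        String text = "the cat sat on the mat";

        System.out.println(find(text, "the", 0));  // first "the", at index 0
        System.out.println(find(text, "the", 1));  // next "the", at index 15
        System.out.println(find(text, "the", 16)); // no further match: -1
    }
}
```

Calling the method repeatedly with a growing fromIndex, until it returns -1, is the whole idea behind the search loop used in this tutorial.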
2. Simple Algorithm
Rather than simply counting the occurrences of a word in a larger text, our algorithm will find and report every index at which a specific word occurs in the text. Our approach is short and simple, with the following characteristics:
- The search will find the word even within words in the text. Therefore, if we’re searching for the word “able” then we’ll find it in “comfortable” and “tablet”.
- The search will be case-insensitive.
- The algorithm is based on the naïve string search approach. This means that since we’re naïve about the nature of the characters in the word and the text string, we’ll use brute force to check every location of the text for an instance of the search word.
2.1. Implementation
Now that we’ve defined the parameters for our search, let’s write a simple solution:
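import java.util.ArrayList;
import java.util.List;

public class WordIndexer {

    public List<Integer> findWord(String textString, String word) {
        List<Integer> indexes = new ArrayList<>();
        // Lower-case both strings so that the search is case-insensitive
        String lowerCaseTextString = textString.toLowerCase();
        String lowerCaseWord = word.toLowerCase();

        // An empty word matches at every index, so the loop below would never see -1
        if (lowerCaseWord.isEmpty()) {
            return indexes;
        }

        int index = 0;
        while (index != -1) {
            index = lowerCaseTextString.indexOf(lowerCaseWord, index);
            if (index != -1) {
                indexes.add(index);
                index++; // resume one character later, so overlapping matches are found too
            }
        }
        return indexes;
    }
}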
2.2. Testing the Solution
To test our algorithm, we’ll use a snippet of a famous passage from Shakespeare’s Hamlet and search for the word “or”, which appears five times:
@Test
public void givenWord_whenSearching_thenFindAllIndexedLocations() {
    String theString;
    WordIndexer wordIndexer = new WordIndexer();

    theString = "To be, or not to be: that is the question: "
      + "Whether 'tis nobler in the mind to suffer "
      + "The slings and arrows of outrageous fortune, "
      + "Or to take arms against a sea of troubles, "
      + "And by opposing end them? To die: to sleep; "
      + "No more; and by a sleep to say we end "
      + "The heart-ache and the thousand natural shocks "
      + "That flesh is heir to, 'tis a consummation "
      + "Devoutly to be wish'd. To die, to sleep; "
      + "To sleep: perchance to dream: ay, there's the rub: "
      + "For in that sleep of death what dreams may come,";

    List<Integer> expectedResult = Arrays.asList(7, 122, 130, 221, 438);
    List<Integer> actualResult = wordIndexer.findWord(theString, "or");
    assertEquals(expectedResult, actualResult);
}
When we run our test, we get the expected result. Searching for “or” yields five instances embedded in various ways in the text string:
index of 7, in "or"
index of 122, in "fortune"
index of 130, in "Or"
index of 221, in "more"
index of 438, in "For"
In mathematical terms, the algorithm has a worst-case time complexity of O(m*(n-m)), where m is the length of the word and n is the length of the text string. This approach may be appropriate for haystack text strings of a few thousand characters, but it will be intolerably slow if there are billions of characters.
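To make the bound more concrete, here is a hand-rolled brute-force scan that counts character comparisons. It is only an illustrative sketch: NaiveSearchCost and countComparisons() are hypothetical names of our own, and the JDK's indexOf() is not required to work this way internally. With a text of n 'a' characters and a word made of m-1 'a' characters followed by a 'b', each of the n-m+1 starting positions costs m comparisons:

```java
class NaiveSearchCost {

    // Counts the character comparisons performed by a brute-force scan.
    static long countComparisons(String text, String word) {
        long comparisons = 0;
        int n = text.length();
        int m = word.length();
        for (int start = 0; start + m <= n; start++) {
            for (int j = 0; j < m; j++) {
                comparisons++;
                if (text.charAt(start + j) != word.charAt(j)) {
                    break; // mismatch: give up on this starting position
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        String text = "a".repeat(1000);    // n = 1000 (String.repeat needs Java 11+)
        String word = "a".repeat(9) + "b"; // m = 10, never matches
        // m * (n - m + 1) = 10 * 991 = 9910 comparisons
        System.out.println(countComparisons(text, word));
    }
}
```

The count grows with the product of the two lengths, which is why the approach stops being practical for very large texts.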
3. Improved Algorithm
The simple example above demonstrates a naïve, brute-force approach to searching for a given word in a text string. As such, it will work for any search word and any text.
If we know in advance that the search word does not contain a repetitive pattern of characters, such as “aaa”, so that two occurrences of the word can never overlap in the text, then we can write a slightly more efficient algorithm.
In this case, we can safely skip backing up to re-check every location in the text string as a potential starting location. After each call to the indexOf() method, we simply slide over to the location just after the end of the latest occurrence found. This simple tweak yields a best-case scenario of O(n).
Let’s look at this enhanced version of the earlier findWord() method:
public List<Integer> findWordUpgrade(String textString, String word) {
    List<Integer> indexes = new ArrayList<>();
    String lowerCaseTextString = textString.toLowerCase();
    String lowerCaseWord = word.toLowerCase();

    // An empty word would never advance the search position
    if (lowerCaseWord.isEmpty()) {
        return indexes;
    }

    int wordLength = 0; // nothing to skip before the first search
    int index = 0;
    while (index != -1) {
        // Slight improvement: resume just past the end of the previous match
        index = lowerCaseTextString.indexOf(lowerCaseWord, index + wordLength);
        if (index != -1) {
            indexes.add(index);
        }
        wordLength = lowerCaseWord.length();
    }
    return indexes;
}
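To see why the no-overlap assumption matters, the sketch below runs both stepping strategies side by side. OverlapDemo and its step parameter are our own illustrative constructs, and lower-casing is left out to keep the example short: stepping by 1 mirrors findWord(), while stepping by the word length mirrors findWordUpgrade(). The helper assumes a non-empty word and a step of at least 1.

```java
import java.util.ArrayList;
import java.util.List;

class OverlapDemo {

    // 'step' is how far to move past the start of each match before searching again.
    static List<Integer> find(String text, String word, int step) {
        List<Integer> indexes = new ArrayList<>();
        int index = text.indexOf(word);
        while (index != -1) {
            indexes.add(index);
            index = text.indexOf(word, index + step);
        }
        return indexes;
    }

    public static void main(String[] args) {
        // A repetitive word: the two strategies disagree
        System.out.println(find("aaaa", "aa", 1)); // [0, 1, 2]
        System.out.println(find("aaaa", "aa", 2)); // [0, 2]

        // A word that cannot overlap itself: both strategies agree
        System.out.println(find("to be or not to be", "be", 1)); // [3, 16]
        System.out.println(find("to be or not to be", "be", 2)); // [3, 16]
    }
}
```

For a word such as “aa”, skipping a full word length silently drops the overlapping match at index 1, so the faster variant is only safe when such overlaps cannot occur.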
4. Conclusion
In this tutorial, we presented a case-insensitive search algorithm that finds every occurrence of a word in a larger text string, whatever its capitalization. But don’t let that hide the fact that the indexOf() method of the Java String class is inherently case-sensitive and can distinguish between “Bob” and “bob”, for example.
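The difference is easy to check with a small sketch. The class CaseSensitivityDemo and its two helpers are hypothetical names used only for this illustration:

```java
class CaseSensitivityDemo {

    // Plain indexOf(): case-sensitive.
    static int sensitive(String text, String word) {
        return text.indexOf(word);
    }

    // Lower-case both sides first, as the tutorial's findWord() does.
    static int insensitive(String text, String word) {
        return text.toLowerCase().indexOf(word.toLowerCase());
    }

    public static void main(String[] args) {
        String text = "Bob met bob";

        System.out.println(sensitive(text, "Bob"));   // 0
        System.out.println(sensitive(text, "bob"));   // 8
        System.out.println(sensitive(text, "BOB"));   // -1
        System.out.println(insensitive(text, "BOB")); // 0
    }
}
```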
Altogether, indexOf() is a convenient method for finding a character sequence buried in a text string without writing any substring-manipulation code ourselves.
As usual, the complete codebase of this example is over on GitHub.