1. Introduction
In this article, we’ll look at several algorithms for searching for a pattern in a large text. We’ll describe each algorithm with code and some simple mathematical background.
Note that these algorithms are not the best way to implement full-text search in more complex applications. For proper full-text search, we can use Solr or ElasticSearch.
2. Algorithms
We’ll start with a naive text search algorithm, which is the most intuitive one and helps expose the problems that the more advanced algorithms address.
2.1. Helper Methods
Before we start, let’s define simple methods for calculating prime numbers, which we’ll use in the Rabin-Karp algorithm:
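import java.math.BigInteger;
import java.util.Random;

// Returns a random probable prime that has one more bit than m.
public static long getBiggerPrime(int m) {
    BigInteger prime = BigInteger.probablePrime(getNumberOfBits(m) + 1, new Random());
    return prime.longValue();
}

// Number of bits needed to represent number in binary.
private static int getNumberOfBits(int number) {
    return Integer.SIZE - Integer.numberOfLeadingZeros(number);
}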
2.2. Simple Text Search
The name of this algorithm describes it better than any explanation. It’s the most natural solution:
public static int simpleTextSearch(char[] pattern, char[] text) {
    int patternSize = pattern.length;
    int textSize = text.length;

    int i = 0;
    // Try every position where the whole pattern still fits into the text.
    while ((i + patternSize) <= textSize) {
        int j = 0;
        // Compare left to right until a mismatch or a full match.
        while (text[i + j] == pattern[j]) {
            j += 1;
            if (j >= patternSize)
                return i;
        }
        i += 1;
    }
    return -1;
}
The idea of this algorithm is straightforward: iterate through the text and, wherever the first letter of the pattern matches, check whether all the remaining letters of the pattern match the text as well.
If m is the number of letters in the pattern and n is the number of letters in the text, the worst-case time complexity of this algorithm is O(m(n-m+1)).
The worst case occurs when the text contains many partial occurrences of the pattern:
Text: baeldunbaeldunbaeldunbaeldun
Pattern: baeldung
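To make the cost of partial occurrences concrete, here is a minimal sketch that counts character comparisons instead of returning a position. The class NaiveSearchCost and its methods are hypothetical names introduced for this illustration only; the scan itself follows the same idea as simpleTextSearch.

```java
import java.util.Arrays;

class NaiveSearchCost {

    // Same scan as the simple text search, but it returns the number of
    // character comparisons performed (hypothetical helper for illustration).
    static long countComparisons(char[] pattern, char[] text) {
        long comparisons = 0;
        for (int i = 0; i + pattern.length <= text.length; i++) {
            int j = 0;
            while (j < pattern.length) {
                comparisons++;
                if (text[i + j] != pattern[j]) {
                    break;
                }
                j++;
            }
            if (j == pattern.length) {
                return comparisons; // full match found, stop counting
            }
        }
        return comparisons;
    }

    // Builds an array that contains the character c exactly n times.
    static char[] repeat(char c, int n) {
        char[] result = new char[n];
        Arrays.fill(result, c);
        return result;
    }

    public static void main(String[] args) {
        char[] text = repeat('a', 20);          // n = 20
        char[] pattern = "aaaab".toCharArray(); // m = 5
        long bound = (long) pattern.length * (text.length - pattern.length + 1);
        System.out.println(countComparisons(pattern, text) + " comparisons, bound " + bound);
    }
}
```

With a text of twenty a characters and the pattern aaaab, every one of the 16 candidate positions fails only on the last character, so the sketch reports 80 comparisons, which is exactly m(n-m+1) = 5 * 16.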
2.3. Rabin-Karp Algorithm
As mentioned above, the Simple Text Search algorithm is very inefficient when the pattern is long and many of its elements are repeated in the text.
The idea of the Rabin-Karp algorithm is to use hashing to find a pattern in a text. At the beginning, we calculate a hash of the pattern, which is then compared with the hash of each text window of the same length. This process is called fingerprint calculation.
The important thing about the pre-processing step is that its time complexity is O(m), and the iteration through the text takes O(n) on average, which gives an expected time complexity of O(m+n) for the whole algorithm.
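The reason the pass over the text stays cheap is that the fingerprint of each window can be derived from the previous one in constant time. The following is a minimal sketch of that rolling update, assuming base 2 and a small fixed prime (1009) chosen only for this illustration; the class RollingFingerprint and its method names are hypothetical.

```java
class RollingFingerprint {

    // Fingerprint of text[from .. from + length - 1]: the characters are used
    // as coefficients of a polynomial evaluated at 2, reduced modulo prime.
    static long fingerprint(char[] text, int from, int length, long prime) {
        long hash = 0;
        for (int k = 0; k < length; k++) {
            hash = (2 * hash + text[from + k]) % prime;
        }
        return hash;
    }

    // 2^(length - 1) mod prime: the weight of the window's first character.
    static long highWeight(int length, long prime) {
        long weight = 1;
        for (int k = 0; k < length - 1; k++) {
            weight = (2 * weight) % prime;
        }
        return weight;
    }

    // Constant-time update: drop the outgoing character, shift, add the incoming one.
    static long roll(long hash, long weight, char outgoing, char incoming, long prime) {
        long value = 2 * (hash - weight * outgoing) + incoming;
        return ((value % prime) + prime) % prime; // keep the result non-negative
    }

    // Checks that rolling across the text agrees with recomputing every window.
    // Assumes text.length >= length.
    static boolean rollingMatchesRecomputed(char[] text, int length, long prime) {
        long weight = highWeight(length, prime);
        long hash = fingerprint(text, 0, length, prime);
        for (int i = 0; i + length < text.length; i++) {
            hash = roll(hash, weight, text[i], text[i + length], prime);
            if (hash != fingerprint(text, i + 1, length, prime)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 1009 is an arbitrary small prime used only for this demonstration.
        System.out.println(rollingMatchesRecomputed("baeldunbaeldung".toCharArray(), 3, 1009));
    }
}
```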
Here is the code of the complete algorithm:
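public static int RabinKarpMethod(char[] pattern, char[] text) {
    int patternSize = pattern.length;
    int textSize = text.length;

    // A pattern longer than the text cannot match.
    if (patternSize > textSize) {
        return -1;
    }

    long prime = getBiggerPrime(patternSize);

    // r = 2^(patternSize - 1) mod prime, the weight of a window's first character.
    long r = 1;
    for (int i = 0; i < patternSize - 1; i++) {
        r *= 2;
        r = r % prime;
    }

    // t[i] is the fingerprint of the text window that starts at position i.
    long[] t = new long[textSize];
    t[0] = 0;

    long pfinger = 0;

    // Fingerprints of the pattern and of the first text window.
    for (int j = 0; j < patternSize; j++) {
        t[0] = (2 * t[0] + text[j]) % prime;
        pfinger = (2 * pfinger + pattern[j]) % prime;
    }

    int i = 0;
    boolean passed = false;

    int diff = textSize - patternSize;
    for (i = 0; i <= diff; i++) {
        if (t[i] == pfinger) {
            // Equal fingerprints may be a collision, so verify character by character.
            passed = true;
            for (int k = 0; k < patternSize; k++) {
                if (text[i + k] != pattern[k]) {
                    passed = false;
                    break;
                }
            }

            if (passed) {
                return i;
            }
        }

        if (i < diff) {
            // Roll the fingerprint: drop text[i], append text[i + patternSize].
            long value = 2 * (t[i] - r * text[i]) + text[i + patternSize];
            t[i + 1] = ((value % prime) + prime) % prime;
        }
    }
    return -1;
}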
In the worst case, when many fingerprints collide and every window has to be verified character by character, the time complexity of this algorithm is O(m(n-m+1)). On average, however, it is O(n+m).
Additionally, there is a Monte Carlo version of this algorithm that is faster, but it can report wrong matches (false positives).
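As a rough sketch of what such a Monte Carlo variant could look like, the code below trusts a fingerprint match without comparing the characters. This is an illustration under stated assumptions rather than a reference implementation: the class MonteCarloRabinKarp is a hypothetical name, the prime is passed in by the caller instead of being generated, and empty patterns are simply reported as not found.

```java
class MonteCarloRabinKarp {

    // Returns the first position whose fingerprint equals the pattern's
    // fingerprint, WITHOUT comparing the characters. The result may be a
    // false positive when two different strings share a fingerprint.
    static int search(char[] pattern, char[] text, long prime) {
        int m = pattern.length;
        int n = text.length;
        if (m == 0 || m > n) {
            return -1; // assumption of this sketch: empty patterns are not handled
        }

        long weight = 1; // 2^(m - 1) mod prime
        for (int k = 0; k < m - 1; k++) {
            weight = (2 * weight) % prime;
        }

        long patternHash = 0;
        long windowHash = 0;
        for (int k = 0; k < m; k++) {
            patternHash = (2 * patternHash + pattern[k]) % prime;
            windowHash = (2 * windowHash + text[k]) % prime;
        }

        for (int i = 0; i + m <= n; i++) {
            if (windowHash == patternHash) {
                return i; // trusted without verification
            }
            if (i + m < n) {
                long value = 2 * (windowHash - weight * text[i]) + text[i + m];
                windowHash = ((value % prime) + prime) % prime;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        char[] pattern = "ab".toCharArray();
        char[] text = "cab".toCharArray();
        // A tiny prime makes collisions likely: "ca" and "ab" collide modulo 3.
        System.out.println(search(pattern, text, 3));
        // A large prime makes a collision far less likely.
        System.out.println(search(pattern, text, 1_000_000_007L));
    }
}
```

With the prime 3, the sketch reports position 0 for the pattern ab in the text cab, which is a false positive, while the large prime yields the real match at position 1. Skipping the verification keeps the running time at O(n+m) even in the worst case, at the price of occasional wrong answers.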
2.4. Knuth-Morris-Pratt Algorithm
In the Simple Text Search algorithm, we saw how the algorithm can be slow if many parts of the text partially match the pattern.
The idea of the Knuth-Morris-Pratt algorithm is to calculate a shift table that tells us where we should look for the next pattern candidate after a mismatch.
Here is a Java implementation of the KMP algorithm:
public static int KnuthMorrisPrattSearch(char[] pattern, char[] text) {
    int patternSize = pattern.length;
    int textSize = text.length;

    // i is the start of the current candidate, j the number of matched characters.
    int i = 0, j = 0;

    int[] shift = KnuthMorrisPrattShift(pattern);

    while ((i + patternSize) <= textSize) {
        while (text[i + j] == pattern[j]) {
            j += 1;
            if (j >= patternSize)
                return i;
        }

        if (j > 0) {
            // Slide the pattern by the precomputed amount and keep the
            // characters that are already known to match.
            i += shift[j - 1];
            j = Math.max(j - shift[j - 1], 0);
        } else {
            i++;
            j = 0;
        }
    }
    return -1;
}
And here is how we calculate the shift table:
public static int[] KnuthMorrisPrattShift(char[] pattern) {
    int patternSize = pattern.length;

    // shift[k] is how far to slide the pattern after a mismatch that
    // follows k + 1 matched characters.
    int[] shift = new int[patternSize];
    shift[0] = 1;

    // Match the pattern against itself, shifted by i positions.
    int i = 1, j = 0;

    while ((i + j) < patternSize) {
        if (pattern[i + j] == pattern[j]) {
            shift[i + j] = i;
            j++;
        } else {
            if (j == 0)
                shift[i] = i + 1;

            if (j > 0) {
                i = i + shift[j - 1];
                j = Math.max(j - shift[j - 1], 0);
            } else {
                i = i + 1;
                j = 0;
            }
        }
    }
    return shift;
}
The time complexity of this algorithm is O(m+n), even in the worst case.
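For comparison, many textbooks formulate KMP with a failure (prefix) table instead of a shift table. The sketch below shows that common variant; it is not the implementation used in this article, and the class PrefixFunctionKmp and its method names are hypothetical. In this formulation, the slide distance after k + 1 matched characters is (k + 1) - failure[k], which plays the role that shift plays in the code above.

```java
class PrefixFunctionKmp {

    // failure[k] = length of the longest proper prefix of pattern[0..k]
    // that is also a suffix of pattern[0..k].
    static int[] failureTable(char[] pattern) {
        int[] failure = new int[pattern.length];
        int matched = 0;
        for (int k = 1; k < pattern.length; k++) {
            while (matched > 0 && pattern[k] != pattern[matched]) {
                matched = failure[matched - 1];
            }
            if (pattern[k] == pattern[matched]) {
                matched++;
            }
            failure[k] = matched;
        }
        return failure;
    }

    // Returns the index of the first occurrence of pattern in text, or -1.
    static int search(char[] pattern, char[] text) {
        if (pattern.length == 0) {
            return 0; // convention chosen for this sketch
        }
        int[] failure = failureTable(pattern);
        int matched = 0;
        for (int i = 0; i < text.length; i++) {
            // On a mismatch, fall back to the longest border instead of
            // re-reading text characters.
            while (matched > 0 && text[i] != pattern[matched]) {
                matched = failure[matched - 1];
            }
            if (text[i] == pattern[matched]) {
                matched++;
            }
            if (matched == pattern.length) {
                return i - pattern.length + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("baeldung".toCharArray(), "baeldunbaeldung".toCharArray()));
    }
}
```

Each text character is read once and the fallback loop only undoes earlier progress, which is where the linear bound comes from.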
2.5. Simple Boyer-Moore Algorithm
Two scientists, Boyer and Moore, came up with another idea. Why not compare the pattern to the text from right to left instead of from left to right, while still shifting the pattern from left to right? A simple version looks like this:
public static int BoyerMooreHorspoolSimpleSearch(char[] pattern, char[] text) {
    int patternSize = pattern.length;
    int textSize = text.length;

    int i = 0, j = 0;

    while ((i + patternSize) <= textSize) {
        // Compare from the last character of the pattern towards the first.
        j = patternSize - 1;
        while (text[i + j] == pattern[j]) {
            j--;
            if (j < 0)
                return i;
        }
        // Without any heuristic, the pattern moves right by a single position.
        i++;
    }
    return -1;
}
As expected, this runs in O(m * n) time. However, this algorithm led to the occurrence and match heuristics, which speed it up significantly.
2.6. Boyer-Moore-Horspool Algorithm
There are many variations of the heuristic implementation of the Boyer-Moore algorithm, and the simplest one is the Horspool variation.
This version of the algorithm is called Boyer-Moore-Horspool, and it solves the problem of negative shifts (we can read about the negative shift problem in the description of the Boyer-Moore algorithm).
Like the Boyer-Moore algorithm, its worst-case time complexity is O(m * n), while the average complexity is O(n). Space usage doesn’t depend on the size of the pattern, only on the size of the alphabet. The implementation below uses a table of 256 entries, which covers every 8-bit character code (0 to 255), including all ASCII characters; text containing characters with larger codes would need a larger table:
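public static int BoyerMooreHorspoolSearch(char[] pattern, char[] text) {
    // One entry per character code 0..255; larger codes are not supported here.
    int[] shift = new int[256];

    // A character that does not occur in the pattern allows a full-length shift.
    for (int k = 0; k < 256; k++) {
        shift[k] = pattern.length;
    }

    // For every character except the last one, store its distance to the pattern's end.
    for (int k = 0; k < pattern.length - 1; k++) {
        shift[pattern[k]] = pattern.length - 1 - k;
    }

    int i = 0, j = 0;

    while ((i + pattern.length) <= text.length) {
        j = pattern.length - 1;

        while (text[i + j] == pattern[j]) {
            j -= 1;
            if (j < 0)
                return i;
        }

        // Shift according to the text character aligned with the pattern's last position.
        i = i + shift[text[i + pattern.length - 1]];
    }
    return -1;
}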
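To see what the shift table contains, here is a small sketch that builds it for the pattern baeldung and prints a few entries. The class HorspoolShiftTable and the method buildShift are hypothetical names for this illustration; the table logic mirrors the two loops at the start of BoyerMooreHorspoolSearch and assumes that all pattern characters have codes below 256.

```java
import java.util.Arrays;

class HorspoolShiftTable {

    // Builds the shift table for character codes 0..255
    // (assumption: the pattern contains no characters with larger codes).
    static int[] buildShift(char[] pattern) {
        int[] shift = new int[256];
        // Characters that do not occur in the pattern allow a full-length shift.
        Arrays.fill(shift, pattern.length);
        // The last pattern character is deliberately skipped, so every shift is at least 1.
        for (int k = 0; k < pattern.length - 1; k++) {
            shift[pattern[k]] = pattern.length - 1 - k;
        }
        return shift;
    }

    public static void main(String[] args) {
        int[] shift = buildShift("baeldung".toCharArray());
        for (char c : "baeldungx".toCharArray()) {
            System.out.println(c + " -> " + shift[c]);
        }
    }
}
```

For baeldung, the entries run from 7 for b down to 1 for n, while g (the last character) and x (which does not occur in the pattern) both keep the full pattern length of 8.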
3. Conclusion
In this article, we presented several algorithms for text search. Since some of them require a stronger mathematical background, we tried to convey the main idea behind each algorithm and present it in a simple manner.
And, as always, the source code can be found over on GitHub.