1. Introduction
1.介绍
In this article, we describe the Levenshtein distance, alternatively known as the Edit distance. The algorithm explained here was devised by a Russian scientist, Vladimir Levenshtein, in 1965.
在这篇文章中,我们描述了列文施泰因距离,也被称为Edit距离。这里解释的算法是由一位俄罗斯科学家Vladimir Levenshtein在1965年设计的。
We’ll provide an iterative and a recursive Java implementation of this algorithm.
我们将提供这种算法的迭代和递归的Java实现。
2. What Is the Levenshtein Distance?
2.什么是列文施泰因距离?
The Levenshtein distance is a measure of dissimilarity between two Strings. Mathematically, given two Strings x and y, the distance measures the minimum number of character edits required to transform x into y.
Levenshtein距离是衡量两个字符串之间的不相似性。在数学上,给定两个字符串 x和y,该距离测量将x转换为y所需的最小字符编辑数。
Typically three type of edits are allowed:
一般来说,允许三种类型的编辑。
- Insertion of a character c
- Deletion of a character c
- Substitution of a character c with c‘
Example: If x = ‘shot’ and y = ‘spot’, the edit distance between the two is 1 because ‘shot’ can be converted to ‘spot’ by substituting ‘h‘ to ‘p‘.
例子。如果x = ‘shot’和y = ‘spot’,两者之间的编辑距离是1,因为‘shot’可以通过将’h‘替换为’p‘来转换为’spot’。
In certain sub-classes of the problem, the cost associated with each type of edit may be different.
在问题的某些子类中,与每种类型的编辑相关的成本可能是不同的。
For example, less cost for substitution with a character located nearby on the keyboard and more cost otherwise. For simplicity, we’ll consider all costs to be equal in this article.
例如,用键盘上附近的字符进行替换的成本较低,反之则成本较高。为了简单起见,我们在本文中认为所有成本都是相等的。
Some of the applications of edit distance are:
编辑距离的一些应用是。
- Spell Checkers – detecting spelling errors in text and find the closest correct spelling in dictionary
- Plagiarism Detection (refer – IEEE Paper)
- DNA Analysis – finding similarity between two sequences
- Speech Recognition (refer – Microsoft Research)
3. Algorithm Formulation
3.算法的制定
Let’s take two Strings x and y of lengths m and n respectively. We can denote each String as x[1:m] and y[1:n].
让我们取两个长度分别为m和n的字符串。我们可以把每个String表示为x[1:m]和y[1:n].。
We know that at the end of the transformation, both Strings will be of equal length and have matching characters at each position. So, if we consider the first character of each String, we’ve got three options:
我们知道,在转换结束时,两个字符串将是等长的,并且在每个位置都有匹配的字符。因此,如果我们考虑每个字符串的第一个字符,我们有三个选择。
- Substitution:
- Determine the cost (D1) of substituting x[1] with y[1]. The cost of this step would be zero if both characters are same. If not, then the cost would be one
- After step 1.1, we know that both Strings start with the same character. Hence the total cost would now be the sum of the cost of step 1.1 and the cost of transforming the rest of the String x[2:m] into y[2:n]
- Insertion:
- Insert a character in x to match the first character in y, the cost of this step would be one
- After 2.1, we have processed one character from y. Hence the total cost would now be the sum of the cost of step 2.1 (i.e., 1) and the cost of transforming the full x[1:m] to remaining y (y[2:n])
- Deletion:
- Delete the first character from x, the cost of this step would be one
- After 3.1, we have processed one character from x, but the full y remains to be processed. The total cost would be the sum of the cost of 3.1 (i.e., 1) and the cost of transforming remaining x to the full y
The next part of the solution is to figure out which option to choose out of these three. Since we do not know which option would lead to minimum cost at the end, we must try all options and choose the best one.
解决方案的下一个部分是弄清楚在这三个选项中选择哪一个。由于我们不知道哪种方案会导致最后的最低成本,我们必须尝试所有的方案并选择最佳方案。
4. Naive Recursive Implementation
4.天真递归的实现
We can see that the second step of each option in section #3 is mostly the same edit distance problem but on sub-strings of the original Strings. This means after each iteration we end up with the same problem but with smaller Strings.
我们可以看到,第3节中每个选项的第二步主要是相同的编辑距离问题,但在原始字符串的子字符串上。这意味着在每次迭代之后,我们最终得到的是相同的问题,不过是小的字符串。
This observation is the key to formulate a recursive algorithm. The recurrence relation can be defined as:
这一观察是制定递归算法的关键。递归关系可以定义为:。
D(x[1:m], y[1:n]) = min {
D(x[1:m], y[1:n]) = min {
D(x[2:m], y[2:n]) + Cost of Replacing x[1] to y[1],
D(x[2:m], y[2:n]) + 将x[1]替换成y[1]的成本,
D(x[1:m], y[2:n]) + 1,
D(x[1:m], y[2:n]) + 1,
D(x[2:m], y[1:n]) + 1
D(x[2:m], y[1:n]) + 1
}
}
We must also define base cases for our recursive algorithm, which in our case is when one or both Strings become empty:
我们还必须为我们的递归算法定义基例,在我们的案例中,基例就是当一个或两个字符串变成空时。
- When both Strings are empty, then the distance between them is zero
- When one of the Strings is empty, then the edit distance between them is the length of the other String, as we need that many numbers of insertions/deletions to transform one into the other:
- Example: if one String is “dog” and other String is “” (empty), we need either three insertions in empty String to make it “dog”, or we need three deletions in “dog” to make it empty. Hence the edit distance between them is 3
A naive recursive implementation of this algorithm:
该算法的一个天真递归实现。
public class EditDistanceRecursive {
static int calculate(String x, String y) {
if (x.isEmpty()) {
return y.length();
}
if (y.isEmpty()) {
return x.length();
}
int substitution = calculate(x.substring(1), y.substring(1))
+ costOfSubstitution(x.charAt(0), y.charAt(0));
int insertion = calculate(x, y.substring(1)) + 1;
int deletion = calculate(x.substring(1), y) + 1;
return min(substitution, insertion, deletion);
}
public static int costOfSubstitution(char a, char b) {
return a == b ? 0 : 1;
}
public static int min(int... numbers) {
return Arrays.stream(numbers)
.min().orElse(Integer.MAX_VALUE);
}
}
This algorithm has the exponential complexity. At each step, we branch-off into three recursive calls, building an O(3^n) complexity.
这个算法的复杂度是指数级的。在每一步,我们将分支成三个递归调用,建立一个O(3^n)复杂度。
In the next section, we’ll see how to improve upon this.
在下一节,我们将看到如何在此基础上进行改进。
5. Dynamic Programming Approach
5.动态编程方法
On analyzing the recursive calls, we observe that the arguments for sub-problems are suffixes of the original Strings. This means there can only be m*n unique recursive calls (where m and n are a number of suffixes of x and y). Hence the complexity of the optimal solution should be quadratic, O(m*n).
在分析递归调用时,我们发现子问题的参数是原始字符串的后缀。这意味着只能有m*n唯一的递归调用(其中m和n是x和y的若干后缀)。因此,最优解的复杂度应该是二次的,O(m*n)。
Lets look at some of the sub-problems (according to recurrence relation defined in section #4):
让我们看看一些子问题(根据第4节中定义的递归关系)。
- Sub-problems of D(x[1:m], y[1:n]) are D(x[2:m], y[2:n]), D(x[1:m], y[2:n]) and D(x[2:m], y[1:n])
- Sub-problems of D(x[1:m], y[2:n]) are D(x[2:m], y[3:n]), D(x[1:m], y[3:n]) and D(x[2:m], y[2:n])
- Sub-problems of D(x[2:m], y[1:n]) are D(x[3:m], y[2:n]), D(x[2:m], y[2:n]) and D(x[3:m], y[1:n])
In all three cases, one of the sub-problems is D(x[2:m], y[2:n]). Instead of calculating this three times like we do in the naive implementation, we can calculate this once and reuse the result whenever needed again.
在所有三种情况下,其中一个子问题是D(x[2:m], y[2:n])。我们不需要像在天真的实现中那样计算三次,而是可以计算一次,并在需要时再次使用结果。
This problem has a lot of overlapping sub-problems, but if we know the solution to the sub-problems, we can easily find the answer to the original problem. Therefore, we have both of the properties needed for formulating a dynamic programming solution, i.e., Overlapping Sub-Problems and Optimal Substructure.
这个问题有很多重叠的子问题,但如果我们知道子问题的解决方案,我们就可以很容易地找到原始问题的答案。因此,我们拥有制定动态编程解决方案所需的两个属性,即重叠子问题和最优的结构。
We can optimize the naive implementation by introducing memoization, i.e., store the result of the sub-problems in an array and reuse the cached results.
我们可以通过引入记忆来优化天真的实现,即把子问题的结果存储在一个数组中并重复使用缓存的结果。
Alternatively, we can also implement this iteratively by using a table based approach:
另外,我们也可以通过使用基于表格的方法来迭代实现这一点。
static int calculate(String x, String y) {
int[][] dp = new int[x.length() + 1][y.length() + 1];
for (int i = 0; i <= x.length(); i++) {
for (int j = 0; j <= y.length(); j++) {
if (i == 0) {
dp[i][j] = j;
}
else if (j == 0) {
dp[i][j] = i;
}
else {
dp[i][j] = min(dp[i - 1][j - 1]
+ costOfSubstitution(x.charAt(i - 1), y.charAt(j - 1)),
dp[i - 1][j] + 1,
dp[i][j - 1] + 1);
}
}
}
return dp[x.length()][y.length()];
}
This algorithm performs significantly better than the recursive implementation. However, it involves significant memory consumption.
该算法的性能明显优于递归实现。然而,它涉及大量的内存消耗。
This can further be optimized by observing that we only need the value of three adjacent cells in the table to find the value of the current cell.
通过观察我们只需要表格中三个相邻单元格的值就可以找到当前单元格的值,这可以进一步优化。
6. Conclusion
6.结论
In this article, we described what is Levenshtein distance and how it can be calculated using a recursive and a dynamic-programming based approach.
在这篇文章中,我们描述了什么是列文斯坦距离,以及如何使用递归和基于动态编程的方法来计算它。
Levenshtein distance is only one of the measures of string similarity, some of the other metrics are Cosine Similarity (which uses a token-based approach and considers the strings as vectors), Dice Coefficient, etc.
Levenshtein距离只是衡量字符串相似性的指标之一,其他一些指标有Cosine相似性(它使用基于标记的方法,将字符串视为向量),Dice系数等。
As always the full implementation of examples can be found over on GitHub.
一如既往,完整的实施示例可以在GitHub上找到over。