Finding the Difference Between Two Strings in Java

Last modified: September 8, 2019

1. Overview

This quick tutorial will show how to find the difference between two strings using Java.

For this tutorial, we’re going to use two existing Java libraries and compare their approaches to this problem.

2. The Problem

Let’s consider the following requirement: we want to find the difference between the strings “ABCDELMN” and “ABCFGLMN”.

Depending on the output format we need, and setting aside the option of writing custom code, there are two main options available.

The first one is a library written by Google called diff-match-patch. As they claim, the library offers robust algorithms for synchronizing plain text.

The other option is the StringUtils class from Apache Commons Lang.

Let’s explore the differences between these two.

3. diff-match-patch

For this article, we’ll use a fork of the original Google library, because the artifacts of the original are not released on Maven Central. In addition, some class names in the fork differ from the original codebase and follow Java naming conventions more closely.

First, we’ll need to include its dependency in our pom.xml file:

<dependency>
    <groupId>org.bitbucket.cowwoc</groupId>
    <artifactId>diff-match-patch</artifactId>
    <version>1.2</version>
</dependency>

Then, let’s consider this code:

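import java.util.LinkedList;
import org.bitbucket.cowwoc.diffmatchpatch.DiffMatchPatch;
import org.bitbucket.cowwoc.diffmatchpatch.DiffMatchPatch.Diff;

String text1 = "ABCDELMN";
String text2 = "ABCFGLMN";
DiffMatchPatch dmp = new DiffMatchPatch();
LinkedList<Diff> diff = dmp.diffMain(text1, text2, false);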

The code above computes the difference between text1 and text2. Printing the variable diff produces this output:

[Diff(EQUAL,"ABC"), Diff(DELETE,"DE"), Diff(INSERT,"FG"), Diff(EQUAL,"LMN")]

The result is a list of Diff objects, each consisting of an operation type (INSERT, DELETE or EQUAL) and the portion of text associated with that operation.

When we run the diff in the opposite direction, from text2 to text1, the DELETE and INSERT operations swap their text:

[Diff(EQUAL,"ABC"), Diff(DELETE,"FG"), Diff(INSERT,"DE"), Diff(EQUAL,"LMN")]
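
To illustrate how such a list of (operation, text) pairs could be consumed, here is a minimal sketch that renders it as a single annotated string. It deliberately avoids the library so that it runs on its own: the Op enum, the Segment class and the render method are hypothetical stand-ins for the library's operation type and Diff class, not part of diff-match-patch.

```java
import java.util.Arrays;
import java.util.List;

public class DiffRenderSketch {

    // Hypothetical stand-in for the library's operation type.
    enum Op { EQUAL, DELETE, INSERT }

    // Hypothetical stand-in for a Diff entry: an operation plus its text.
    static final class Segment {
        final Op op;
        final String text;

        Segment(Op op, String text) {
            this.op = op;
            this.text = text;
        }
    }

    // Renders deletions as [-...], insertions as [+...], and equal text unchanged.
    static String render(List<Segment> segments) {
        StringBuilder sb = new StringBuilder();
        for (Segment s : segments) {
            switch (s.op) {
                case DELETE:
                    sb.append("[-").append(s.text).append("]");
                    break;
                case INSERT:
                    sb.append("[+").append(s.text).append("]");
                    break;
                default:
                    sb.append(s.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same content as the first diff output shown above.
        List<Segment> diff = Arrays.asList(
            new Segment(Op.EQUAL, "ABC"),
            new Segment(Op.DELETE, "DE"),
            new Segment(Op.INSERT, "FG"),
            new Segment(Op.EQUAL, "LMN"));
        System.out.println(render(diff)); // prints ABC[-DE][+FG]LMN
    }
}
```

With the real library, the same loop would iterate over the list returned by diffMain and read each entry's operation and text instead of the stand-in fields.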

4. StringUtils

The class from Apache Commons takes a simpler approach.

First, we’ll add the Apache Commons Lang dependency to our pom.xml file:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-lang3</artifactId>
    <version>3.12.0</version>
</dependency>

Then, to find the difference between two texts with Apache Commons, we’d call StringUtils#difference:

StringUtils.difference(text1, text2)

The output produced will be a simple string:

FGLMN

Running the comparison in the opposite direction, StringUtils.difference(text2, text1), returns:

DELMN

This simple approach can be enhanced using StringUtils.indexOfDifference(), which returns the index at which the two strings start to differ (in our case 3, i.e. the fourth character of the string). We can use this index to take a substring of the original string, showing what the two inputs have in common in addition to what’s different.
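
The idea of combining the index with a substring can be sketched in plain Java. The helper methods below are hypothetical and only mimic, for non-null inputs, the behavior described above; they are not the Apache Commons implementation, and in real code we would call StringUtils.indexOfDifference() and StringUtils.difference() directly.

```java
public class PrefixDiffSketch {

    // Index of the first differing character, or -1 when the strings are equal.
    // Assumes both arguments are non-null.
    static int indexOfDifference(String a, String b) {
        int limit = Math.min(a.length(), b.length());
        for (int i = 0; i < limit; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : limit;
    }

    // The part both strings share before they start to differ.
    static String commonPrefix(String a, String b) {
        int index = indexOfDifference(a, b);
        return index < 0 ? a : a.substring(0, index);
    }

    // The remainder of the second string from the point of difference.
    static String difference(String a, String b) {
        int index = indexOfDifference(a, b);
        return index < 0 ? "" : b.substring(index);
    }

    public static void main(String[] args) {
        String text1 = "ABCDELMN";
        String text2 = "ABCFGLMN";
        System.out.println(indexOfDifference(text1, text2)); // prints 3
        System.out.println(commonPrefix(text1, text2));      // prints ABC
        System.out.println(difference(text1, text2));        // prints FGLMN
    }
}
```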

5. Performance

For our benchmarks, we generate a list of 10,000 strings with a fixed portion of 10 characters, followed by 20 random alphabetic characters.
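
As a rough illustration of that setup, the input list could be generated along these lines. This is only a sketch under assumptions: the class name, the fixed prefix value, the seed, and the choice of lowercase letters are all hypothetical and not taken from the benchmark's actual source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class BenchmarkInputSketch {

    // Hypothetical 10-character fixed portion shared by every input.
    static final String FIXED_PREFIX = "ABCDEFGHIJ";

    // Builds `count` strings: the fixed prefix followed by 20 random letters.
    // A fixed seed keeps the generated data reproducible between runs.
    static List<String> generateInputs(int count, long seed) {
        Random random = new Random(seed);
        List<String> inputs = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            StringBuilder sb = new StringBuilder(FIXED_PREFIX);
            for (int j = 0; j < 20; j++) {
                // Assumption: lowercase a-z only.
                sb.append((char) ('a' + random.nextInt(26)));
            }
            inputs.add(sb.toString());
        }
        return inputs;
    }

    public static void main(String[] args) {
        List<String> inputs = generateInputs(10_000, 42L);
        System.out.println(inputs.size());          // prints 10000
        System.out.println(inputs.get(0).length()); // prints 30
    }
}
```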

We then loop through the list and perform a diff between the nth element and the n+1th element of the list:

@Benchmark
public int diffMatchPatch() {
    for (int i = 0; i < inputs.size() - 1; i++) {
        diffMatchPatch.diffMain(inputs.get(i), inputs.get(i + 1), false);
    }
    return inputs.size();
}

@Benchmark
public int stringUtils() {
    for (int i = 0; i < inputs.size() - 1; i++) {
        StringUtils.difference(inputs.get(i), inputs.get(i + 1));
    }
    return inputs.size();
}

Finally, let’s run the benchmarks and compare the two libraries:

Benchmark                                   Mode  Cnt    Score   Error  Units
StringDiffBenchmarkUnitTest.diffMatchPatch  avgt   50  130.559 ± 1.501  ms/op
StringDiffBenchmarkUnitTest.stringUtils     avgt   50    0.211 ± 0.003  ms/op

6. Conclusion

In terms of pure execution speed, StringUtils is clearly faster: about 0.2 ms/op versus about 130 ms/op in our benchmark, a difference of roughly 600 times. However, it only returns the remainder of the second string from the point where the two strings start to differ.

At the same time, diff-match-patch provides a more thorough comparison result, at the expense of performance.

The implementation of these examples and snippets is available on GitHub.