An Overview of Regular Expressions Performance in Java – Java中的正则表达式性能概述

最后修改: 2018年 8月 17日

中文/混合/英文(键盘快捷键:t)

1. Overview

1.概述

In this quick tutorial, we’ll show how the pattern-matching engine works. We’ll also present different ways to optimize regular expressions in Java.

在这个快速教程中,我们将展示模式匹配引擎如何工作。我们还将介绍在Java中优化regular expressions的不同方法。

For an introduction to the use of regular expressions, please refer to this article here.

关于正则表达式的使用介绍,请参考这里的文章

2. The Pattern-Matching Engine

2.模式匹配引擎

The java.util.regex package uses a type of pattern-matching engine called a Nondeterministic Finite Automaton (NFA). It’s considered nondeterministic because while trying to match a regular expression on a given string, each character in the input might be checked several times against different parts of the regular expression.

java.util.regex包使用了一种叫做非确定性有限自动机(NFA)的模式匹配引擎。它被认为是非确定性的,因为在试图匹配一个给定字符串的正则表达式时,输入的每个字符可能会根据正则表达式的不同部分被检查几次。

In the background, the engine mentioned above uses backtracking. This general algorithm tries to exhaust all possibilities until it declares failure. Consider the following example to better understand the NFA:

在后台,上面提到的引擎使用反向追踪。这个一般的算法试图穷尽所有的可能性,直到它宣布失败。考虑一下下面的例子,以更好地理解NFA

"tra(vel|ce|de)m"

With the input String “travel“, the engine first will look for “tra” and find it immediately.

在输入Stringtravel“时,引擎首先会寻找”tra“并立即找到它。

After that, it’ll try to match “vel” starting from the fourth character. This will match, so it will go forward and try to match “m“.

之后,它将尝试从第四个字符开始匹配”vel“。这将匹配,所以它将继续前进并尝试匹配”m“。

That won’t match, and for that reason, it’ll go back to the fourth character and search for “ce“. Again, this won’t match, so it’ll go back again to the fourth position and try with “de“. That string won’t match either, and so it’ll go back to the second character in the input string and try to search for another “tra“.

这不会匹配,为此,它将回到第四个字符并搜索”ce“。同样,这也不匹配,所以它将再次回到第四个位置并尝试使用”de“。这个字符串也不匹配,所以它将回到输入字符串的第二个字符,并尝试搜索另一个”tra“。

With the last failure, the algorithm will return failure.

随着最后一次失败,该算法将返回失败。

With the simple last example, the engine had to backtrack several times while trying to match the input String to the regular expression. Because of that, it’s important to minimize the amount of backtracking that it does.

在上一个简单的例子中,引擎在试图将输入的String与正则表达式相匹配时不得不多次回溯。正因为如此,尽量减少它的回溯量很重要。

3. Ways to Optimize Regular Expressions

3.优化常规表达式的方法

3.1. Avoid Recompilation

3.1.避免重新编译

Regular expressions in Java are compiled into an internal data structure. This compilation is the time-consuming process.

Java中的正则表达式被编译成一个内部数据结构。这种编译是很耗时的过程。

Each time we invoke the String.matches(String regex) method, the specified regular expression is recompiled:

每次我们调用String.matches(String regex) 方法时,指定的正则表达式会被重新编译。

if (input.matches(regexPattern)) {
    // do something
}

As we can see, every time the condition is evaluated, the regex expression is compiled.

正如我们所看到的,每次条件被评估时,都会对regex表达式进行编译。

To optimize, it’s possible to compile the pattern first and then create a Matcher to find the coincidences in the value:

为了优化,可以先编译模式,然后创建一个Matcher来寻找值中的重合点。

Pattern pattern = Pattern.compile(regexPattern);
for(String value : values) {
    Matcher matcher = pattern.matcher(value);
    if (matcher.matches()) {
        // do something
    }
}

An alternative to the above optimization is using the same Matcher instance with its reset() method:

上述优化的一个替代方案是使用同一个Matcher实例及其reset()方法。

Pattern pattern = Pattern.compile(regexPattern);
Matcher matcher = pattern.matcher("");
for(String value : values) {
    matcher.reset(value);
    if (matcher.matches()) {
      // do something
    }
}

Due to the fact of Matcher isn’t thread safe, we have to be cautious with the use of this variation. It could be likely dangerous in multi-threaded scenarios.

由于Matcher不是线程安全的,我们必须谨慎地使用这个变量。在多线程的情况下,它可能会有危险。

To summarize, in every situation where we’re sure there’s only one user of the Matcher at any point in time, it’s OK to reuse it with reset. For the rest, reusing the precompiled it’s enough.

总结一下,在每一种情况下,如果我们确定在任何时候只有一个Matcher的用户,那么用reset重用它是可以的。对于其他情况,重用预编译的就足够了。

3.2. Working with Alternation

3.2.使用交替法

As we just checked in the last section, the inadequate use of alternations could be harmful to the performance. It’s important to place options that are more likely to happen in the front so they can be matched faster.

正如我们在上一节中所检查的那样,对交替的使用不充分可能会对性能产生危害。把更容易发生的选项放在前面是很重要的,这样可以更快地匹配。

Also, we have to extract common patterns between them. It isn’t the same to put:

另外,我们必须提取它们之间的共同模式。这并不是说要把。

(travel | trade | trace)

Than:

比。

tra(vel | de | ce)

The latter is faster because the NFA will try to match “tra” and won’t try any of the alternatives if it doesn’t find it.

后者更快,因为NFA将尝试匹配”tra“,如果找不到它,就不会尝试任何替代品。

3.3. Capturing Groups

3.3.捕获群体

Each time we’re capturing groups, we’re incurring in a small-time penalty.

每次我们在抓捕小组的时候,都会招致小规模的惩罚。

If we don’t need to capture the text inside a group, we should consider the use of non-capturing groups. Instead of use “(M)“, please use “(?:M)“.

如果我们不需要捕捉组内的文本,我们应该考虑使用非捕捉组。不要使用”(M)“,请使用”(?:M)“。

4. Conclusion

4.总结

In this quick article, we briefly revisited how NFA works. We then proceeded to explore how to optimize the performance of our regular expressions by pre-compiling our patterns and reuse a Matcher.

在这篇快速文章中,我们简要地重温了NFA的工作原理。然后我们继续探讨如何通过预编译模式和重用Matcher来优化正则表达式的性能。

Finally, we pointed out a couple of considerations to keep in mind while we work with alternations and groups.

最后,我们指出了一些注意事项,在我们与交替和分组工作时,要牢记在心。

As usual, the full source code can be found over on GitHub.

像往常一样,完整的源代码可以在GitHub上找到超过