1. Overview
Sometimes a string is hard to match with a regular expression alone. For example, we might not know exactly what we want to match, but we do know its surroundings, such as what comes directly before it or what must not come after it. In these cases, we can use lookaround assertions. These expressions are called assertions because they only indicate whether something matches; the text they check is not included in the result.
In this tutorial, we’ll take a look at how we can use the four types of regex lookaround assertions.
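To make the "not included in the result" point concrete, here is a minimal sketch. The class name, method name, and CSS-like sample strings are made up for illustration. The pattern matches digits only when "px" follows them, yet "px" is not part of the returned match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthDemo {

    // Returns the first run of digits that is directly followed by "px",
    // or null when there is none. The lookahead only checks for "px";
    // it does not consume it, so "px" never appears in matcher.group().
    static String widthInPixels(String input) {
        Pattern pattern = Pattern.compile("\\d+(?=px)");
        Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(widthInPixels("width: 100px")); // 100
        System.out.println(widthInPixels("width: 100em")); // null
    }
}
```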
2. Positive Lookahead
Let’s say we’d like to analyze the imports of Java files. First, let’s look for import statements that are static by checking that the static keyword follows the import keyword.
Let’s use a positive lookahead assertion with the (?=criteria) syntax in our expression to match the group of characters static after our main expression import:
Pattern pattern = Pattern.compile("import (?=static).+");
Matcher matcher = pattern
.matcher("import static org.junit.jupiter.api.Assertions.assertEquals;");
assertTrue(matcher.find());
assertEquals("import static org.junit.jupiter.api.Assertions.assertEquals;", matcher.group());
assertFalse(pattern.matcher("import java.util.regex.Matcher;").find());
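As a possible next step, the same pattern could be applied to several lines at once. The sketch below assumes a hypothetical helper class, StaticImportFilter, that is not part of the original example. It keeps only the lines that the positive lookahead accepts.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StaticImportFilter {

    // Same pattern as in the example above: "import " must be followed by "static".
    private static final Pattern STATIC_IMPORT = Pattern.compile("import (?=static).+");

    // Keeps only the lines in which the pattern is found.
    static List<String> staticImports(List<String> lines) {
        return lines.stream()
            .filter(line -> STATIC_IMPORT.matcher(line).find())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "import java.util.regex.Matcher;",
            "import static org.junit.jupiter.api.Assertions.assertEquals;");
        // Only the static import remains in the printed list.
        System.out.println(staticImports(lines));
    }
}
```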
3. Negative Lookahead
Next, let’s do the direct opposite of the previous example and look for import statements that are not static. Let’s do this by checking that the static keyword does not follow the import keyword.
Let’s use a negative lookahead assertion with the (?!criteria) syntax in our expression to ensure that the group of characters static cannot match after our main expression import:
Pattern pattern = Pattern.compile("import (?!static).+");
Matcher matcher = pattern.matcher("import java.util.regex.Matcher;");
assertTrue(matcher.find());
assertEquals("import java.util.regex.Matcher;", matcher.group());
assertFalse(pattern
.matcher("import static org.junit.jupiter.api.Assertions.assertEquals;").find());
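One subtlety worth sketching: (?!static) rejects any import whose first segment merely starts with the letters static. The variant below adds a word boundary, which is a suggested refinement and not part of the original example. The package name staticutils is hypothetical and used only for illustration.

```java
import java.util.regex.Pattern;

class NonStaticImportCheck {

    // Pattern from the example above.
    private static final Pattern LOOSE = Pattern.compile("import (?!static).+");

    // Assumed refinement: \b makes the lookahead reject only the whole word "static".
    private static final Pattern STRICT = Pattern.compile("import (?!static\\b).+");

    static boolean looseMatch(String line) {
        return LOOSE.matcher(line).find();
    }

    static boolean strictMatch(String line) {
        return STRICT.matcher(line).find();
    }

    public static void main(String[] args) {
        String line = "import staticutils.Helper;"; // hypothetical package name
        System.out.println(looseMatch(line));  // false: "static" prefix triggers the lookahead
        System.out.println(strictMatch(line)); // true: "staticutils" is not the word "static"
    }
}
```

Both patterns still reject a genuine static import and accept an ordinary one such as import java.util.regex.Matcher;.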
4. Limitations of Lookbehind in Java
Up to and including Java 8, unbounded quantifiers such as + and * are not allowed inside a lookbehind assertion. For example, the following assertions throw a PatternSyntaxException on Java 8 and earlier:
- (?<!fo+)bar, where we don’t want to match bar if an f followed by one or more o characters comes before it
- (?<!fo*)bar, where we don’t want to match bar if it is preceded by an f character followed by zero or more o characters
- (?<!fo{2,})bar, where we don’t want to match bar if an f followed by two or more o characters comes before it
As a workaround, we can use a curly-brace quantifier with an explicit upper limit, for example (?<!fo{2,4})bar, which caps the number of o characters following the f character at 4.
Since Java 9, we can use unbounded quantifiers in lookbehinds. However, because of the memory consumption of the regex implementation, it is still recommended to use only quantifiers with a sensible upper limit in lookbehinds, for example (?<!fo{2,20})bar instead of (?<!fo{2,2000})bar.
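The bounded workaround has a side effect that a short sketch can illustrate: input exceeding the upper limit is no longer rejected. The class and helper names below are hypothetical. Whether the unbounded pattern compiles depends on the JVM version, so the sketch only reports that result and does not assume it.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class LookbehindBoundsDemo {

    // True when the regex compiles on the running JVM.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    // True when the regex is found anywhere in the input.
    static boolean findsBar(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Version dependent, so the outcome is printed rather than assumed.
        System.out.println("unbounded compiles: " + compiles("(?<!fo+)bar"));

        String bounded = "(?<!fo{2,4})bar";
        System.out.println(findsBar(bounded, "foobar"));    // false: "foo" precedes bar
        System.out.println(findsBar(bounded, "xbar"));      // true: nothing to reject
        System.out.println(findsBar(bounded, "fooooobar")); // true: five o's exceed the limit of 4
    }
}
```

The last call shows the trade-off: the upper limit has to be large enough for the inputs we expect, while staying small enough to keep the lookbehind cheap.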
5. Positive Lookbehind
Let’s say we’d like to differentiate between JUnit 4 and JUnit 5 imports in our analysis. First, let’s check if an import statement for the assertEquals method is from the jupiter package.
Let’s use a positive lookbehind assertion with the (?<=criteria) syntax in our expression to match the character group jupiter before our main expression .*assertEquals:
Pattern pattern = Pattern.compile(".*(?<=jupiter).*assertEquals;");
Matcher matcher = pattern
.matcher("import static org.junit.jupiter.api.Assertions.assertEquals;");
assertTrue(matcher.find());
assertEquals("import static org.junit.jupiter.api.Assertions.assertEquals;", matcher.group());
assertFalse(pattern.matcher("import static org.junit.Assert.assertEquals;").find());
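A positive lookbehind is also handy for extracting only what follows a known prefix. The sketch below is an illustration, not part of the original example, and its class and method names are hypothetical. It returns the name of a member that is statically imported from the Jupiter Assertions class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JupiterAssertionName {

    // The lookbehind requires "jupiter.api.Assertions." right before the match,
    // but that prefix is not part of the matched text.
    private static final Pattern ASSERTION_NAME =
        Pattern.compile("(?<=jupiter\\.api\\.Assertions\\.)\\w+");

    // Returns the imported member name, or null if the line is not a Jupiter assertion import.
    static String importedAssertion(String line) {
        Matcher matcher = ASSERTION_NAME.matcher(line);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(importedAssertion(
            "import static org.junit.jupiter.api.Assertions.assertEquals;")); // assertEquals
        System.out.println(importedAssertion(
            "import static org.junit.Assert.assertEquals;"));                 // null
    }
}
```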
6. Negative Lookbehind
Next, let’s do the direct opposite of the previous example and look for import statements that are not from the jupiter package.
To do this, let’s use a negative lookbehind assertion with the (?<!criteria) syntax in our expression to ensure that the group of characters jupiter.{0,30} cannot match before our main expression assertEquals:
Pattern pattern = Pattern.compile(".*(?<!jupiter.{0,30})assertEquals;");
Matcher matcher = pattern.matcher("import static org.junit.Assert.assertEquals;");
assertTrue(matcher.find());
assertEquals("import static org.junit.Assert.assertEquals;", matcher.group());
assertFalse(pattern
.matcher("import static org.junit.jupiter.api.Assertions.assertEquals;").find());
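The {0,30} window is a bound, so it only looks 30 characters back past jupiter. The sketch below reuses the pattern from the example and adds a deliberately long, hypothetical package path to show where that bound stops helping. The class and method names are also made up for illustration.

```java
import java.util.regex.Pattern;

class NonJupiterCheck {

    // Pattern from the example above: reject assertEquals when "jupiter"
    // appears at most 30 characters before it.
    private static final Pattern NON_JUPITER =
        Pattern.compile(".*(?<!jupiter.{0,30})assertEquals;");

    static boolean looksNonJupiter(String line) {
        return NON_JUPITER.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(looksNonJupiter(
            "import static org.junit.Assert.assertEquals;"));                 // true
        System.out.println(looksNonJupiter(
            "import static org.junit.jupiter.api.Assertions.assertEquals;")); // false

        // Hypothetical import: more than 30 characters separate "jupiter" from
        // "assertEquals", so the lookbehind no longer sees it and the line matches.
        System.out.println(looksNonJupiter(
            "import static org.junit.jupiter.some.very.deeply.nested.hypothetical.pkg.Assertions.assertEquals;")); // true
    }
}
```

If such long paths can occur in the analyzed code, the window would need to be widened accordingly.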
7. Conclusion
In this article, we’ve seen how to use the four types of regex lookaround to solve some difficult cases of matching strings with regex.
As always, the source code for this article is available over on GitHub.