Non-Capturing Regex Groups in Java

Last modified: June 13, 2021

1. Overview

Non-capturing groups are important constructs within Java Regular Expressions. They create a sub-pattern that functions as a single unit but does not save the matched character sequence. In this tutorial, we’ll explore how to use non-capturing groups in Java Regular Expressions.

2. Regular Expression Groups

Regular expression groups can be one of two types: capturing and non-capturing.

Capturing groups save the matched character sequence. Their values can be used as backreferences in the pattern and/or retrieved later in code.
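
As a minimal sketch of both uses, the hypothetical class below (the pattern and names are ours, not from the tutorial's source code) finds a doubled word. The capturing group “(\w+)” saves the word, the backreference “\1” reuses it inside the pattern, and group(1) retrieves it in code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CapturingGroupSketch {

    // "(\\w+)" is capturing group 1; "\\1" must match the same text again.
    private static final Pattern REPEATED_WORD = Pattern.compile("\\b(\\w+) \\1\\b");

    /** Returns the first doubled word in the text, or null if there is none. */
    static String firstRepeatedWord(String text) {
        Matcher matcher = REPEATED_WORD.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstRepeatedWord("it was the the best of times"));
    }
}
```

Calling firstRepeatedWord() on a sentence without a doubled word returns null, because find() never succeeds.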

Although they don’t save the matched character sequence, non-capturing groups can alter pattern matching modifiers within the group. Some non-capturing groups can even discard backtracking information after a successful sub-pattern match.

Let’s explore some examples of non-capturing groups in action.

3. Non-Capturing Groups

A non-capturing group is created with the operator “(?:X)”. The “X” is the pattern for the group:

Pattern.compile("[^:]+://(?:[.a-z]+/?)+")

This pattern has a single non-capturing group. It will match a value if it is URL-like. A full regular expression for a URL would be much more involved. We’re using a simple pattern to focus on non-capturing groups.

The pattern “[^:]+://” matches the protocol and its separator, for example, “http://”. The non-capturing group “(?:[.a-z]+/?)” matches the domain name with an optional slash. Since the “+” operator matches one or more occurrences of this group, we’ll match the subsequent path segments as well. Let’s test this pattern on a URL:

Pattern simpleUrlPattern = Pattern.compile("[^:]+://(?:[.a-z]+/?)+");
Matcher urlMatcher
  = simpleUrlPattern.matcher("http://www.microsoft.com/some/other/url/path");
    
Assertions.assertThat(urlMatcher.matches()).isTrue();
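
To see why the group matters for the “+” quantifier, here is a small sketch with toy patterns of our own (not part of the tutorial's code). With the group, “+” repeats the whole sub-pattern; without it, “+” applies only to the last character:

```java
import java.util.regex.Pattern;

final class GroupAsUnitSketch {

    /** "(?:ab)+" repeats the whole unit "ab" one or more times. */
    static boolean matchesGrouped(String input) {
        return Pattern.matches("(?:ab)+", input);
    }

    /** "ab+" repeats only the "b": one "a" followed by one or more "b". */
    static boolean matchesUngrouped(String input) {
        return Pattern.matches("ab+", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesGrouped("ababab"));   // whole unit repeated
        System.out.println(matchesUngrouped("ababab")); // only "b" may repeat
    }
}
```

The same mechanism lets our URL pattern repeat “a run of letters and dots, then an optional slash” once per path segment.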

Let’s see what happens when we try to retrieve the matched text:

Pattern simpleUrlPattern = Pattern.compile("[^:]+://(?:[.a-z]+/?)+");
Matcher urlMatcher = simpleUrlPattern.matcher("http://www.microsoft.com/");
    
Assertions.assertThat(urlMatcher.matches()).isTrue();
Assertions.assertThatThrownBy(() -> urlMatcher.group(1))
  .isInstanceOf(IndexOutOfBoundsException.class);

The regular expression is compiled into a java.util.regex.Pattern object. Then, we create a java.util.regex.Matcher to apply our Pattern to the provided value.

Next, we assert that matches() returns true.

We used a non-capturing group to match the domain name in the URL. Since non-capturing groups do not save matched text, we cannot retrieve the matched text “www.microsoft.com/”. The pattern contains no capturing groups at all, so calling group(1) to retrieve the domain name results in an IndexOutOfBoundsException.
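
The sketch below illustrates this from another angle. The class, its method names, and the capturing variant “[^:]+://([.a-z]+)/?” are assumptions made for illustration only. It shows that the non-capturing pattern reports zero groups, that the whole match is still available through group(), and that a capturing group is what we would need to read the host:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GroupCountSketch {

    static final String NON_CAPTURING = "[^:]+://(?:[.a-z]+/?)+";
    // Hypothetical variant that captures the host in group 1.
    static final String CAPTURING = "[^:]+://([.a-z]+)/?";

    /** Number of capturing groups in the regex; group 0 is not counted. */
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    /** The entire matched text (group 0), or null when the input does not match. */
    static String wholeMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        return matcher.matches() ? matcher.group() : null;
    }

    /** The host captured by the hypothetical capturing variant, or null. */
    static String host(String input) {
        Matcher matcher = Pattern.compile(CAPTURING).matcher(input);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://www.microsoft.com/";
        System.out.println(groupCount(NON_CAPTURING));
        System.out.println(wholeMatch(NON_CAPTURING, url));
        System.out.println(host(url));
    }
}
```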

3.1. Inline Modifiers

Regular expressions are case-sensitive by default. If we apply our pattern to a mixed-case URL, the match will fail:

Pattern simpleUrlPattern
  = Pattern.compile("[^:]+://(?:[.a-z]+/?)+");
Matcher urlMatcher
  = simpleUrlPattern.matcher("http://www.Microsoft.com/");
    
Assertions.assertThat(urlMatcher.matches()).isFalse();

In the case where we want to match uppercase letters as well, there are a few options we could try.

One option is to add the uppercase character range to the pattern:

Pattern.compile("[^:]+://(?:[.a-zA-Z]+/?)+")

Another option is to use modifier flags. So, we can compile the regular expression to be case-insensitive:

Pattern.compile("[^:]+://(?:[.a-z]+/?)+", Pattern.CASE_INSENSITIVE)

Non-capturing groups allow for a third option: We can change the modifier flags for just the group. Let’s add the case-insensitive modifier flag (“i“) to the group:

Pattern.compile("[^:]+://(?i:[.a-z]+/?)+");

Now that we’ve made the group case-insensitive, let’s apply this pattern to a mixed-case URL:

Pattern scopedCaseInsensitiveUrlPattern
  = Pattern.compile("[^:]+://(?i:[.a-z]+/?)+");
Matcher urlMatcher
  = scopedCaseInsensitiveUrlPattern.matcher("http://www.Microsoft.com/");
    
Assertions.assertThat(urlMatcher.matches()).isTrue();
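
In our URL pattern, everything after “://” sits inside the group, so the scoping is hard to see. As a rough illustration, the toy pattern below (our own, not from the tutorial's code) wraps only the middle part in “(?i:…)”. The literal text around the group stays case-sensitive:

```java
import java.util.regex.Pattern;

final class ScopedFlagSketch {

    // Only "[a-z]+" is case-insensitive; "id-" and "-end" must match exactly.
    private static final Pattern SCOPED = Pattern.compile("id-(?i:[a-z]+)-end");

    static boolean matches(String input) {
        return SCOPED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("id-ABC-end")); // uppercase inside the group
        System.out.println(matches("ID-abc-end")); // uppercase outside the group
    }
}
```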

When a pattern is compiled to be case-insensitive, we can turn case-insensitivity off for a single group by adding the “-” operator in front of the modifier. Let’s apply such a pattern to another mixed-case URL:

Pattern scopedCaseSensitiveUrlPattern
  = Pattern.compile("[^:]+://(?-i:[.a-z]+/?)+/ending-path", Pattern.CASE_INSENSITIVE);
Matcher urlMatcher
  = scopedCaseSensitiveUrlPattern.matcher("http://www.Microsoft.com/ending-path");
  
Assertions.assertThat(urlMatcher.matches()).isFalse();

In this example, the final path segment “/ending-path” is case-insensitive. The “/ending-path” portion of the pattern will match uppercase and lowercase characters.

When we turned off the case-insensitive option within the group, the non-capturing group only supported lowercase characters. Therefore, the mixed-case domain name did not match.
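
To check both halves of this claim, the sketch below reuses the same pattern in a hypothetical helper class of our own. An uppercase ending path still matches, while an uppercase letter in the domain does not:

```java
import java.util.regex.Pattern;

final class ScopedCaseSensitiveSketch {

    // The whole pattern is case-insensitive except for the "(?-i:...)" group.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "[^:]+://(?-i:[.a-z]+/?)+/ending-path", Pattern.CASE_INSENSITIVE);

    static boolean matches(String url) {
        return URL_PATTERN.matcher(url).matches();
    }

    public static void main(String[] args) {
        // The literal "/ending-path" is outside the group, so case is ignored there.
        System.out.println(matches("http://www.microsoft.com/ENDING-PATH"));
        // The "M" falls inside the case-sensitive group.
        System.out.println(matches("http://www.Microsoft.com/ending-path"));
    }
}
```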

4. Independent Non-Capturing Groups

Independent non-capturing groups are a type of regular expression group. These groups discard backtracking information after finding a successful match. When using this type of group, we need to be aware of when backtracking can occur. Otherwise, our patterns may not match the values we think they should.

Backtracking is a feature of Nondeterministic Finite Automaton (NFA) regular expression engines. When the engine fails to match text, the NFA engine can explore alternatives in the pattern. The engine will fail the match after exhausting all available alternatives. We only cover backtracking as it relates to independent non-capturing groups.
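
Backtracking is normally invisible, but we can observe its effect with a toy pattern of our own, “(a*)ab”. The capturing group is there only so we can inspect what the greedy “a*” ended up with. On the input “aaab”, “a*” first consumes all three “a” characters, then gives one back so that the trailing “ab” can match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BacktrackingSketch {

    private static final Pattern GREEDY = Pattern.compile("(a*)ab");

    /** Text kept by the greedy "a*" after backtracking, or null if there is no match. */
    static String greedyPart(String input) {
        Matcher matcher = GREEDY.matcher(input);
        return matcher.matches() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(greedyPart("aaab"));
    }
}
```

For “aaab”, group 1 holds “aa” rather than “aaa”, which is the trace the backtracking step leaves behind.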

An independent non-capturing group is created with the operator “(?>X)” where X is the sub-pattern:

Pattern.compile("[^:]+://(?>[.a-z]+/?)+/ending-path");

We have added “/ending-path” as a constant path segment. Having this additional requirement forces a backtracking situation. Inside the group, both the domain name and the other path segments can consume a trailing slash. To match “/ending-path”, the engine will need to backtrack. By backtracking, the engine can remove the slash from the group and apply it to the “/ending-path” portion of the pattern.

Let’s apply our independent non-capturing group pattern to a URL:

Pattern independentUrlPattern
  = Pattern.compile("[^:]+://(?>[.a-z]+/?)+/ending-path");
Matcher independentMatcher
  = independentUrlPattern.matcher("http://www.microsoft.com/ending-path");
    
Assertions.assertThat(independentMatcher.matches()).isFalse();

The group matches the domain name and the slash successfully. So, we leave the scope of the independent non-capturing group.

This pattern requires a slash to appear before “ending-path”. However, our independent non-capturing group has matched the slash.

The NFA engine would normally try backtracking here. Since the slash is optional at the end of the group, the engine would remove the slash from the group and try again. However, the independent non-capturing group has discarded the backtracking information, so the engine cannot backtrack and the match fails.
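
For comparison, the sketch below runs the same URL through both group types. The helper class and method names are ours. The only difference between the two patterns is “(?:” versus “(?>”:

```java
import java.util.regex.Pattern;

final class IndependentGroupSketch {

    /** Plain non-capturing group: the engine may give the trailing slash back. */
    static boolean plainGroupMatches(String url) {
        return Pattern.matches("[^:]+://(?:[.a-z]+/?)+/ending-path", url);
    }

    /** Independent group: text consumed by a finished group match is never given back. */
    static boolean independentGroupMatches(String url) {
        return Pattern.matches("[^:]+://(?>[.a-z]+/?)+/ending-path", url);
    }

    public static void main(String[] args) {
        String url = "http://www.microsoft.com/ending-path";
        System.out.println(plainGroupMatches(url));
        System.out.println(independentGroupMatches(url));
    }
}
```

With the plain group the URL matches, because the engine can return the slash to the rest of the pattern. With the independent group it does not, as the assertion above shows.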

4.1. Backtracking Inside the Group

Backtracking can occur within an independent non-capturing group. While the NFA engine is matching the group, the backtracking information has not been discarded. The backtracking information is not discarded until after the group matches successfully:

Pattern independentUrlPatternWithBacktracking
  = Pattern.compile("[^:]+://(?>(?:[.a-z]+/?)+/)ending-path");
Matcher independentMatcher
  = independentUrlPatternWithBacktracking.matcher("http://www.microsoft.com/ending-path");
    
Assertions.assertThat(independentMatcher.matches()).isTrue();

Now we have a non-capturing group within an independent non-capturing group. We still have a backtracking situation involving the slash in front of “ending-path”. However, we have enclosed the backtracking portion of the pattern, including that slash, inside the independent non-capturing group. The backtracking will occur within the independent non-capturing group, before its backtracking information is discarded. Therefore, the NFA engine has enough information to backtrack, and the pattern matches the provided URL.

5. Conclusion

We’ve shown that non-capturing groups are different from capturing groups. However, they function as a single unit like their capturing counterparts. We have also shown that non-capturing groups can enable or disable the modifiers for the group instead of the pattern as a whole.

Similarly, we’ve shown how independent non-capturing groups discard backtracking information. Without this information, the NFA engine cannot explore alternatives to make a successful match. However, backtracking can occur within the group.

As always, the source code is available over on GitHub.
