Get the Indexes of Regex Pattern Matches in Java

Last modified: August 26, 2023

1. Overview

In Java programming, dealing with strings and patterns is essential to many applications. Regular expressions, commonly known as regex, provide a powerful tool for pattern matching and manipulation.

Sometimes, we not only need to identify matches within a string but also locate exactly where these matches occur. In this tutorial, we’ll explore getting the indexes of regex pattern matches in Java.

2. Introduction to the Problem

Let’s start with a String example:

String INPUT = "This line contains <the first value>, <the second value>, and <the third value>.";

Let’s say we want to extract all “<…>” segments from the string above, such as “<the first value>” and “<the second value>”.

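To match these segments, we can use a negated (NOR) character class in the regex: “<[^>]*>”. Here, [^>] matches any single character except ‘>’, so each match stops at the first closing bracket.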
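
To see why the negated class matters, we can compare it with a greedy “<.*>” pattern. The following is only a minimal sketch: the class name NegatedClassSketch, the helper findAll(), and the short sample input are assumptions made for illustration and aren't part of the tutorial's code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassSketch {

    // Hypothetical helper: collects every match of the given regex in the input.
    static List<String> findAll(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "a <x>, b <y>";
        // The negated class stops at the first '>' after each '<'
        System.out.println(findAll("<[^>]*>", input)); // [<x>, <y>]
        // The greedy dot runs on to the last '>' in the input
        System.out.println(findAll("<.*>", input));    // [<x>, b <y>]
    }
}
```

With the greedy variant, the two segments collapse into one match, which isn't what we want here.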

In Java, the Pattern and Matcher classes from the Regex API are important tools for working with pattern matching. These classes provide methods to compile regex patterns and apply them to strings for various operations.

So next, let’s use Pattern and Matcher to extract the desired text. For simplicity, we’ll use AssertJ assertions to verify whether we obtained the expected result:

Pattern pattern = Pattern.compile("<[^>]*>");
Matcher matcher = pattern.matcher(INPUT);
List<String> result = new ArrayList<>();
while (matcher.find()) {
    result.add(matcher.group());
}
assertThat(result).containsExactly("<the first value>", "<the second value>", "<the third value>");

As the code above shows, we extracted all “<…>” parts from the input String. However, sometimes, we want to know exactly where matches are located in the input. In other words, we want to obtain the matches and their indexes in the input string.

Next, let’s extend this code to achieve our goals.

3. Obtaining Indexes of Matches

We’ve used the Matcher class to extract the matches. The Matcher class offers two methods, start() and end(), which allow us to obtain each match’s start and end indexes. 

It’s worth noting that the Matcher.end() method returns the index after the last character of the matched subsequence. An example can show this clearly:

Pattern pattern = Pattern.compile("456");
Matcher matcher = pattern.matcher("0123456789");
String result = null;
int startIdx = -1;
int endIdx = -1;
if (matcher.find()) {
    result = matcher.group();
    startIdx = matcher.start();
    endIdx = matcher.end();
}
assertThat(result).isEqualTo("456");
assertThat(startIdx).isEqualTo(4);
assertThat(endIdx).isEqualTo(7); // matcher.end() returns 7 instead of 6
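
Because the end index is exclusive, the pair fits String.substring(beginIndex, endIndex) directly, and end minus start gives the match length. A minimal sketch of this relationship, assuming a hypothetical class EndIndexSketch and helper firstMatchRange() that aren't part of the tutorial's code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EndIndexSketch {

    // Hypothetical helper: returns {start, end} of the first match, or null if there is none.
    static int[] firstMatchRange(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return null;
        }
        return new int[] { matcher.start(), matcher.end() };
    }

    public static void main(String[] args) {
        String input = "0123456789";
        int[] range = firstMatchRange("456", input);
        // end is exclusive, so it plugs straight into substring(begin, end)
        System.out.println(input.substring(range[0], range[1])); // 456
        // end - start is the length of the match
        System.out.println(range[1] - range[0]); // 3
    }
}
```

Note that start() and end() throw an IllegalStateException if no successful match has been attempted yet, which is why the sketch checks the result of find() first.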

Now that we understand what start() and end() return, let’s see if we can obtain the indexes of each matched “<…>” subsequence in our INPUT:

Pattern pattern = Pattern.compile("<[^>]*>");
Matcher matcher = pattern.matcher(INPUT);
List<String> result = new ArrayList<>();
Map<Integer, Integer> indexesOfMatches = new LinkedHashMap<>();
while (matcher.find()) {
    result.add(matcher.group());
    indexesOfMatches.put(matcher.start(), matcher.end());
}
assertThat(result).containsExactly("<the first value>", "<the second value>", "<the third value>");
assertThat(indexesOfMatches.entrySet()).map(entry -> INPUT.substring(entry.getKey(), entry.getValue()))
  .containsExactly("<the first value>", "<the second value>", "<the third value>");

As the test above shows, we stored each match’s start() and end() results in a LinkedHashMap to preserve the insertion order. Then, we extracted substrings from the original input by these index pairs. If we obtained the correct indexes, these substrings must equal the matches.

If we give this test a run, it passes.
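
As an alternative to the explicit loop, on Java 9 or later we could collect the same index pairs with Matcher.results(), which streams one MatchResult per match. This is only a sketch of that variant, not the approach used in the test above; the class name MatchResultsSketch and the helper matchRanges() are assumptions for illustration:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class MatchResultsSketch {

    // Hypothetical helper: returns one {start, end} pair per match, in match order.
    static List<int[]> matchRanges(String regex, String input) {
        return Pattern.compile(regex)
          .matcher(input)
          .results() // Stream<MatchResult>, available since Java 9
          .map(result -> new int[] { result.start(), result.end() })
          .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String input = "This line contains <the first value>, <the second value>, and <the third value>.";
        for (int[] range : matchRanges("<[^>]*>", input)) {
            // Each pair reproduces the matched text via substring(start, end)
            System.out.println(range[0] + "-" + range[1] + ": " + input.substring(range[0], range[1]));
        }
    }
}
```

A list of pairs keeps the matches in order without relying on map keys, which may read more naturally when we only iterate over the results.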

4. Obtaining Indexes of Matches With Capturing Groups

In regex, capturing groups play a crucial role by allowing us to reference them later or conveniently extract sub-patterns.

To illustrate, suppose we aim to extract the content enclosed between ‘<’ and ‘>’. In such cases, we can create a pattern that incorporates a capturing group: “<([^>]*)>”. As a result, Matcher.group(1) returns the text “the first value”, “the second value”, and so on.

Group 0 always denotes the entire match, whether or not the pattern defines explicit capturing groups. Therefore, invoking Matcher.group() is equivalent to calling Matcher.group(0).
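
We can check this equivalence with a short sketch. The class name GroupZeroSketch, the helper groupZeroIsWholeMatch(), and the sample input are assumptions made for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupZeroSketch {

    // Hypothetical helper: true if the no-arg accessors agree with their group-0
    // variants for the first match of the regex in the input.
    static boolean groupZeroIsWholeMatch(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.find()) {
            return false;
        }
        return matcher.group().equals(matcher.group(0))
          && matcher.start() == matcher.start(0)
          && matcher.end() == matcher.end(0);
    }

    public static void main(String[] args) {
        // Without an explicit capturing group
        System.out.println(groupZeroIsWholeMatch("<[^>]*>", "a <x> b"));   // true
        // With an explicit capturing group, group 0 is still the whole match
        System.out.println(groupZeroIsWholeMatch("<([^>]*)>", "a <x> b")); // true
    }
}
```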

Like Matcher.group(), the Matcher.start() and Matcher.end() methods accept a group index as an argument. With an index, they return the start and end indexes of the text matched by that group:

Pattern pattern = Pattern.compile("<([^>]*)>");
Matcher matcher = pattern.matcher(INPUT);
List<String> result = new ArrayList<>();
Map<Integer, Integer> indexesOfMatches = new LinkedHashMap<>();
while (matcher.find()) {
    result.add(matcher.group(1));
    indexesOfMatches.put(matcher.start(1), matcher.end(1));
}
assertThat(result).containsExactly("the first value", "the second value", "the third value");
assertThat(indexesOfMatches.entrySet()).map(entry -> INPUT.substring(entry.getKey(), entry.getValue()))
  .containsExactly("the first value", "the second value", "the third value");
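
Two related details are worth a quick sketch. Since Java 8, start() and end() also accept a group name, and both return -1 for a group that didn't take part in an otherwise successful match. The class name GroupIndexSketch, the group name "content", the helpers, and the sample inputs below are assumptions for illustration only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupIndexSketch {

    // Hypothetical helper: {start, end} of the named group "content" in the first match,
    // or null if the input contains no match.
    static int[] contentRange(String input) {
        Matcher matcher = Pattern.compile("<(?<content>[^>]*)>").matcher(input);
        if (!matcher.find()) {
            return null;
        }
        return new int[] { matcher.start("content"), matcher.end("content") };
    }

    // Hypothetical helper: start index of group 1 in the first match of "(a)|(b)".
    static int startOfFirstAlternative(String input) {
        Matcher matcher = Pattern.compile("(a)|(b)").matcher(input);
        if (!matcher.find()) {
            throw new IllegalArgumentException("no match in: " + input);
        }
        return matcher.start(1);
    }

    public static void main(String[] args) {
        String input = "x <abc> y";
        int[] range = contentRange(input);
        // The group range excludes the surrounding brackets
        System.out.println(input.substring(range[0], range[1])); // abc
        // Group 1 matched, so we get its real start index
        System.out.println(startOfFirstAlternative("a")); // 0
        // Only group 2 matched, so group 1 reports -1
        System.out.println(startOfFirstAlternative("b")); // -1
    }
}
```

If a pattern contains optional or alternative groups, checking for -1 before calling substring() avoids a StringIndexOutOfBoundsException.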

5. Conclusion

In this article, we explored obtaining the indexes of pattern matches within the original input when dealing with regex. We discussed scenarios involving patterns with and without explicitly defined capturing groups.

As always, the complete source code for the examples is available over on GitHub.
