Finding the N-th Occurrence of a Substring in a String in Java – 用 Java 查找字符串中第 N 次出现的子串

最后修改: 2023年 11月 7日

中文/混合/英文(键盘快捷键:t)

1. Overview

1.概述

Locating a substring in a larger string is a common operation when we work with Java. Usually, we can find a substring’s index using the indexOf() method.

在较大的字符串中查找子串是我们使用 Java 时的常见操作。通常,我们可以使用 indexOf() 方法查找子串的索引。

In this tutorial, we’ll explore various approaches to solve an interesting and pragmatic problem: finding the n-th occurrence of a substring in a larger string.

在本教程中,我们将探索各种方法来解决一个有趣而实用的问题:在一个较大的字符串中查找第 n 次出现的子串。

2. Introduction to the Problem

2.问题介绍

The standard indexOf() method can give us the index of a substring in a string. For example, we can find the index (8) of the substring “a” in “This is a word.“:

标准的 indexOf() 方法可以为我们提供字符串中子串的索引。例如,我们可以找到”This is a word.“中子串 “a” 的索引 (8):

int firstIdx = "This is a word.".indexOf("a");
assertEquals(8, firstIdx);

However, the indexOf(substring) method always returns the index of the substring’s first occurrence. Therefore, sometimes, it’s inconvenient when we want to know a substring’s n-th occurrence index using this method. Let’s see an example:

但是,indexOf(子串)方法总是返回子串第一次出现的索引。因此,有时当我们想知道子串的 n 次出现索引时,使用该方法会很不方便。让我们来看一个例子:

final static String INPUT = "a word, a word, a word, a word";
// indexes of "a":          "0       8       16      24 "

As the example above shows, INPUT contains four “a“s. We can obtain the first “a” index (0) by calling indexOf(“a”) directly. However, we may need solutions to get other indexes of “a”‘s occurrences, such as 8, 16, and 24.

如上例所示,INPUT 包含四个”a“。我们可以通过直接调用 indexOf(“a”) 获得第一个 “a “索引(0)。但是,我们可能需要一些解决方案来获取 “a “出现的其他索引,例如 8、16 和 24。

Next, let’s see how to solve this problem.

接下来,让我们看看如何解决这个问题。

For simplicity, we’ll leverage unit test assertions to verify whether each approach returns the expected result.

为了简单起见,我们将利用单元测试断言来验证每种方法是否会返回预期结果。

3. Finding the Substring’s Second Occurrence

3.查找子串的第二次出现

Before we solve the “finding the n-th occurrence” problem, let’s first find the index of the second (n=2) occurrence of the substring “a”. Then, we’ll extend the “find the second occurrence” solution to various “the n-th occurrence” solutions.

在解决 “查找第 n 次出现 “问题之前,让我们先查找子串 “a “的第二次(n=2)出现的索引。然后,我们将把 “查找第二次出现 “的解法扩展到各种 “第 n 次出现 “的解法。

We’ve learned that the indexOf(“a”) returns the index of the first occurrence of “a”. Further, we can pass the fromIndex int parameter to the indexOf() method. The fromIndex parameter indicates the character index we start searching from. 

我们已经知道,indexOf(“a”) 返回的是 “a “第一次出现时的索引。此外,我们还可以向 indexOf() 方法传递 fromIndex int 参数。fromIndex参数表示我们开始搜索的字符索引。

Therefore, a straightforward idea to get the second occurrence’s index is calling indexOf() twice:

因此,获取第二次出现的索引的直接方法是调用 indexOf() 两次:

  • Get the index of the first occurrence by calling indexOf(substring) and save the result in a variable, let’s say firstIdx
  • Get the index of the second occurrence by calling indexOf(substring, firstIdx + substring.length())

Next, let’s implement this approach and test with our INPUT string:

接下来,让我们使用 INPUT 字符串来实现这种方法并进行测试:

int firstIdx = INPUT.indexOf("a");
int secondIdx = INPUT.indexOf("a", firstIdx + "a".length());
assertEquals(8, secondIdx);

Now, some of us might have realized we can call indexOf() n times with the corresponding fromIdx parameter to get the index of the n-th occurrence. For example, we can add another indexOf() call to the test above to get the third occurrence’s index: thirdIdx = INPUT.indexOf(“a”, secondIdx + “a”.length());.

现在,有些人可能已经意识到 我们可以调用 indexOf() n 次,并使用相应的 fromIdx 参数来获取第 n 次出现的索引。例如,我们可以在上述测试中添加另一次 indexOf() 调用,以获取第三次出现的索引:thirdIdx = INPUT.indexOf(“a”, secondIdx + “a”.length());.

So next, let’s extend the “finding the second occurrence” solution to “finding the n-th occurrence”.

接下来,让我们把 “找到第二次出现 “的解决方案扩展到 “找到第 n 次出现”。

4. Finding the Substring’s N-th Occurrence

4.查找子串的 N 次出现频率

Usually, we’d use recursion or looping to implement repeated operations.

通常,我们会使用递归或循环来实现重复操作。

4.1. The Recursive Approach

4.1.递归方法

So, let’s first implement the recursive solution:

因此,让我们先来实现递归解决方案:

int nthIndexOf(String input, String substring, int nth) {
    if (nth == 1) {
        return input.indexOf(substring);
    } else {
        return input.indexOf(substring, nthIndexOf(input, substring, nth - 1) + substring.length());
    }
}

The implementation is pretty straightforward. The nth variable works as a counter. We reduce its value in each recursion step.

实现方法非常简单。nth变量用作计数器。我们在每个递归步骤中减少它的值。

Next, let’s test the method with our example data:

接下来,让我们用示例数据对该方法进行测试:

int result1 = nthIndexOf(INPUT, "a", 1);
assertEquals(0, result1);

int result2 = nthIndexOf(INPUT, "a", 2);
assertEquals(8, result2);

int result3 = nthIndexOf(INPUT, "a", 3);
assertEquals(16, result3);

int result4 = nthIndexOf(INPUT, "a", 4);
assertEquals(24, result4);

int result5 = nthIndexOf(INPUT, "a", 5);
assertEquals(-1, result5);

The test passes if we give it a run. Also, as we can see, when the total occurrence count is less than the nth value, the method returns -1.

如果我们运行该方法,测试就会通过。此外,我们还可以看到,当总出现次数小于 nth 值时,该方法会返回 -1

4.2. The Iterative Approach

4.2.迭代法

Similarly, we can implement the same idea in an iterative approach:

同样,我们也可以用迭代法来实现同样的想法:

static int nthIndexOf2(String input, String substring, int nth) {
    int index = -1;
    while (nth > 0) {
        index = input.indexOf(substring, index + substring.length());
        if (index == -1) {
            return -1;
        }
        nth--;
    }
    return index;
}

The test with the same input passes as well:

同样输入的测试也通过了:

int result1 = nthIndexOf2(INPUT, "a", 1);
assertEquals(0, result1);

int result2 = nthIndexOf2(INPUT, "a", 2);
assertEquals(8, result2);

int result3 = nthIndexOf2(INPUT, "a", 3);
assertEquals(16, result3);

int result4 = nthIndexOf2(INPUT, "a", 4);
assertEquals(24, result4);

int result5 = nthIndexOf2(INPUT, "a", 5);
assertEquals(-1, result5);

5. The Regex-Based Solution

5.基于 Regex 的解决方案

We’ve seen how to solve the problem using the indexOf() method. Alternatively, we can use Java’s Regex API to solve the problem.

我们已经了解了如何使用 indexOf() 方法来解决这个问题。另外,我们还可以使用 Java 的 Regex API 来解决问题。

Matcher.find() allows us to find the next matched occurrence in the input string. Therefore, we can call Matcher.find() n times to get the n-th match. Also, we can get each match’s start index using Matcher.start():

Matcher.find()允许我们查找输入字符串中的下一个匹配项。因此,我们可以调用 Matcher.find() n 次,以获得第 n 个匹配项。此外,我们还可以使用 Matcher.start() 获得每个匹配的起始索引

int nthOccurrenceIndex(String input, String regexPattern, int nth) {
    Matcher matcher = Pattern.compile(regexPattern).matcher(INPUT);
    for (int i = 0; i < nth; i++) {
        if (!matcher.find()) {
            return -1;
        }
    }
    return matcher.start();
}

Next, let’s create a test to verify whether the regex-based solution does the job correctly:

接下来,让我们创建一个测试,以验证基于 regex 的解决方案是否正确:

int result1 = nthOccurrenceIndex(INPUT, "a", 1);
assertEquals(0, result1);

int result2 = nthOccurrenceIndex(INPUT, "a", 2);
assertEquals(8, result2);

int result3 = nthOccurrenceIndex(INPUT, "a", 3);
assertEquals(16, result3);

int result4 = nthOccurrenceIndex(INPUT, "a", 4);
assertEquals(24, result4);

int result5 = nthOccurrenceIndex(INPUT, "a", 5);
assertEquals(-1, result5);

It’s worth noting that this solution allows us to match dynamic substrings matching a pattern in the input. However, on the other hand, the indexOf() based approach only works with fixed substrings.

值得注意的是,这种解决方案允许我们匹配与输入中的模式匹配的动态子串。但另一方面,基于 indexOf() 的方法只适用于固定子串。

6. Conclusion

6.结论

In this article, we’ve learned various ways to locate the n-th occurrence of a substring within a string:

在本文中,我们学习了查找字符串中第 n 次出现的子串的各种方法:

  • The recursive solution based on the indexOf() method
  • The iterative solution based on the indexOf() method
  • Regex-based solution

As always, the complete source code for the examples is available over on GitHub.

与往常一样,这些示例的完整源代码可在 GitHub 上获取。