How to Use Regular Expressions to Replace Tokens in Strings in Java

Last modified: March 8, 2020

1. Overview

When we need to find or replace values in a string in Java, we usually use regular expressions. These allow us to determine whether some or all of a string matches a pattern. We can easily apply the same replacement to multiple tokens in a string with the replaceAll method, which is available on both Matcher and String.
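As a minimal sketch of that "same replacement everywhere" behaviour, the following masks every run of digits with the same character. The class name, the digit pattern, and the sample sentence are assumptions made up for illustration; only the two replaceAll methods come from the JDK.

```java
import java.util.regex.Pattern;

final class SameReplacementExample {
    // Hypothetical pattern for this sketch: one or more digits
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // String.replaceAll compiles the regular expression on every call
    static String maskWithString(String input) {
        return input.replaceAll("\\d+", "#");
    }

    // Matcher.replaceAll reuses the precompiled Pattern
    static String maskWithMatcher(String input) {
        return DIGITS.matcher(input).replaceAll("#");
    }

    public static void main(String[] args) {
        String input = "Order 66 shipped in 3 days";
        System.out.println(maskWithString(input));  // Order # shipped in # days
        System.out.println(maskWithMatcher(input)); // Order # shipped in # days
    }
}
```

Both calls give every token the same replacement, which is exactly the limitation the rest of this tutorial works around.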

In this tutorial, we’ll explore how to apply a different replacement for each token found in a string. This will make it easy for us to satisfy use cases like escaping certain characters or replacing placeholder values.

We’ll also look at a few tricks for tuning our regular expressions to identify tokens correctly.

2. Individually Processing Matches

Before we can build our token-by-token replacement algorithm, we need to understand the Java API around regular expressions. Let’s solve a tricky matching problem using capturing and non-capturing groups.

2.1. Title Case Example

Let’s imagine we want to build an algorithm to process all the title words in a string. These words start with one uppercase character and then either end or continue with only lowercase characters.

Our input might be:

"First 3 Capital Words! then 10 TLAs, I Found"

From the definition of a title word, this contains the matches:

  • First
  • Capital
  • Words
  • I
  • Found

And a regular expression to recognize this pattern would be:

"(?<=^|[^A-Za-z])([A-Z][a-z]*)(?=[^A-Za-z]|$)"

To understand this, let’s break it down into its component parts. We’ll start in the middle:

[A-Z]

will recognize a single uppercase letter.

We allow single-character words as well as words that continue with lowercase letters, so:

[a-z]*

recognizes zero or more lowercase letters.

In some cases, the above two character classes would be enough to recognize our tokens. Unfortunately, in our example text, there is a word that starts with multiple capital letters. Therefore, we need to express that the single capital letter we find must be the first to appear after non-letters.

Similarly, as we allow single-capital-letter words, we need to express that the single capital letter we find must not be the first letter of a word containing several capital letters.
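To see why the two character classes alone are not enough, here is a small sketch that runs the bare pattern [A-Z][a-z]* over part of the example text. The class and method names are hypothetical helpers for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NaiveTitleWordExample {
    // Only the two character classes, with no boundary checks around them
    static final Pattern NAIVE_PATTERN = Pattern.compile("[A-Z][a-z]*");

    // Collects the whole text of every match found in the input
    static List<String> findAll(Pattern pattern, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // "TLAs" is wrongly split into three title words: T, L and As
        System.out.println(findAll(NAIVE_PATTERN, "then 10 TLAs, I Found"));
        // prints [T, L, As, I, Found]
    }
}
```

The unwanted T, L and As matches are what the boundary checks at both ends of the full expression are there to prevent.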

The expression [^A-Za-z] means “any character that is not a letter”. We have put one of these at the start of the expression in a non-capturing group:

(?<=^|[^A-Za-z])

The non-capturing group, starting with (?<=, does a look-behind: it checks the character just before the match without making it part of the match, which ensures the match starts at the correct boundary. Its counterpart at the end, starting with (?=, is a look-ahead that does the same job for the characters that follow.

However, words may touch the very start or end of the string, and we need to account for that. This is why we’ve added ^| to the first group, making it mean “the start of the string or any non-letter character”, and |$ to the end of the last non-capturing group, allowing the end of the string to be a boundary.

Characters checked by these look-behind and look-ahead groups are not consumed, so they do not appear in the match when we use find.

We should note that even a simple use case like this can have many edge cases, so it’s important to test our regular expressions. For this, we can write unit tests, use our IDE’s built-in tools, or use an online tool like Regexr.

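As one way to exercise those edge cases, the sketch below compiles the full expression and tries it on a few short inputs. The class name and the titleWords helper are assumptions for illustration; the pattern string is the one shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleCasePatternExample {
    // Look-behind, capturing group for the word, then look-ahead
    static final Pattern TITLE_CASE_PATTERN =
        Pattern.compile("(?<=^|[^A-Za-z])([A-Z][a-z]*)(?=[^A-Za-z]|$)");

    // Returns the contents of capturing group 1 for every match
    static List<String> titleWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher matcher = TITLE_CASE_PATTERN.matcher(input);
        while (matcher.find()) {
            words.add(matcher.group(1));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(titleWords("A"));          // [A]  word fills the whole string
        System.out.println(titleWords("TLAs"));       // []   several leading capitals
        System.out.println(titleWords("Ab3Cd"));      // [Ab, Cd]  a digit counts as a boundary
        System.out.println(titleWords("then Found")); // [Found]  word touches the end
    }
}
```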

2.2. Testing Our Example

With our example text in a constant called EXAMPLE_INPUT and our regular expression in a Pattern called TITLE_CASE_PATTERN, let’s use find on the Matcher class to extract all of our matches in a unit test:

Matcher matcher = TITLE_CASE_PATTERN.matcher(EXAMPLE_INPUT);
List<String> matches = new ArrayList<>();
while (matcher.find()) {
    matches.add(matcher.group(1));
}

assertThat(matches)
  .containsExactly("First", "Capital", "Words", "I", "Found");

Here we use the matcher method on Pattern to produce a Matcher. Then we call find in a loop, which iterates over all the matches until it stops returning true.

Each time find returns true, the Matcher object’s state is set to represent the current match. We can inspect the whole match with group(0) or inspect particular capturing groups with their 1-based index. In this case, there is a capturing group around the piece we want, so we use group(1) to add the match to our list.

2.3. Inspecting Matcher a Bit More

We’ve so far managed to find the words we want to process.

However, if each of these words were a token that we wanted to replace, we would need to have more information about the match in order to build the resulting string. Let’s look at some other properties of Matcher that might help us:

while (matcher.find()) {
    System.out.println("Match: " + matcher.group(0));
    System.out.println("Start: " + matcher.start());
    System.out.println("End: " + matcher.end());
}

This code will show us where each match is. It also shows us the group(0) match, which is the whole matched text:

Match: First
Start: 0
End: 5
Match: Capital
Start: 8
End: 15
Match: Words
Start: 16
End: 21
Match: I
Start: 37
End: 38
... more

Here we can see that each match contains only the words we’re expecting. The start property shows the zero-based index of the match within the string. The end shows the index of the character just after. This means we could use substring(start, end) to extract each match from the original string, since Java’s substring takes a begin index and an exclusive end index. This is essentially how the group method does that for us.
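The relationship between start, end, substring and group can be checked with a short sketch. The class name and the positions helper are hypothetical; the pattern and input are the ones used throughout this section.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatchPositionExample {
    static final Pattern TITLE_CASE_PATTERN =
        Pattern.compile("(?<=^|[^A-Za-z])([A-Z][a-z]*)(?=[^A-Za-z]|$)");

    // Returns one "start-end" entry per match, after checking that
    // substring(start, end) yields the same text as group()
    static List<String> positions(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TITLE_CASE_PATTERN.matcher(input);
        while (matcher.find()) {
            String viaSubstring = input.substring(matcher.start(), matcher.end());
            if (!viaSubstring.equals(matcher.group())) {
                throw new IllegalStateException("substring and group disagree");
            }
            result.add(matcher.start() + "-" + matcher.end());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(positions("First 3 Capital Words! then 10 TLAs, I Found"));
        // prints [0-5, 8-15, 16-21, 37-38, 39-44]
    }
}
```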

Now that we can use find to iterate over matches, let’s process our tokens.

3. Replacing Matches One by One

Let’s continue our example by using our algorithm to replace each title word in the original string with its lowercase equivalent. This means our test string will be converted to:

"first 3 capital words! then 10 TLAs, i found"

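The replaceAll methods we’ve seen so far take a single replacement string, so they can’t do this for us, and we need to construct an algorithm. Since Java 9, Matcher.replaceAll also accepts a replacer function, but building the algorithm ourselves works on any Java version and shows how the pieces fit together.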

3.1. The Replacement Algorithm

Here is the pseudo-code for the algorithm:

  • Start with an empty output string
  • For each match:
    • Add to the output anything that came before the match and after any previous match
    • Process this match and add that to the output
  • Continue until all matches are processed
  • Add anything left after the last match to the output

We should note that the aim of this algorithm is to find all non-matched areas and add them to the output, as well as adding the processed matches.

3.2. The Token Replacer in Java

We want to convert each word to lowercase, so we can write a simple conversion method:

private static String convert(String token) {
    return token.toLowerCase();
}

Now we can write the algorithm to iterate over the matches. This can use a StringBuilder for the output:

int lastIndex = 0;
StringBuilder output = new StringBuilder();
Matcher matcher = TITLE_CASE_PATTERN.matcher(original);
while (matcher.find()) {
    output.append(original, lastIndex, matcher.start())
      .append(convert(matcher.group(1)));

    lastIndex = matcher.end();
}
if (lastIndex < original.length()) {
    output.append(original, lastIndex, original.length());
}
return output.toString();

We should note that StringBuilder provides a handy version of append that can extract substrings. This works well with the end property of Matcher to let us pick up all non-matched characters since the last match.

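Put together, the loop and the convert method might live in a class like the following sketch. The class name and the method name replaceTitleCaseWithLowerCase are assumptions for this illustration; the body follows the snippet above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TitleCaseLowercaser {
    private static final Pattern TITLE_CASE_PATTERN =
        Pattern.compile("(?<=^|[^A-Za-z])([A-Z][a-z]*)(?=[^A-Za-z]|$)");

    private static String convert(String token) {
        return token.toLowerCase();
    }

    // Hypothetical method name wrapping the loop shown in the text
    static String replaceTitleCaseWithLowerCase(String original) {
        int lastIndex = 0;
        StringBuilder output = new StringBuilder();
        Matcher matcher = TITLE_CASE_PATTERN.matcher(original);
        while (matcher.find()) {
            // Copy the unmatched text since the previous match, then the converted token
            output.append(original, lastIndex, matcher.start())
              .append(convert(matcher.group(1)));
            lastIndex = matcher.end();
        }
        // Copy whatever follows the last match
        if (lastIndex < original.length()) {
            output.append(original, lastIndex, original.length());
        }
        return output.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            replaceTitleCaseWithLowerCase("First 3 Capital Words! then 10 TLAs, I Found"));
        // prints: first 3 capital words! then 10 TLAs, i found
    }
}
```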

4. Generalizing the Algorithm

Now that we’ve solved the problem of replacing some specific tokens, why don’t we convert the code into a form where it can be used for the general case? The only thing that varies from one implementation to the next is the regular expression to use, and the logic for converting each match into its replacement.

4.1. Use a Function and Pattern Input

We can use a Java Function<Matcher, String> object to allow the caller to provide the logic to process each match. And we can take an input called tokenPattern to find all the tokens:

// same as before
while (matcher.find()) {
    output.append(original, lastIndex, matcher.start())
      .append(converter.apply(matcher));

// same as before

Here, the regular expression is no longer hard-coded: the caller passes it in as tokenPattern. The converter function is also provided by the caller and is applied to each match within the find loop.
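Since the snippet above elides the unchanged parts, here is a sketch of what the complete general method could look like. The name replaceTokens and the parameter names tokenPattern and converter follow the text; the enclosing class name, the parameter order, and the number-doubling usage are assumptions for illustration.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenReplacer {
    // Replaces each match of tokenPattern with whatever converter returns for it
    static String replaceTokens(String original, Pattern tokenPattern,
                                Function<Matcher, String> converter) {
        int lastIndex = 0;
        StringBuilder output = new StringBuilder();
        Matcher matcher = tokenPattern.matcher(original);
        while (matcher.find()) {
            output.append(original, lastIndex, matcher.start())
              .append(converter.apply(matcher));
            lastIndex = matcher.end();
        }
        if (lastIndex < original.length()) {
            output.append(original, lastIndex, original.length());
        }
        return output.toString();
    }

    public static void main(String[] args) {
        // Hypothetical use: double every whole number found in the text
        String doubled = replaceTokens("3 apples and 10 pears",
            Pattern.compile("\\d+"),
            match -> String.valueOf(Integer.parseInt(match.group()) * 2));
        System.out.println(doubled); // 6 apples and 20 pears
    }
}
```

Passing the Matcher itself to the converter, rather than just the matched text, lets the caller pick any numbered or named group.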

4.2. Testing the General Version

Let’s see if the general method works as well as the original:

assertThat(replaceTokens("First 3 Capital Words! then 10 TLAs, I Found",
  TITLE_CASE_PATTERN,
  match -> match.group(1).toLowerCase()))
  .isEqualTo("first 3 capital words! then 10 TLAs, i found");

Here we see that calling the code is straightforward. The conversion function is easy to express as a lambda. And the test passes.

Now we have a token replacer, so let’s try some other use cases.

5. Some Use Cases

5.1. Escaping Special Characters

Let’s imagine we wanted to use the regular expression escape character \ to manually quote each character of a regular expression rather than use the quote method. Perhaps we are quoting a string as part of creating a regular expression to pass to another library or service, so block quoting the expression won’t suffice.

If we can express the pattern that means “a regular expression character”, it’s easy to use our algorithm to escape them all:

Pattern regexCharacters = Pattern.compile("[<(\\[{\\\\^\\-=$!|\\]})?*+.>]");

assertThat(replaceTokens("A regex character like [",
  regexCharacters,
  match -> "\\" + match.group()))
  .isEqualTo("A regex character like \\[");

For each match, we prefix the \ character. As \ is a special character in Java strings, it’s escaped with another \.

Indeed, this example is covered in extra \ characters as the character class in the pattern for regexCharacters has to quote many of the special characters. This shows the regular expression parser that we’re using them to mean their literals, not as regular expression syntax.

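One way to check that the escaping does its job is to compile the escaped text as a regular expression and confirm that it matches the original text literally. The sketch below does that for an input with several special characters; the class name, the escape helper, and the sample input are assumptions, and replaceTokens is repeated so the example stands alone.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexEscaperExample {
    private static final Pattern REGEX_CHARACTERS =
        Pattern.compile("[<(\\[{\\\\^\\-=$!|\\]})?*+.>]");

    static String replaceTokens(String original, Pattern tokenPattern,
                                Function<Matcher, String> converter) {
        int lastIndex = 0;
        StringBuilder output = new StringBuilder();
        Matcher matcher = tokenPattern.matcher(original);
        while (matcher.find()) {
            output.append(original, lastIndex, matcher.start())
              .append(converter.apply(matcher));
            lastIndex = matcher.end();
        }
        return output.append(original, lastIndex, original.length()).toString();
    }

    // Prefixes every special character with a backslash
    static String escape(String text) {
        return replaceTokens(text, REGEX_CHARACTERS, match -> "\\" + match.group());
    }

    public static void main(String[] args) {
        String original = "1+1=2 (really?)";
        String escaped = escape(original);
        System.out.println(escaped); // 1\+1\=2 \(really\?\)
        // The escaped text, used as a pattern, matches the original literally
        System.out.println(Pattern.matches(escaped, original)); // true
    }
}
```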

5.2. Replacing Placeholders

A common way to express a placeholder is to use a syntax like ${name}. Let’s consider a use case where the template “Hi ${name} at ${company}” needs to be populated from a map called placeholderValues:

Map<String, String> placeholderValues = new HashMap<>();
placeholderValues.put("name", "Bill");
placeholderValues.put("company", "Baeldung");

All we need is a good regular expression to find the ${…} tokens:

"\\$\\{(?<placeholder>[A-Za-z0-9-_]+)}"

is one option. It has to quote the $ and the initial curly brace as they would otherwise be treated as regular expression syntax.

At the heart of this pattern is a capturing group for the name of the placeholder. We’ve used a character class that allows alphanumeric characters, dashes, and underscores, which should fit most use cases.

However, to make the code more readable, we’ve named this capturing group placeholder. Let’s see how to use that named capturing group:

assertThat(replaceTokens("Hi ${name} at ${company}",
  Pattern.compile("\\$\\{(?<placeholder>[A-Za-z0-9-_]+)}"),
  match -> placeholderValues.get(match.group("placeholder"))))
  .isEqualTo("Hi Bill at Baeldung");

Here we can see that getting the value of the named group out of the Matcher just involves using group with the name as the input, rather than the number.

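One detail worth guarding against is a placeholder with no entry in the map: Map.get would return null, and StringBuilder appends that as the text "null". The sketch below is one possible approach, not part of the original example, that leaves unknown placeholders untouched by falling back to the whole match. The class name, the fill helper, and the ${missing} placeholder are assumptions; the character class is written with the dash last and the closing brace escaped, which is intended to match the same tokens.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlaceholderFiller {
    // Named capturing group "placeholder" holds the name between ${ and }
    private static final Pattern PLACEHOLDER_PATTERN =
        Pattern.compile("\\$\\{(?<placeholder>[A-Za-z0-9_-]+)\\}");

    static String replaceTokens(String original, Pattern tokenPattern,
                                Function<Matcher, String> converter) {
        int lastIndex = 0;
        StringBuilder output = new StringBuilder();
        Matcher matcher = tokenPattern.matcher(original);
        while (matcher.find()) {
            output.append(original, lastIndex, matcher.start())
              .append(converter.apply(matcher));
            lastIndex = matcher.end();
        }
        return output.append(original, lastIndex, original.length()).toString();
    }

    // Unknown placeholders fall back to the full matched token, e.g. "${missing}"
    static String fill(String template, Map<String, String> values) {
        return replaceTokens(template, PLACEHOLDER_PATTERN,
            match -> values.getOrDefault(match.group("placeholder"), match.group()));
    }

    public static void main(String[] args) {
        Map<String, String> placeholderValues = new HashMap<>();
        placeholderValues.put("name", "Bill");
        placeholderValues.put("company", "Baeldung");
        System.out.println(fill("Hi ${name} at ${company}, ref ${missing}", placeholderValues));
        // prints: Hi Bill at Baeldung, ref ${missing}
    }
}
```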

6. Conclusion

In this article, we looked at how to use powerful regular expressions to find tokens in our strings. We learned how the find method works with Matcher to show us the matches.

Then we created and generalized an algorithm to allow us to do token-by-token replacement.

Finally, we looked at a couple of common use-cases for escaping characters and populating templates.

As always, the code examples can be found over on GitHub.
