Guide to Escaping Characters in Java RegExps

Last modified: May 31, 2017

1. Overview

The regular expressions API in Java, java.util.regex, is widely used for pattern matching. To discover more, you can follow this article.

In this article, we will focus on escaping characters within a regular expression and show how it can be done in Java.

2. Special RegExp Characters

According to the Java regular expressions API documentation, a regular expression can contain a set of special characters, also known as metacharacters.

When we want these characters to be matched as is, instead of being interpreted with their special meanings, we need to escape them. By escaping these characters, we force them to be treated as ordinary characters when matching a string with a given regular expression.

The metacharacters that we usually need to escape in this manner are:

<([{\^-=$!|]})?*+.>

Let’s look at a simple code example where we match an input String against a pattern expressed as a regular expression.

This test shows that when the input string foof is matched against the pattern foo. (foo followed by a dot character), matches() returns true, which indicates that the match is successful.

@Test
public void givenRegexWithDot_whenMatchingStr_thenMatches() {
    String strInput = "foof";
    String strRegex = "foo.";
      
    assertEquals(true, strInput.matches(strRegex));
}

You may wonder why the match is successful when there is no dot (.) character present in the input String.

The answer is simple. The dot (.) is a metacharacter – the special significance of dot here is that there can be ‘any character’ in its place. Therefore, it’s clear how the matcher determined that a match is found.

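As a small illustration of this behaviour, the sketch below runs the same unescaped pattern against a few inputs. The class and method names are made up for this example, and it uses a plain main method instead of a JUnit test so that it is self-contained.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the article's code base.
class DotMetacharacterDemo {

    // True when the whole input matches the unescaped pattern "foo."
    static boolean matchesFooDot(String input) {
        return Pattern.matches("foo.", input);
    }

    public static void main(String[] args) {
        // The unescaped dot stands for exactly one character
        // (by default, anything except a line terminator).
        System.out.println(matchesFooDot("foof")); // true
        System.out.println(matchesFooDot("foo.")); // true
        System.out.println(matchesFooDot("foo1")); // true
        // There is no fourth character for the dot to consume.
        System.out.println(matchesFooDot("foo"));  // false
    }
}
```

Note that a literal dot in the input also matches, because the metacharacter accepts it like any other character.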

Let’s say that we do not want the dot (.) character to carry its special meaning. Instead, we want it to be interpreted as a literal dot. This means that in the previous example, we do not want the pattern foo. to match the input String foof.

How would we handle a situation like this? The answer is: we need to escape the dot (.) character so that its special meaning gets ignored.

Let’s dig into it in more detail in the next section.

3. Escaping Characters

According to the Java API documentation for regular expressions, there are two ways in which we can escape characters that have special meaning, that is, force them to be treated as ordinary characters.

Let’s see what they are:

  1. Precede a metacharacter with a backslash (\)
  2. Enclose a metacharacter with \Q and \E

This just means that in the example we saw earlier, if we want to escape the dot character, we need to put a backslash character before the dot character. Alternatively, we can place the dot character in between \Q and \E.

3.1. Escaping Using Backslash

This is one of the techniques that we can use to escape metacharacters in a regular expression. However, the backslash character is also an escape character in Java String literals. Therefore, when we write a regular expression as a String literal, we need to double the backslash that precedes any character (including the \ character itself).

Hence in our example, we need to change the regular expression as shown in this test:

@Test
public void givenRegexWithDotEsc_whenMatchingStr_thenNotMatching() {
    String strInput = "foof";
    String strRegex = "foo\\.";

    assertEquals(false, strInput.matches(strRegex));
}

Here, the dot character is escaped, so the matcher simply treats it as a dot and tries to match an input that ends with a literal dot (i.e. foo.).

In this case, it returns false since there is no match in the input String for that pattern.

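To round this off, here is a minimal sketch of the doubling rule, covering both the escaped dot and the backslash itself. The class and method names are illustrative assumptions, and a main method stands in for the JUnit tests used elsewhere in this article.

```java
// Hypothetical demo class, not part of the article's code base.
class BackslashEscapeDemo {

    // The literal "foo\\." holds the regex foo\. (foo followed by a literal dot).
    static boolean matchesLiteralDot(String input) {
        return input.matches("foo\\.");
    }

    // The literal "a\\\\b" holds the regex a\\b (a, one literal backslash, b).
    static boolean matchesLiteralBackslash(String input) {
        return input.matches("a\\\\b");
    }

    public static void main(String[] args) {
        System.out.println(matchesLiteralDot("foo."));        // true
        System.out.println(matchesLiteralDot("foof"));        // false
        // "a\\b" is the three-character string a\b
        System.out.println(matchesLiteralBackslash("a\\b"));  // true
        System.out.println(matchesLiteralBackslash("ab"));    // false
    }
}
```

Matching a single backslash takes four backslashes in source code: the Java compiler halves them to two, and the regex engine then reads those two as one literal backslash.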

3.2. Escaping Using \Q & \E

Alternatively, we can use \Q and \E to escape the special character. \Q indicates that all characters up to \E need to be escaped, and \E means we need to end the escaping that was started with \Q.

This just means that whatever is in between \Q and \E would be escaped.

In the test shown here, the split() method of the String class splits the input around matches of the regular expression provided to it.

Our requirement is to split the input string by the pipe (|) character into words. Therefore, we use a regular expression pattern to do so.

The pipe character is a metacharacter that needs to be escaped in the regular expression.

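To see why the escaping matters, the following sketch compares an unescaped pipe with one wrapped in \Q and \E. The class and method names are assumptions made for this example. An unescaped | is an alternation between two empty patterns, so it matches the empty string between characters instead of the pipe itself.

```java
// Hypothetical demo class, not part of the article's code base.
class UnescapedPipeDemo {

    // Unescaped: "|" means "empty OR empty", which matches between characters.
    static String[] splitOnUnescapedPipe(String input) {
        return input.split("|");
    }

    // Quoted: \Q|\E matches only the literal pipe character.
    static String[] splitOnQuotedPipe(String input) {
        return input.split("\\Q|\\E");
    }

    public static void main(String[] args) {
        String input = "foo|bar";
        // The input is chopped into many small pieces, not two words.
        System.out.println(splitOnUnescapedPipe(input).length > 2); // true
        // The input is split into "foo" and "bar".
        System.out.println(splitOnQuotedPipe(input).length);        // 2
    }
}
```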

In the following test, the escaping is done by placing the pipe character between \Q and \E:

@Test
public void givenRegexWithPipeEscaped_whenSplitStr_thenSplits() {
    String strInput = "foo|bar|hello|world";
    String strRegex = "\\Q|\\E";
    
    assertEquals(4, strInput.split(strRegex).length);
}

4. The Pattern.quote(String S) Method

The Pattern.quote(String s) method in the java.util.regex.Pattern class converts a given regular expression pattern String into a literal pattern String. This means that all metacharacters in the input String are treated as ordinary characters.

Using this method is a more convenient alternative to using \Q and \E, as it wraps the given String in them for us.

Let’s see this method in action:

@Test
public void givenRegexWithPipeEscQuoteMeth_whenSplitStr_thenSplits() {
    String strInput = "foo|bar|hello|world";
    String strRegex = "|";

    assertEquals(4, strInput.split(Pattern.quote(strRegex)).length);
}

In this quick test, the Pattern.quote() method is used to escape the given regex pattern and transform it into a literal pattern String. In other words, it escapes all the metacharacters present in the regex pattern for us. It does a similar job to \Q and \E.

The pipe character is escaped by the Pattern.quote() method, and split() interprets it as a literal String by which it divides the input.

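A short sketch can make the effect of Pattern.quote() visible. The class and method names below are hypothetical. The printed form of the quoted String is an implementation detail (on OpenJDK it is typically the input wrapped in \Q and \E), so the example relies only on the matching behaviour.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the article's code base.
class PatternQuoteDemo {

    // True when the input is exactly the given text, metacharacters included.
    static boolean matchesLiterally(String text, String input) {
        return input.matches(Pattern.quote(text));
    }

    public static void main(String[] args) {
        // Prints the quoted form; typically \Q|\E on OpenJDK.
        System.out.println(Pattern.quote("|"));

        // The dot in "a.b" is now literal, so only "a.b" itself matches.
        System.out.println(matchesLiterally("a.b", "a.b")); // true
        System.out.println(matchesLiterally("a.b", "axb")); // false
    }
}
```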

As we can see, this is a much cleaner approach, and developers do not have to remember all the escape sequences.

We should note that Pattern.quote encloses the whole block with a single escape sequence. If we wanted to escape characters individually, we would need to use a token replacement algorithm.

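One possible shape for such a token replacement is sketched below. This is an assumption about how it could be done, not a method provided by the Java API: the class name, the METACHARACTERS constant, and the escapeEach() helper are all invented for this example. It puts a backslash in front of every character from the metacharacter list shown in section 2.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of java.util.regex or the article's code base.
class IndividualEscapeDemo {

    // Character class listing the metacharacters <([{\^-=$!|]})?*+.>
    // Inside the class, [ ] \ and - are themselves escaped.
    private static final Pattern METACHARACTERS =
        Pattern.compile("[<(\\[{\\\\^\\-=$!|\\]})?*+.>]");

    // Prefixes each metacharacter with a backslash.
    // In the replacement, "\\\\" is one literal backslash and $0 is the matched character.
    static String escapeEach(String input) {
        return METACHARACTERS.matcher(input).replaceAll("\\\\$0");
    }

    public static void main(String[] args) {
        String escaped = escapeEach("foo.");
        System.out.println(escaped);                  // foo\.
        System.out.println("foo.".matches(escaped));  // true
        System.out.println("foof".matches(escaped));  // false
    }
}
```

Unlike Pattern.quote(), the result stays readable character by character, which can help when the escaped text is later combined with other regex fragments.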

5. Additional Examples

Let’s look at how the replaceAll() method of java.util.regex.Matcher works.

If we need to replace all occurrences of a given character String with another, we can use this method by passing a regular expression to it.

Imagine we have an input with multiple occurrences of the $ character. The result we want to get is the same string with the $ character replaced by £.

This test demonstrates how the pattern $ is passed without being escaped:

@Test
public void givenRegexWithDollar_whenReplacing_thenNotReplace() {
 
    String strInput = "I gave $50 to my brother."
      + "He bought candy for $35. Now he has $15 left.";
    String strRegex = "$";
    String strReplacement = "£";
    String output = "I gave £50 to my brother."
      + "He bought candy for £35. Now he has £15 left.";
    
    Pattern p = Pattern.compile(strRegex);
    Matcher m = p.matcher(strInput);
        
    assertThat(output, not(equalTo(m.replaceAll(strReplacement))));
}

The test asserts that $ is not correctly replaced by £.

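To show what actually happens in this case, here is a minimal sketch with an invented class and method name. An unescaped $ is the end-of-input anchor, so the pattern matches the empty position at the end of the String and the replacement is appended there, while the $ characters stay untouched. The pound sign is written as \u00A3 only to keep the source file ASCII.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the article's code base.
class UnescapedDollarDemo {

    // Replaces matches of the unescaped pattern "$" with a pound sign.
    static String replaceUnescapedDollar(String input) {
        return Pattern.compile("$").matcher(input).replaceAll("\u00A3");
    }

    public static void main(String[] args) {
        // The dollar sign is kept and a pound sign is appended at the end.
        System.out.println(replaceUnescapedDollar("I gave $50"));
    }
}
```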

Now if we escape the regex pattern, the replacement happens correctly, and the test passes as shown in this code snippet:

@Test
public void givenRegexWithDollarEsc_whenReplacing_thenReplace() {
 
    String strInput = "I gave $50 to my brother."
      + "He bought candy for $35. Now he has $15 left.";
    String strRegex = "\\$";
    String strReplacement = "£";
    String output = "I gave £50 to my brother."
      + "He bought candy for £35. Now he has £15 left.";
    Pattern p = Pattern.compile(strRegex);
    Matcher m = p.matcher(strInput);
    
    assertEquals(output, m.replaceAll(strReplacement));
}

Note the \\$ here, which does the trick by escaping the $ character and successfully matching the pattern.

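As a variation, the same replacement could also be written with Pattern.quote() from section 4 instead of a hand-written backslash. The sketch below assumes illustrative class and method names, and writes the pound sign as \u00A3 to keep the source file ASCII.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the article's code base.
class QuotedDollarDemo {

    // Pattern.quote("$") turns the end-of-input anchor into a literal dollar sign.
    static String dollarsToPounds(String input) {
        return input.replaceAll(Pattern.quote("$"), "\u00A3");
    }

    public static void main(String[] args) {
        // Every dollar sign is replaced by a pound sign.
        System.out.println(dollarsToPounds("He bought candy for $35. Now he has $15 left."));
    }
}
```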

6. Conclusion

In this article, we looked at escaping characters in regular expressions in Java.

We discussed why characters in regular expressions need to be escaped, and the different ways in which this can be achieved.

As always, the source code related to this article can be found over on GitHub.
