1. Overview
In this tutorial, we’ll discuss the Java Regex API and how we can use regular expressions in the Java programming language.
In the world of regular expressions, there are many different flavors to choose from, such as grep, Perl, Python, PHP, awk, and many more.
This means that a regular expression that works in one programming language may not work in another. The regular expression syntax in Java is most similar to that found in Perl.
2. Setup
To use regular expressions in Java, we don’t need any special setup. The JDK contains a package, java.util.regex, dedicated to regex operations. We only need to import it into our code.
Moreover, the java.lang.String class also has built-in regex support that we commonly use in our code.
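As a minimal sketch of the String-level support mentioned above, the following uses the standard String methods matches, replaceAll, and split, all of which accept a regex. The class and helper names are made up for illustration:

```java
import java.util.Arrays;

class StringRegexSketch {
    // True only if the whole input consists of one or more digits.
    static boolean isAllDigits(String input) {
        return input.matches("\\d+");
    }

    // Collapses every run of white space into a single space.
    static String collapseWhitespace(String input) {
        return input.replaceAll("\\s+", " ");
    }

    // Splits on commas, ignoring white space around them.
    static String[] splitOnCommas(String input) {
        return input.split("\\s*,\\s*");
    }

    public static void main(String[] args) {
        System.out.println(isAllDigits("12345"));
        System.out.println(collapseWhitespace("a   b \t c"));
        System.out.println(Arrays.toString(splitOnCommas("x , y,z")));
    }
}
```

These methods compile the regex on every call, so for repeated use a precompiled Pattern is usually the better choice.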
3. Java Regex Package
The java.util.regex package consists of three classes: Pattern, Matcher, and PatternSyntaxException:
- A Pattern object is a compiled regex. The Pattern class provides no public constructors. To create a pattern, we must invoke one of its public static compile methods, which return a Pattern object. These methods accept a regular expression as the first argument.
- A Matcher object interprets the pattern and performs match operations against an input String. It also defines no public constructors. We obtain a Matcher object by invoking the matcher method on a Pattern object.
- A PatternSyntaxException is an unchecked exception that indicates a syntax error in a regular expression pattern.
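To illustrate the third class, here’s a small sketch that tries to compile a regex and reports the syntax error if there is one. The class and method names are hypothetical; getDescription and getIndex are accessors on PatternSyntaxException:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class PatternSyntaxSketch {
    // Returns null if the regex compiles, otherwise a short error summary.
    static String describeSyntaxError(String regex) {
        try {
            Pattern.compile(regex);
            return null;
        } catch (PatternSyntaxException e) {
            // The exception is unchecked, so catching it is optional.
            return e.getDescription() + " near index " + e.getIndex();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeSyntaxError("foo"));   // valid regex
        System.out.println(describeSyntaxError("[abc"));  // unclosed character class
    }
}
```

Catching the exception is mostly useful when the regex comes from user input or configuration rather than from a literal in our code.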
We’ll explore these classes in detail; however, we must first understand how to construct a regex in Java.
If we’re already familiar with regex from a different environment, we may find certain differences, but they’re minimal.
4. Simple Example
Let’s start with the simplest use case for a regex. When we apply a regex to a String, it may match zero or more times.
The most basic form of pattern matching supported by the java.util.regex API is the match of a String literal. For example, if the regular expression is foo and the input String is foo, the match will succeed because the Strings are identical:
@Test
public void givenText_whenSimpleRegexMatches_thenCorrect() {
Pattern pattern = Pattern.compile("foo");
Matcher matcher = pattern.matcher("foo");
assertTrue(matcher.find());
}
We first create a Pattern object by calling its static compile method and passing it the pattern we want to use.
Then we create a Matcher object by calling the Pattern object’s matcher method and passing it the text we want to check for matches.
Finally, we call the find method on the Matcher object.
The find method keeps advancing through the input text and returns true for every match, so we can use it to find the match count as well:
@Test
public void givenText_whenSimpleRegexMatchesTwice_thenCorrect() {
Pattern pattern = Pattern.compile("foo");
Matcher matcher = pattern.matcher("foofoo");
int matches = 0;
while (matcher.find()) {
matches++;
}
assertEquals(matches, 2);
}
Since we’ll be running more tests, we can abstract the logic for finding the number of matches in a method called runTest:
public static int runTest(String regex, String text) {
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(text);
int matches = 0;
while (matcher.find()) {
matches++;
}
return matches;
}
When we get 0 matches, the test should fail; otherwise, it should pass.
5. Meta Characters
Meta characters affect the way a pattern is matched; in a way, they add logic to the search pattern. The Java API supports several meta characters, the most straightforward being the dot “.”, which matches any character:
@Test
public void givenText_whenMatchesWithDotMetach_thenCorrect() {
int matches = runTest(".", "foo");
assertTrue(matches > 0);
}
Let’s consider the previous examples, where the regex foo matched the text foo once and the text foofoo twice. If we use the dot meta character in the regex, we won’t get two matches in the second case:
@Test
public void givenRepeatedText_whenMatchesOnceWithDotMetach_thenCorrect() {
int matches= runTest("foo.", "foofoo");
assertEquals(matches, 1);
}
Notice the dot after foo in the regex. The pattern now matches foo followed by any single character. The first match consumes foof, and the remaining oo can’t start another match. That’s why there’s only a single match.
The API supports several other meta characters, <([{\^-=$!|]})?*+.>, which we’ll explore further in this article.
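When we want one of these meta characters to be matched literally, we have to escape it. The sketch below shows two common options under the assumption that we count matches the same way runTest does: a backslash in front of the character, or Pattern.quote for a whole literal. The class name and the count helper are illustrative only:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaEscapeSketch {
    // Counts how many times the regex is found in the text.
    static int count(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        // Unescaped dot: matches each of the three characters.
        System.out.println(count(".", "a.b"));
        // Escaped dot: matches only the literal dot.
        System.out.println(count("\\.", "a.b"));
        // Pattern.quote treats the whole argument as a literal.
        System.out.println(count(Pattern.quote("1+1"), "1+1=2"));
    }
}
```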
6. Character Classes
Browsing through the official Pattern class specification, we’ll discover summaries of supported regex constructs. Under character classes, we have about six constructs.
6.1. OR Class
We construct this as [abc]. It matches any single element of the set:
@Test
public void givenORSet_whenMatchesAny_thenCorrect() {
int matches = runTest("[abc]", "b");
assertEquals(matches, 1);
}
If they all appear in the text, it’ll match each element separately with no regard to the order:
@Test
public void givenORSet_whenMatchesAnyAndAll_thenCorrect() {
int matches = runTest("[abc]", "cab");
assertEquals(matches, 3);
}
They can also be alternated as part of a String. In the following example, when we create different words by alternating the first letter with each element of the set, they’re all matched:
@Test
public void givenORSet_whenMatchesAllCombinations_thenCorrect() {
int matches = runTest("[bcr]at", "bat cat rat");
assertEquals(matches, 3);
}
6.2. NOR Class
We negate the above set by adding a caret as the first element:
@Test
public void givenNORSet_whenMatchesNon_thenCorrect() {
int matches = runTest("[^abc]", "g");
assertTrue(matches > 0);
}
Here’s another case:
@Test
public void givenNORSet_whenMatchesAllExceptElements_thenCorrect() {
int matches = runTest("[^bcr]at", "sat mat eat");
assertTrue(matches > 0);
}
6.3. Range Class
We can define a class that specifies the range that the matched text should fall within by using a hyphen (-). Likewise, we can also negate a range.
Matching uppercase letters:
@Test
public void givenUpperCaseRange_whenMatchesUpperCase_thenCorrect() {
int matches = runTest(
"[A-Z]", "Two Uppercase alphabets 34 overall");
assertEquals(matches, 2);
}
Matching lowercase letters:
@Test
public void givenLowerCaseRange_whenMatchesLowerCase_thenCorrect() {
int matches = runTest(
"[a-z]", "Two Uppercase alphabets 34 overall");
assertEquals(matches, 26);
}
Matching both upper case and lower case letters:
@Test
public void givenBothLowerAndUpperCaseRange_whenMatchesAllLetters_thenCorrect() {
int matches = runTest(
"[a-zA-Z]", "Two Uppercase alphabets 34 overall");
assertEquals(matches, 28);
}
Matching a given range of numbers:
@Test
public void givenNumberRange_whenMatchesAccurately_thenCorrect() {
int matches = runTest(
"[1-5]", "Two Uppercase alphabets 34 overall");
assertEquals(matches, 2);
}
Matching another range of numbers:
@Test
public void givenNumberRange_whenMatchesAccurately_thenCorrect2() {
int matches = runTest(
"3[0-5]", "Two Uppercase alphabets 34 overall");
assertEquals(matches, 1);
}
6.4. Union Class
A union character class is the result of combining two or more character classes:
@Test
public void givenTwoSets_whenMatchesUnion_thenCorrect() {
int matches = runTest("[1-3[7-9]]", "123456789");
assertEquals(matches, 6);
}
The above test will only match six out of the nine digits because the union set skips 4, 5, and 6.
6.5. Intersection Class
Similar to the union class, this class results from picking the common elements of two or more sets. To apply intersection, we use the && operator:
@Test
public void givenTwoSets_whenMatchesIntersection_thenCorrect() {
int matches = runTest("[1-6&&[3-9]]", "123456789");
assertEquals(matches, 4);
}
We’ll get four matches because the intersection of the two sets has only four elements: 3, 4, 5, and 6.
6.6. Subtraction Class
We can use subtraction to exclude one or more character classes from another by intersecting with a negated class. For example, we can match the decimal digits except 2, 4, 6, and 8:
@Test
public void givenSetWithSubtraction_whenMatchesAccurately_thenCorrect() {
int matches = runTest("[0-9&&[^2468]]", "123456789");
assertEquals(matches, 5);
}
Only 1, 3, 5, 7, and 9 will be matched in this input.
7. Predefined Character Classes
The Java regex API also accepts predefined character classes. Some of the above character classes can be expressed in a shorter form, although this makes the code less intuitive. One special aspect of the Java version of these constructs is the escape character.
As we’ll see, most predefined classes start with a backslash, which also has a special meaning in Java String literals. For the Pattern class to receive the backslash, we must escape it in the literal, i.e. \d becomes \\d.
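The two layers of escaping can be made visible with a short sketch. It assumes nothing beyond the standard API; the class and method names are invented for the example. The Java literal "\\d" is a two-character String, and a regex that matches one literal backslash has to be written as "\\\\" in source code:

```java
import java.util.regex.Pattern;

class BackslashLayersSketch {
    // Number of characters the regex engine actually receives.
    static int charsSeenByEngine(String regex) {
        return regex.length();
    }

    // True if the regex is found anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // The compiler turns "\\d" into the two characters \ and d.
        System.out.println(charsSeenByEngine("\\d"));
        // The engine then reads \d as the digit class.
        System.out.println(found("\\d", "abc7"));
        // Four backslashes in source become the regex \\, one literal backslash.
        System.out.println(found("\\\\", "C:\\temp"));
    }
}
```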
Matching digits, equivalent to [0-9]:
@Test
public void givenDigits_whenMatches_thenCorrect() {
int matches = runTest("\\d", "123");
assertEquals(matches, 3);
}
Matching non-digits, equivalent to [^0-9]:
@Test
public void givenNonDigits_whenMatches_thenCorrect() {
int matches = runTest("\\D", "a6c");
assertEquals(matches, 2);
}
Matching white space:
@Test
public void givenWhiteSpace_whenMatches_thenCorrect() {
int matches = runTest("\\s", "a c");
assertEquals(matches, 1);
}
Matching non-white space:
@Test
public void givenNonWhiteSpace_whenMatches_thenCorrect() {
int matches = runTest("\\S", "a c");
assertEquals(matches, 2);
}
Matching a word character, equivalent to [a-zA-Z_0-9]:
@Test
public void givenWordCharacter_whenMatches_thenCorrect() {
int matches = runTest("\\w", "hi!");
assertEquals(matches, 2);
}
Matching a non-word character:
@Test
public void givenNonWordCharacter_whenMatches_thenCorrect() {
int matches = runTest("\\W", "hi!");
assertEquals(matches, 1);
}
8. Quantifiers
The Java regex API also allows us to use quantifiers. These enable us to further tweak the match’s behavior by specifying the number of occurrences to match against.
To match a text zero or one time, we use the ? quantifier:
@Test
public void givenZeroOrOneQuantifier_whenMatches_thenCorrect() {
int matches = runTest("a?", "hi");
assertEquals(matches, 3);
}
Alternatively, we can use the brace syntax, which is also supported by the Java regex API:
@Test
public void givenZeroOrOneQuantifier_whenMatches_thenCorrect2() {
int matches = runTest("a{0,1}", "hi");
assertEquals(matches, 3);
}
This example introduces the concept of zero-length matches. If a quantifier’s minimum number of occurrences is zero, the pattern can match an empty String at every position where the quantified text is absent, including the position at the end of the input. This means that even if the input is empty, we’ll get one zero-length match.
This explains why we get three matches in the above example, despite having a String of length two: one zero-length match before h, one before i, and one at the end of the input.
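To see where the zero-length matches come from, we can record the start and end index of every match. This is only a sketch; the class name, the helper, and the output format are assumptions made for the example, and it uses the pattern a? against the text hi:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthSketch {
    // Describes each match as start-end:'matched text'.
    static List<String> describeMatches(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.start() + "-" + matcher.end()
              + ":'" + matcher.group() + "'");
        }
        return result;
    }

    public static void main(String[] args) {
        // Three empty matches: before h, before i, and at the end.
        System.out.println(describeMatches("a?", "hi"));
        // An empty input still yields one empty match at index 0.
        System.out.println(describeMatches("a?", ""));
    }
}
```

A match whose start and end indices are equal is a zero-length match.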
To match a text zero or more times, we use the * quantifier, which behaves similarly to ? here:
@Test
public void givenZeroOrManyQuantifier_whenMatches_thenCorrect() {
int matches = runTest("a*", "hi");
assertEquals(matches, 3);
}
Supported alternative:
@Test
public void givenZeroOrManyQuantifier_whenMatches_thenCorrect2() {
int matches = runTest("a{0,}", "hi");
assertEquals(matches, 3);
}
The quantifier with a difference is +, which has a matching threshold of one. If the required String doesn’t occur at all, there will be no match, not even a zero-length String:
@Test
public void givenOneOrManyQuantifier_whenMatches_thenCorrect() {
int matches = runTest("a+", "hi");
assertFalse(matches > 0);
}
Supported alternative:
@Test
public void givenOneOrManyQuantifier_whenMatches_thenCorrect2() {
int matches = runTest("a{1,}", "hi");
assertFalse(matches > 0);
}
As in Perl and other languages, we can use the brace syntax to match a given text a number of times:
@Test
public void givenBraceQuantifier_whenMatches_thenCorrect() {
int matches = runTest("a{3}", "aaaaaa");
assertEquals(matches, 2);
}
In the above example, we get two matches, since a match occurs only if a appears three times in a row. However, in the next test, we won’t get a match because the text only appears two times in a row:
@Test
public void givenBraceQuantifier_whenFailsToMatch_thenCorrect() {
int matches = runTest("a{3}", "aa");
assertFalse(matches > 0);
}
When we use a range in the brace, the match will be greedy, matching from the higher end of the range:
@Test
public void givenBraceQuantifierWithRange_whenMatches_thenCorrect() {
int matches = runTest("a{2,3}", "aaaa");
assertEquals(matches, 1);
}
Here we specified at least two occurrences, but not exceeding three, so we get a single match where the matcher sees a single aaa and a lone a, which can’t be matched.
However, the API allows us to specify a lazy or reluctant approach such that the matcher can start from the lower end of the range, matching two occurrences as aa and aa:
@Test
public void givenBraceQuantifierWithRange_whenMatchesLazily_thenCorrect() {
int matches = runTest("a{2,3}?", "aaaa");
assertEquals(matches, 2);
}
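The difference between the greedy and the reluctant quantifier becomes clearer when we look at the matched text instead of the match count. The following sketch collects every matched substring; the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantSketch {
    // Collects the text of every match in order of appearance.
    static List<String> allMatches(String regex, String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Greedy: takes three a's first, the lone a left over can't match.
        System.out.println(allMatches("a{2,3}", "aaaa"));   // [aaa]
        // Reluctant: stops at two a's, so a second match is possible.
        System.out.println(allMatches("a{2,3}?", "aaaa"));  // [aa, aa]
    }
}
```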
9. Capturing Groups
The API also allows us to treat multiple characters as a single unit through capturing groups. It numbers the capturing groups and allows back referencing using these numbers.
In this section, we’ll see a few examples of how to use capturing groups in the Java regex API.
Let’s use a capturing group that matches only when an input text contains two digits next to each other:
@Test
public void givenCapturingGroup_whenMatches_thenCorrect() {
int matches = runTest("(\\d\\d)", "12");
assertEquals(matches, 1);
}
The number attached to the above group is 1. With a back reference, \\1, we can tell the matcher that we want to match another occurrence of the text captured by that group. Without one, we get two separate matches for the following input:
@Test
public void givenCapturingGroup_whenMatches_thenCorrect2() {
int matches = runTest("(\\d\\d)", "1212");
assertEquals(matches, 2);
}
With a back reference, we get a single match that spans the entire length of the input:
@Test
public void givenCapturingGroup_whenMatchesWithBackReference_thenCorrect() {
int matches = runTest("(\\d\\d)\\1", "1212");
assertEquals(matches, 1);
}
Without back referencing, we would have to repeat the regex to get a single match for this input:
@Test
public void givenCapturingGroup_whenMatches_thenCorrect3() {
int matches = runTest("(\\d\\d)(\\d\\d)", "1212");
assertEquals(matches, 1);
}
Similarly, for any other number of repetitions, back referencing can make the matcher see the input as a single match:
@Test
public void givenCapturingGroup_whenMatchesWithBackReference_thenCorrect2() {
int matches = runTest("(\\d\\d)\\1\\1\\1", "12121212");
assertEquals(matches, 1);
}
But if we change even the last digit, the match will fail:
@Test
public void givenCapturingGroupAndWrongInput_whenMatchFailsWithBackReference_thenCorrect() {
int matches = runTest("(\\d\\d)\\1", "1213");
assertFalse(matches > 0);
}
It’s important not to forget the escape backslashes, which are crucial in Java syntax.
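Besides back references inside the regex, the group numbers are also what we pass to the Matcher’s group method to read the captured text. Here’s a minimal sketch that assumes a date in year-month-day form; the class name, the method name, and the date format are chosen only for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupExtractionSketch {
    // Returns the whole match at index 0, followed by each captured group.
    static String[] dateParts(String text) {
        Matcher matcher = Pattern
          .compile("(\\d{4})-(\\d{2})-(\\d{2})")
          .matcher(text);
        if (!matcher.find()) {
            return new String[0];
        }
        String[] parts = new String[matcher.groupCount() + 1];
        for (int i = 0; i <= matcher.groupCount(); i++) {
            // Group 0 is the entire match; groups 1..n follow the parentheses.
            parts[i] = matcher.group(i);
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : dateParts("released on 2024-03-15")) {
            System.out.println(part);
        }
    }
}
```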
10. Boundary Matchers
The Java regex API also supports boundary matching. If we care about where exactly in the input text the match should occur, then this is what we’re looking for. With the previous examples, all we cared about was whether or not a match was found.
To match only when the regex is found at the beginning of the text, we use the caret ^.
This test will pass, since the text dog can be found at the beginning:
@Test
public void givenText_whenMatchesAtBeginning_thenCorrect() {
int matches = runTest("^dog", "dogs are friendly");
assertTrue(matches > 0);
}
In the following test, the match fails because dog isn’t at the beginning of the text:
@Test
public void givenTextAndWrongInput_whenMatchFailsAtBeginning_thenCorrect() {
int matches = runTest("^dog", "are dogs are friendly?");
assertFalse(matches > 0);
}
To match only when the regex is found at the end of the text, we use the dollar character $. We’ll find a match in the following case:
@Test
public void givenText_whenMatchesAtEnd_thenCorrect() {
int matches = runTest("dog$", "Man's best friend is a dog");
assertTrue(matches > 0);
}
And we won’t find a match here:
@Test
public void givenTextAndWrongInput_whenMatchFailsAtEnd_thenCorrect() {
int matches = runTest("dog$", "is a dog man's best friend?");
assertFalse(matches > 0);
}
If we want a match only when the required text is found at a word boundary, we use the \\b boundary matcher at the beginning and end of the regex.
A space marks a word boundary:
@Test
public void givenText_whenMatchesAtWordBoundary_thenCorrect() {
int matches = runTest("\\bdog\\b", "a dog is friendly");
assertTrue(matches > 0);
}
The empty string at the beginning of a line is also a word boundary:
@Test
public void givenText_whenMatchesAtWordBoundary_thenCorrect2() {
int matches = runTest("\\bdog\\b", "dog is man's best friend");
assertTrue(matches > 0);
}
These tests pass because the beginning of a String, as well as the space between one text and another, marks a word boundary. However, the following test shows the opposite:
@Test
public void givenWrongText_whenMatchFailsAtWordBoundary_thenCorrect() {
int matches = runTest("\\bdog\\b", "snoop dogg is a rapper");
assertFalse(matches > 0);
}
Two word characters appearing in a row don’t mark a word boundary, but we can make the test pass by changing the end of the regex to look for a non-word boundary:
@Test
public void givenText_whenMatchesAtWordAndNonBoundary_thenCorrect() {
int matches = runTest("\\bdog\\B", "snoop dogg is a rapper");
assertTrue(matches > 0);
}
11. Pattern Class Methods
Previously, we only created Pattern objects in a basic way. However, this class has another variant of the compile method that accepts a set of flags alongside the regex argument, which affects the way we match the pattern.
These flags are int constants defined on the Pattern class that form a bit mask. Let’s overload the runTest method in the test class so that it can take a flag as the third argument:
public static int runTest(String regex, String text, int flags) {
Pattern pattern = Pattern.compile(regex, flags);
Matcher matcher = pattern.matcher(text);
int matches = 0;
while (matcher.find()){
matches++;
}
return matches;
}
In this section, we’ll look at the different supported flags and how to use them.
Pattern.CANON_EQ
This flag enables canonical equivalence. When specified, two characters will be considered to match if, and only if, their full canonical decompositions match.
Consider the accented Unicode character é. Its composite code point is \u00E9. However, Unicode also has separate code points for its component characters: e, \u0065, and the combining acute accent, \u0301. In this case, the composite character \u00E9 is canonically equivalent to the two-character sequence \u0065\u0301.
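We can check this equivalence outside the regex API with java.text.Normalizer, which converts between the composed and decomposed forms. This is a side illustration rather than part of the regex package, and the class and method names are invented for the sketch:

```java
import java.text.Normalizer;

class CanonicalFormSketch {
    // Canonical decomposition (NFD): splits é into e plus the combining accent.
    static String decompose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD);
    }

    // Canonical composition (NFC): joins e plus the accent back into é.
    static String compose(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(decompose("\u00E9").equals("\u0065\u0301"));
        System.out.println(compose("\u0065\u0301").equals("\u00E9"));
        // The two Strings differ as char sequences, which is why the
        // default matching mode doesn't treat them as equal.
        System.out.println("\u00E9".equals("\u0065\u0301"));
    }
}
```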
By default, matching doesn’t take canonical equivalence into account:
@Test
public void givenRegexWithoutCanonEq_whenMatchFailsOnEquivalentUnicode_thenCorrect() {
int matches = runTest("\u00E9", "\u0065\u0301");
assertFalse(matches > 0);
}
But if we add the flag, then the test will pass:
@Test
public void givenRegexWithCanonEq_whenMatchesOnEquivalentUnicode_thenCorrect() {
int matches = runTest("\u00E9", "\u0065\u0301", Pattern.CANON_EQ);
assertTrue(matches > 0);
}
Pattern.CASE_INSENSITIVE
This flag enables matching regardless of case. By default, matching takes case into account:
@Test
public void givenRegexWithDefaultMatcher_whenMatchFailsOnDifferentCases_thenCorrect() {
int matches = runTest("dog", "This is a Dog");
assertFalse(matches > 0);
}
So using this flag, we can change the default behavior:
@Test
public void givenRegexWithCaseInsensitiveMatcher_whenMatchesOnDifferentCases_thenCorrect() {
int matches = runTest(
"dog", "This is a Dog", Pattern.CASE_INSENSITIVE);
assertTrue(matches > 0);
}
We can also use the equivalent, embedded flag expression to achieve the same result:
@Test
public void givenRegexWithEmbeddedCaseInsensitiveMatcher_whenMatchesOnDifferentCases_thenCorrect() {
int matches = runTest("(?i)dog", "This is a Dog");
assertTrue(matches > 0);
}
Pattern.COMMENTS
The Java API allows us to include comments using # in the regex. This can help in documenting a complex regex that may not be immediately obvious to another programmer.
The comments flag makes the matcher ignore any white space or comments in the regex and only consider the pattern. In the default matching mode, the match in the following test fails:
@Test
public void givenRegexWithComments_whenMatchFailsWithoutFlag_thenCorrect() {
int matches = runTest(
"dog$ #check for word dog at end of text", "This is a dog");
assertFalse(matches > 0);
}
This is because the matcher will look for the entire regex in the input text, including the spaces and the # character. But when we use the flag, it’ll ignore the extra spaces, and all text starting with # will be seen as a comment to be ignored for each line:
@Test
public void givenRegexWithComments_whenMatchesWithFlag_thenCorrect() {
int matches = runTest(
"dog$ #check end of text","This is a dog", Pattern.COMMENTS);
assertTrue(matches > 0);
}
There’s also an alternative embedded flag expression for this:
@Test
public void givenRegexWithComments_whenMatchesWithEmbeddedFlag_thenCorrect() {
int matches = runTest(
"(?x)dog$ #check end of text", "This is a dog");
assertTrue(matches > 0);
}
Pattern.DOTALL
By default, the dot “.” expression in a regex matches any character except a line terminator.
Using this flag, the dot will match line terminators as well. We’ll understand this better with the following examples. These examples will be a little different. Since we want to assert against the matched String, we’ll use the matcher’s group method, which returns the previous match.
First, let’s see the default behavior:
@Test
public void givenRegexWithLineTerminator_whenMatchFails_thenCorrect() {
Pattern pattern = Pattern.compile("(.*)");
Matcher matcher = pattern.matcher(
"this is a text" + System.getProperty("line.separator")
+ " continued on another line");
matcher.find();
assertEquals("this is a text", matcher.group(1));
}
As we can see, only the first part of the input before the line terminator is matched.
Now in dotall mode, the entire text, including the line terminator, will be matched:
@Test
public void givenRegexWithLineTerminator_whenMatchesWithDotall_thenCorrect() {
Pattern pattern = Pattern.compile("(.*)", Pattern.DOTALL);
Matcher matcher = pattern.matcher(
"this is a text" + System.getProperty("line.separator")
+ " continued on another line");
matcher.find();
assertEquals(
"this is a text" + System.getProperty("line.separator")
+ " continued on another line", matcher.group(1));
}
We can also use an embedded flag expression to enable dotall mode:
@Test
public void givenRegexWithLineTerminator_whenMatchesWithEmbeddedDotall_thenCorrect() {
Pattern pattern = Pattern.compile("(?s)(.*)");
Matcher matcher = pattern.matcher(
"this is a text" + System.getProperty("line.separator")
+ " continued on another line");
matcher.find();
assertEquals(
"this is a text" + System.getProperty("line.separator")
+ " continued on another line", matcher.group(1));
}
Pattern.LITERAL
When in this mode, the matcher gives no special meaning to any meta characters, escape characters, or regex syntax. Without this flag, the matcher will match the following regex against any input String:
@Test
public void givenRegex_whenMatchesWithoutLiteralFlag_thenCorrect() {
int matches = runTest("(.*)", "text");
assertTrue(matches > 0);
}
This is the default behavior we’ve seen in all the examples. However, with this flag, we won’t find a match, since the matcher will be looking for (.*) instead of interpreting it:
@Test
public void givenRegex_whenMatchFailsWithLiteralFlag_thenCorrect() {
int matches = runTest("(.*)", "text", Pattern.LITERAL);
assertFalse(matches > 0);
}
Now if we add the required string, the test will pass:
@Test
public void givenRegex_whenMatchesWithLiteralFlag_thenCorrect() {
int matches = runTest("(.*)", "text(.*)", Pattern.LITERAL);
assertTrue(matches > 0);
}
There’s no embedded flag character for enabling literal parsing.
Pattern.MULTILINE
By default, the ^ and $ meta characters match at the beginning and end, respectively, of the entire input String. The matcher doesn’t treat line terminators inside the input as line boundaries:
@Test
public void givenRegex_whenMatchFailsWithoutMultilineFlag_thenCorrect() {
int matches = runTest(
"dog$", "This is a dog" + System.getProperty("line.separator")
+ "this is a fox");
assertFalse(matches > 0);
}
This match will fail because the matcher searches for dog at the end of the entire String, but the dog is present at the end of the first line of the string.
However, with the flag, the same test will pass, since the matcher now takes into account line terminators. So the String dog is found just before the line terminates, meaning success:
@Test
public void givenRegex_whenMatchesWithMultilineFlag_thenCorrect() {
int matches = runTest(
"dog$", "This is a dog" + System.getProperty("line.separator")
+ "this is a fox", Pattern.MULTILINE);
assertTrue(matches > 0);
}
Here’s the embedded flag version:
@Test
public void givenRegex_whenMatchesWithEmbeddedMultilineFlag_thenCorrect() {
int matches = runTest(
"(?m)dog$", "This is a dog" + System.getProperty("line.separator")
+ "this is a fox");
assertTrue(matches > 0);
}
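Since the flags form a bit mask, we can combine several of them with the bitwise OR operator, or list several letters in one embedded flag expression. The sketch below assumes a three-line input and a hypothetical count helper that mirrors the three-argument runTest:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagCombinationSketch {
    // Counts matches of the regex compiled with the given flags.
    static int count(String regex, String text, int flags) {
        Matcher matcher = Pattern.compile(regex, flags).matcher(text);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "Cat\nDOG\ndog";
        // Both flags: ^ matches at each line start and case is ignored.
        System.out.println(count("^dog", text,
          Pattern.CASE_INSENSITIVE | Pattern.MULTILINE));  // 2
        // Only MULTILINE: the upper-case DOG no longer matches.
        System.out.println(count("^dog", text, Pattern.MULTILINE));  // 1
        // Embedded equivalent of combining both flags.
        System.out.println(count("(?im)^dog", text, 0));  // 2
    }
}
```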
12. Matcher Class Methods
In this section, we’ll learn about the useful methods of the Matcher class. We’ll group them according to functionality for clarity.
12.1. Index Methods
Index methods provide index values that show us precisely where the match was found in the input String. The start method returns the index of the first matched character, and the end method returns the index after the last matched character. In the following test, we’ll confirm the start and end indices of the match for dog in the input String:
@Test
public void givenMatch_whenGetsIndices_thenCorrect() {
Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("This dog is mine");
matcher.find();
assertEquals(5, matcher.start());
assertEquals(8, matcher.end());
}
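The start and end methods also have overloads that take a group number, which gives us the position of an individual capturing group. As a hedged sketch, with a made-up class name, helper, and sample input:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupIndexSketch {
    // Returns {start, end} of the given group in the first match, or null.
    static int[] groupSpan(String regex, String text, int group) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return new int[] { matcher.start(group), matcher.end(group) };
    }

    public static void main(String[] args) {
        String regex = "(\\w+)@(\\w+)";
        String text = "mail bob@example now";
        for (int group = 0; group <= 2; group++) {
            int[] span = groupSpan(regex, text, group);
            // substring(start, end) gives back the text of that group.
            System.out.println(group + ": " + text.substring(span[0], span[1]));
        }
    }
}
```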
12.2. Study Methods
Study methods go through the input String and return a boolean indicating whether or not the pattern was found. Commonly used are the matches and lookingAt methods.
Both methods attempt to match an input sequence against a pattern. The difference is that matches requires the entire input sequence to be matched, while lookingAt doesn’t.
Both methods start at the beginning of the input String:
@Test
public void whenStudyMethodsWork_thenCorrect() {
Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("dogs are friendly");
assertTrue(matcher.lookingAt());
assertFalse(matcher.matches());
}
The matches method will return true in a case like this:
@Test
public void whenMatchesStudyMethodWorks_thenCorrect() {
Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher("dog");
assertTrue(matcher.matches());
}
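It can help to put find, lookingAt, and matches side by side on the same inputs. The following sketch uses a fresh Matcher for each call so that one method’s state doesn’t influence the next; the class name and the summary format are illustrative:

```java
import java.util.regex.Pattern;

class StudyMethodsSketch {
    // Summarizes the result of the three methods for one regex and text.
    static String compare(String regex, String text) {
        Pattern pattern = Pattern.compile(regex);
        boolean find = pattern.matcher(text).find();            // anywhere
        boolean lookingAt = pattern.matcher(text).lookingAt();  // at the start
        boolean matches = pattern.matcher(text).matches();      // whole input
        return "find=" + find + " lookingAt=" + lookingAt + " matches=" + matches;
    }

    public static void main(String[] args) {
        System.out.println(compare("dog", "a dog"));
        System.out.println(compare("dog", "dogs"));
        System.out.println(compare("dog", "dog"));
    }
}
```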
12.3. Replacement Methods
Replacement methods are useful for replacing text in an input String. The common ones are replaceFirst and replaceAll.
Both methods replace the text that matches a given regular expression. As their names indicate, replaceFirst replaces the first occurrence, and replaceAll replaces all occurrences:
@Test
public void whenReplaceFirstWorks_thenCorrect() {
Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher(
"dogs are domestic animals, dogs are friendly");
String newStr = matcher.replaceFirst("cat");
assertEquals(
"cats are domestic animals, dogs are friendly", newStr);
}
Replace all occurrences:
@Test
public void whenReplaceAllWorks_thenCorrect() {
Pattern pattern = Pattern.compile("dog");
Matcher matcher = pattern.matcher(
"dogs are domestic animals, dogs are friendly");
String newStr = matcher.replaceAll("cat");
assertEquals("cats are domestic animals, cats are friendly", newStr);
}
The replaceAll method allows us to substitute all matches with the same replacement. If we want to replace matches on a case-by-case basis, we’d need a token replacement technique.
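One possible shape of such a technique, sketched here under our own assumptions, uses the Matcher’s appendReplacement and appendTail methods to choose a replacement per match. The ${name} placeholder syntax, the class name, and the expand method are invented for this example:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenReplacementSketch {
    // Replaces each ${key} with its value; unknown keys are left untouched.
    static String expand(String template, Map<String, String> values) {
        Matcher matcher = Pattern.compile("\\$\\{(\\w+)\\}").matcher(template);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String replacement = values.getOrDefault(matcher.group(1), matcher.group());
            // quoteReplacement keeps $ and \ in the value from being
            // interpreted as group references.
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("name", "Ann");
        values.put("pet", "Rex");
        System.out.println(expand("Hi ${name}, meet ${pet}", values));
    }
}
```

On Java 9 and later, the replaceAll overload that accepts a function is another option for the same job.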
13. Conclusion
In this article, we learned how to use regular expressions in Java. We also explored the most important features of the java.util.regex package.
The full source code for the project, including all the code samples used here, can be found in the GitHub project.