1. Introduction
Splitting strings is a routine task for programmers. Occasionally, we need to split a string on one or more distinct delimiters and also return those delimiters as part of the result.
Let’s discuss the available solutions to this string-splitting problem in detail.
2. Fundamentals
The Java ecosystem offers quite a few options (java.lang.String, Guava, and Apache Commons, to name a few) that facilitate splitting strings in both simple and fairly complex cases. Additionally, feature-rich regular expressions provide extra flexibility for splitting problems that revolve around matching a specific pattern.
3. Look-Around Assertions
In regular expressions, a look-around assertion checks whether another pattern occurs ahead of (lookahead) or behind (lookbehind) the current position in the source string, without consuming any characters. Let’s understand this better with an example.
The regex Java(?=Baeldung), which contains a lookahead assertion, matches “Java” only if it is followed by “Baeldung”.
Likewise, the regex (?<!#)\d+, which contains a negative lookbehind assertion, matches a number only if it is not preceded by ‘#’.
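The two assertions above can be tried out with a minimal sketch based on java.util.regex. The class name LookAroundDemo, the helper findAll, and the sample inputs are our own choices for illustration, not part of the article's code base.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookAroundDemo {

    // Hypothetical helper: collects every substring of input matched by regex.
    public static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Lookahead: "Java" matches only where "Baeldung" follows.
        // The text "Baeldung" is checked but is not part of the match.
        System.out.println(findAll("Java(?=Baeldung)", "JavaBaeldung JavaScript"));

        // Negative lookbehind: the digits must not be directly preceded by '#'.
        System.out.println(findAll("(?<!#)\\d+", "#7 and 42"));
    }
}
```

One subtlety worth keeping in mind: the lookbehind only inspects the character directly before the match start, so applying (?<!#)\d+ to "#12" still matches "2", because that digit is preceded by "1" rather than "#".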
Let’s use look-around assertions in regular expressions to devise a solution to our problem.
In all of the examples in this article, we’re going to use two simple Strings:
String text = "Hello@World@This@Is@A@Java@Program";
String textMixed = "@HelloWorld@This:Is@A#Java#Program";
4. Using String.split()
Let’s begin by using the split() method from the String class of the core Java library.
We’ll evaluate lookahead assertions, lookbehind assertions, and combinations of them to split the strings as desired.
4.1. Positive Lookahead
First of all, let’s use the lookahead assertion “((?=@))” and split the string text around its matches:
String[] splits = text.split("((?=@))");
The lookahead regex splits the string at every position that is immediately followed by an “@” symbol. The content of the resulting array is:
[Hello, @World, @This, @Is, @A, @Java, @Program]
This regex keeps each delimiter attached to the start of the following token instead of returning it as a separate element of the splits array. Let’s try an alternate approach.
4.2. Positive Lookbehind
We can also use a positive lookbehind assertion “((?<=@))” to split the string text:
String[] splits = text.split("((?<=@))");
However, the resulting array still doesn’t contain the delimiters as individual elements; each one is now attached to the end of the preceding token:
[Hello@, World@, This@, Is@, A@, Java@, Program]
4.3. Positive Lookahead or Lookbehind
We can combine the two look-arounds explained above with a logical OR and see the result in action.
The resulting regex “((?=@)|(?<=@))” gives us the desired result. The code snippet below demonstrates this:
String[] splits = text.split("((?=@)|(?<=@))");
This regular expression splits the string both before and after every “@”, so the resulting array contains the delimiters as separate elements:
[Hello, @, World, @, This, @, Is, @, A, @, Java, @, Program]
Now that we understand the required look-around regular expression, we can modify it based on the different types of delimiters present in the input string.
Let’s split textMixed, as defined previously, using a suitable regex:
String[] splitsMixed = textMixed.split("((?=:|#|@)|(?<=:|#|@))");
Executing the above line of code produces the following result:
[@, HelloWorld, @, This, :, Is, @, A, #, Java, #, Program]
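The same regex can be assembled for an arbitrary set of delimiters. The following is a minimal sketch, assuming single-character delimiters as in the examples above; the class DelimiterSplitDemo and the helper splitKeepingDelimiters are hypothetical names introduced here for illustration. Each delimiter is passed through Pattern.quote() so that regex metacharacters such as ‘.’ or ‘|’ are treated literally.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class DelimiterSplitDemo {

    // Hypothetical helper, not part of the JDK: splits input around the given
    // literal delimiters and keeps each delimiter as its own array element.
    // Only exercised here with single-character delimiters.
    public static String[] splitKeepingDelimiters(String input, String... delimiters) {
        // Quote every delimiter so characters such as '.' or '|' match literally.
        String alternation = Arrays.stream(delimiters)
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
        // Split at every position that is followed by, or preceded by, a delimiter.
        String regex = "(?=" + alternation + ")|(?<=" + alternation + ")";
        return input.split(regex);
    }

    public static void main(String[] args) {
        String textMixed = "@HelloWorld@This:Is@A#Java#Program";
        String[] splitsMixed = splitKeepingDelimiters(textMixed, ":", "#", "@");
        System.out.println(Arrays.toString(splitsMixed));

        // The split points are zero-width, so no characters are dropped:
        // joining the pieces again yields the original input.
        System.out.println(String.join("", splitsMixed).equals(textMixed));
    }
}
```

Because look-around matches are zero-width, nothing is removed from the input, which is why the delimiters survive as elements. Multi-character delimiters that can overlap with themselves (for example “--” inside “---”) may produce more split points than expected, so this sketch should be tested before being used for such cases.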
5. Using Guava Splitter
Now that we have clarity on the regex assertions discussed in the above section, let’s delve into a Java library offered by Google.
The Splitter class from Guava offers the methods on() and onPattern() to split a string using a regular expression pattern as a separator.
To start with, let’s see them in action on the string text containing a single delimiter “@”:
List<String> splits = Splitter.onPattern("((?=@)|(?<=@))").splitToList(text);
List<String> splits2 = Splitter.on(Pattern.compile("((?=@)|(?<=@))")).splitToList(text);
The results of executing the above lines of code are quite similar to the ones generated by the split() method, except that we now have Lists instead of arrays.
Likewise, we can also use these methods to split a string containing multiple distinct delimiters:
List<String> splitsMixed = Splitter.onPattern("((?=:|#|@)|(?<=:|#|@))").splitToList(textMixed);
List<String> splitsMixed2 = Splitter.on(Pattern.compile("((?=:|#|@)|(?<=:|#|@))")).splitToList(textMixed);
As we can see, the two methods differ in the type of argument they take.
The on() method accepts an argument of type java.util.regex.Pattern, whereas the onPattern() method accepts the separator regex as a String.
6. Using Apache Commons StringUtils
We can also take advantage of the Apache Commons Lang project’s StringUtils method splitByCharacterType().
It’s important to note that this method splits the input string by character type, as returned by java.lang.Character.getType(char). Here, we don’t get to pick or extract delimiters of our choosing.
Furthermore, it delivers the best results when the source string has a constant case, either upper or lower, throughout:
String[] splits = StringUtils.splitByCharacterType("pg@no;10@hello;world@this;is@a#10words;Java#Program");
The character types present in the above string are uppercase and lowercase letters, digits, and special characters (@ ; # ).
Hence, the resulting array splits looks like this, with the mixed-case words “Java” and “Program” each split in two because uppercase and lowercase letters are different character types:
[pg, @, no, ;, 10, @, hello, ;, world, @, this, ;, is, @, a, #, 10, words, ;, J, ava, #, P, rogram]
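To illustrate the idea of grouping by character type, here is a simplified sketch that uses only the JDK. It is not the Apache Commons implementation; the class CharacterTypeSplitDemo and the method splitByType are hypothetical names, and the sketch simply starts a new token whenever Character.getType(char) changes.

```java
import java.util.ArrayList;
import java.util.List;

public class CharacterTypeSplitDemo {

    // Simplified stand-in for the behavior described above, NOT the Apache
    // Commons code: a new token starts whenever the character type changes.
    public static List<String> splitByType(String input) {
        List<String> tokens = new ArrayList<>();
        if (input.isEmpty()) {
            return tokens;
        }
        int tokenStart = 0;
        int currentType = Character.getType(input.charAt(0));
        for (int i = 1; i < input.length(); i++) {
            int type = Character.getType(input.charAt(i));
            if (type != currentType) {
                // Type changed: close the current token and start a new one.
                tokens.add(input.substring(tokenStart, i));
                tokenStart = i;
                currentType = type;
            }
        }
        // Add the final token, which runs to the end of the input.
        tokens.add(input.substring(tokenStart));
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(
            splitByType("pg@no;10@hello;world@this;is@a#10words;Java#Program"));
    }
}
```

Since ‘@’, ‘;’ and ‘#’ all share one character type, adjacent special characters such as “@;” stay together in a single token under this approach, which is another reason we can’t treat them as individually chosen delimiters.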
7. Conclusion
In this article, we’ve seen how to split a string in such a way that the delimiters are also available in the resulting array.
First, we discussed look-around assertions and used them to get the desired results. Later, we used the methods provided by the Guava library to achieve similar results.
Finally, we wrapped up with the Apache Commons Lang library, which provides a more user-friendly method for the related problem of splitting a string by character type, which also returns the delimiters.
As always, the code used in this article can be found over on GitHub.