1. Overview
When processing text that contains comma-separated values, we may need to ignore commas that occur inside quoted sub-strings.
In this tutorial, we’ll explore different approaches for ignoring commas inside quotes when splitting a comma-separated String.
2. Problem Statement
Suppose we need to split the following comma-separated input:
String input = "baeldung,tutorial,splitting,text,\"ignoring this comma,\"";
After splitting this input and printing the result, we’d expect the following output:
baeldung
tutorial
splitting
text
"ignoring this comma,"
In other words, we cannot consider all comma characters as being separators. We must ignore the commas that occur inside quoted sub-strings.
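To make the problem concrete, here is a minimal sketch of what happens when we split the sample input on every comma. The class name NaiveSplitDemo and the method naiveSplit are hypothetical names chosen only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

class NaiveSplitDemo {

    // Splits on every comma, with no awareness of quotes.
    static List<String> naiveSplit(String input) {
        return Arrays.asList(input.split(","));
    }

    public static void main(String[] args) {
        String input = "baeldung,tutorial,splitting,text,\"ignoring this comma,\"";
        // The quoted value is cut at its inner comma, so we end up with
        // six tokens instead of the five we are aiming for.
        naiveSplit(input).forEach(System.out::println);
    }
}
```

With this approach, the quoted value is broken into two pieces, which is exactly the behavior we want to avoid.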
3. Implementing a Simple Parser
Let’s create a simple parsing algorithm:
// Collects the comma-separated values
List<String> tokens = new ArrayList<>();
int startPosition = 0;
boolean isInQuotes = false;
for (int currentPosition = 0; currentPosition < input.length(); currentPosition++) {
    if (input.charAt(currentPosition) == '\"') {
        // Toggle on every double quote: an opening quote sets the flag, a closing quote clears it
        isInQuotes = !isInQuotes;
    } else if (input.charAt(currentPosition) == ',' && !isInQuotes) {
        // A comma outside quotes ends the current token
        tokens.add(input.substring(startPosition, currentPosition));
        startPosition = currentPosition + 1;
    }
}

// Whatever remains after the last separator is the final token
String lastToken = input.substring(startPosition);
if (lastToken.equals(",")) {
    tokens.add("");
} else {
    tokens.add(lastToken);
}
Here, we start by defining a List called tokens, which is responsible for storing all the comma-separated values.
Next, we iterate over the characters in the input String.
In each loop iteration, we check whether the current character is a double quote. When we find an opening double quote, we set the isInQuotes flag to indicate that all commas up to the closing double quote should be ignored. When we find the closing double quote, the flag is set back to false.
A new token will be added to the tokens list when isInQuotes is false, and we find a comma character. The new token will contain the characters from startPosition until the last position before the comma character.
Then, the new startPosition will be the position after the comma character.
Finally, after the loop, we’ll still have the last token that goes from startPosition to the last position of the input. Therefore, we use the substring() method to get it. If this last token is just a comma, it means that the last token should be an empty string. Otherwise, we add the last token to the tokens list.
Now, let’s test the parsing code:
String input = "baeldung,tutorial,splitting,text,\"ignoring this comma,\"";
var matcher = contains("baeldung", "tutorial", "splitting", "text", "\"ignoring this comma,\"");
assertThat(splitWithParser(input), matcher);
Here, we’ve implemented our parsing code in a static method called splitWithParser. In our test, we define a simple input containing a comma enclosed in double quotes. Next, we use the Hamcrest testing framework to create a contains matcher for the expected output. Finally, we use the assertThat method to check that our parser returns the expected output.
In an actual scenario, we should create more unit tests to verify the behavior of our algorithm with other possible inputs.
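As a starting point for such additional tests, the sketch below runs the parser against a few other inputs. The class name ParserEdgeCases and the chosen inputs are assumptions made for illustration, and the method body simply repeats the algorithm from this section.

```java
import java.util.ArrayList;
import java.util.List;

class ParserEdgeCases {

    // The parsing algorithm from this section, wrapped in a static method.
    static List<String> splitWithParser(String input) {
        List<String> tokens = new ArrayList<>();
        int startPosition = 0;
        boolean isInQuotes = false;
        for (int currentPosition = 0; currentPosition < input.length(); currentPosition++) {
            char current = input.charAt(currentPosition);
            if (current == '"') {
                isInQuotes = !isInQuotes;
            } else if (current == ',' && !isInQuotes) {
                tokens.add(input.substring(startPosition, currentPosition));
                startPosition = currentPosition + 1;
            }
        }
        String lastToken = input.substring(startPosition);
        if (lastToken.equals(",")) {
            tokens.add("");
        } else {
            tokens.add(lastToken);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitWithParser("\"a,b\",c")); // quoted value at the start
        System.out.println(splitWithParser("a,,b"));      // empty value in the middle
        System.out.println(splitWithParser("a,b,"));      // trailing comma
        System.out.println(splitWithParser(""));          // empty input
    }
}
```

Each of these inputs could become its own unit test with an explicit expected list, in the same style as the Hamcrest assertion shown earlier.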
4. Applying Regular Expressions
Implementing a parser is an efficient approach. However, the resulting algorithm is relatively large and complex. Thus, as an alternative, we can use regular expressions.
Next, we will discuss two possible implementations that rely on regular expressions. Nevertheless, they should be used with caution as their processing time is high compared to the previous approach. Therefore, using regular expressions for this scenario can be prohibitive when processing large volumes of input data.
4.1. String split() Method
In this first regular expression option, we’ll use the split() method from the String class. This method splits the String around matches of the given regular expression:
String[] tokens = input.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);
At first glance, the regular expression may seem highly complex. However, its functionality is relatively simple.
In short, the positive lookahead tells split() to split at a comma only if it is followed by an even number of double quotes, which includes the case where no double quotes follow it at all.
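To make the pattern easier to read, the following sketch annotates each part of the regular expression and applies it to our sample input. The class name RegexSplitDemo, the constant SEPARATOR, and the method splitWithRegex are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class RegexSplitDemo {

    // ,                      a comma ...
    // (?=                    ... that is followed by (without consuming it):
    //   (?:[^"]*"[^"]*")*    zero or more complete pairs of double quotes,
    //   [^"]*$               then no further double quotes up to the end of the input
    // )
    private static final Pattern SEPARATOR =
      Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    static List<String> splitWithRegex(String input) {
        return Arrays.asList(SEPARATOR.split(input, -1));
    }

    public static void main(String[] args) {
        String input = "baeldung,tutorial,splitting,text,\"ignoring this comma,\"";
        // The comma inside the quotes is followed by a single (odd) double quote,
        // so the lookahead fails there and the quoted value stays in one piece.
        splitWithRegex(input).forEach(System.out::println);
    }
}
```

Compiling the pattern once into a constant is a design choice in this sketch: it avoids recompiling the expression on every call, which String.split() would otherwise do for a pattern like this.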
The last parameter of the split() method is the limit. When we provide a negative limit, the pattern is applied as many times as possible, the resulting array of tokens can have any length, and trailing empty strings are kept.
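The effect of the limit is easiest to see on an input that ends with empty values. The sketch below compares the default behavior with a negative limit, using a plain comma as the separator to keep the focus on the limit. The class and method names are hypothetical.

```java
class SplitLimitDemo {

    // Default limit (0): trailing empty strings are discarded.
    static String[] splitDefault(String input) {
        return input.split(",");
    }

    // Negative limit: trailing empty strings are kept.
    static String[] splitKeepingTrailing(String input) {
        return input.split(",", -1);
    }

    public static void main(String[] args) {
        String input = "a,b,,";
        System.out.println(splitDefault(input).length);
        System.out.println(splitKeepingTrailing(input).length);
    }
}
```

For comma-separated data, the negative limit is usually the safer choice, since an empty last column is still a column.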
4.2. Guava’s Splitter Class
Another alternative based on regular expressions is the use of the Splitter class from the Guava library:
Pattern pattern = Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
Splitter splitter = Splitter.on(pattern);
List<String> tokens = splitter.splitToList(input);
Here, we are creating a splitter object based on the same regular expression pattern as before. After creating the splitter, we use the splitToList() method, which returns a List of tokens after splitting the input String.
5. Using a CSV Library
Although the alternatives presented are interesting, it may be necessary to use a CSV parsing library such as OpenCSV.
Using a CSV library has the advantage of requiring less effort, as we don’t need to write a parser or a complex regular expression. As a result, our code ends up being less error-prone and easier to maintain.
Moreover, a CSV library may be the best approach when we are not sure about the shape of our input. For example, the input may have escaped quotes, which would not be properly handled by previous approaches.
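As an illustration of this limitation, the sketch below feeds the regular expression from Section 4.1 an input whose quoted value contains backslash-escaped quotes. The escaping convention, the sample input, and the class name EscapedQuotesDemo are all assumptions made for this example.

```java
import java.util.Arrays;
import java.util.List;

class EscapedQuotesDemo {

    // Same regular expression as in Section 4.1; it counts every double quote,
    // including escaped ones, because it has no notion of an escape character.
    static List<String> splitWithRegex(String input) {
        return Arrays.asList(input.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1));
    }

    public static void main(String[] args) {
        // Assumed input, as raw text: a,"say \"hi, there\"",b
        String input = "a,\"say \\\"hi, there\\\"\",b";
        // We would like three values, but the escaped quotes are counted as
        // regular quotes, so the comma after "hi" is treated as a separator.
        splitWithRegex(input).forEach(System.out::println);
    }
}
```

Here, the quoted value is split in two, so we get four tokens instead of three. The simple parser has the same weakness, since it toggles its flag on every double quote.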
To use OpenCSV, we need to include it as a dependency. In a Maven project, we include the opencsv dependency:
<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>4.1</version>
</dependency>
Then, we can use OpenCSV as follows:
CSVParser parser = new CSVParserBuilder()
  .withSeparator(',')
  .build();

CSVReader reader = new CSVReaderBuilder(new StringReader(input))
  .withCSVParser(parser)
  .build();

// readAll() returns one String array per input line
List<String[]> lines = reader.readAll();
reader.close();
Using the CSVParserBuilder class, we start by creating a parser with a comma separator. Then, we use the CSVReaderBuilder to create a CSV reader based on our comma-based parser.
In our example, we provide a StringReader as an argument to the CSVReaderBuilder constructor. However, we can use different readers (e.g., a file reader) if required.
Finally, we call the readAll() method from our reader object to get a List of String arrays. Since OpenCSV is designed to handle multi-line inputs, each position in the lines list corresponds to a line from our input. Thus, for each line, we have a String array with the corresponding comma-separated values.
Unlike previous approaches, with OpenCSV, the double quotes are removed from the generated output.
6. Conclusion
In this article, we explored multiple alternatives for ignoring commas in quotes when splitting a comma-separated String. Besides learning how to implement our own parser, we explored the use of regular expressions and the OpenCSV library.
As always, the code samples used in this tutorial are available over on GitHub.