1. Overview
Sometimes, we need to read raw text from files and clean up messy content by removing line breaks.
In this tutorial, we’ll explore various approaches for removing line breaks from files in Java.
2. A Word About Line Breaks
Before we dive into the code for reading from files and removing line breaks, let’s take a quick look at what we want to remove: the line breaks.
At first glance, it’s pretty straightforward. A line break is a character breaking a line. However, there are different kinds of line breaks. We may fall into pitfalls if we don’t treat them properly. An example can explain it quickly.
Let’s say we have two text files, multiple-line-1.txt and multiple-line-2.txt. Let’s call them file1 and file2. If we open them in an IDE’s editor, for example, IntelliJ, both files look the same:
A,
 B,
 C,
 D,
 E,
 F
As we can see, each file has six lines, and every line after the first starts with a space character. So, we’d assume file1 and file2 contain exactly the same text.
However, now let’s print the file content using the cat command with the -n (show line numbers) and -e (show non-printing characters) options:
$ cat -ne multiple-line-1.txt
     1  A,$
     2   B,$
     3   C,$
     4   D,$
     5   E,$
     6   F$
file1’s output is the same as we saw in the IntelliJ editor. But file2 looks quite different:
$ cat -ne multiple-line-2.txt
     1  A,^M B,$
     2   C,$
     3   D,^M E,$
     4   F$
This is because there are three different line breaks:
- ‘\r’ – CR (Carriage Return), the line break in classic Mac OS (before Mac OS X)
- ‘\n’ – LF (Line Feed), the line break in *nix and macOS
- ‘\r\n’ – CRLF, the line break in Windows
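cat -e displays the carriage return character (‘\r’) as ‘^M’ and marks each line feed with ‘$’. So, we see that file2 contains carriage returns that the editor didn’t reveal. Possibly, the file was created on Windows. Depending on requirements, we may want to remove all kinds of line breaks or only the line breaks of the current system.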
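To check which kinds of line breaks a text contains without leaving Java, we could count them ourselves. The following is only a minimal sketch: the LineBreakCounter class and its count() method are hypothetical names introduced for illustration, and the sample string is made up rather than taken from file1 or file2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
class LineBreakCounter {

    // Counts CRLF pairs, lone CR characters, and lone LF characters in the text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("CRLF", 0);
        counts.put("CR", 0);
        counts.put("LF", 0);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // A CR immediately followed by an LF forms a single CRLF line break.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    counts.merge("CRLF", 1, Integer::sum);
                    i++; // skip the LF that belongs to this CRLF
                } else {
                    counts.merge("CR", 1, Integer::sum);
                }
            } else if (c == '\n') {
                counts.merge("LF", 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample mixing all three kinds of line breaks.
        String sample = "A,\r\n B,\n C,\r D";
        System.out.println(count(sample)); // {CRLF=1, CR=1, LF=1}
    }
}
```

Treating a CR followed by an LF as one CRLF, rather than as two separate breaks, matches the three-way classification in the list above.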
Next, we’ll take these two files as examples to see how to read content from them and remove line breaks. For simplicity, we’ll create two helper methods to return each file’s Path:
Path file1Path() throws Exception {
    return Paths.get(this.getClass().getClassLoader().getResource("multiple-line-1.txt").toURI());
}

Path file2Path() throws Exception {
    return Paths.get(this.getClass().getClassLoader().getResource("multiple-line-2.txt").toURI());
}
Note that the approaches used in this article read the whole text into memory, so we should be careful with very large files.
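If a file is too large to hold in memory, one possible alternative is to copy it character by character and skip the line-break characters on the way. This is a sketch of a different technique than the ones covered in this article. The StreamingLineBreakRemover class, its method names, and the temporary files are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
class StreamingLineBreakRemover {

    // Copies source to target, dropping every '\r' and '\n' character on the way.
    static void removeLineBreaks(Path source, Path target) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                if (c != '\r' && c != '\n') {
                    writer.write(c);
                }
            }
        }
    }

    // Runs the copy on a small made-up sample stored in temporary files.
    static String demo() {
        try {
            Path source = Files.createTempFile("multi-line", ".txt");
            Path target = Files.createTempFile("single-line", ".txt");
            try {
                Files.write(source, "A,\r\n B,\n C".getBytes(StandardCharsets.UTF_8));
                removeLineBreaks(source, target);
                return new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(source);
                Files.deleteIfExists(target);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // A, B, C
    }
}
```

Only the reader's and writer's buffers are kept in memory here, at the cost of writing the result to a second file instead of returning it as a string.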
3. Replacing line.separator With an Empty String
The system property line.separator stores the line separator that is specific to the current operating system. Therefore, if we only want to remove line breaks particular to the current system, we can replace line.separator with an empty string. For example, this approach removes all line breaks from file1 on a Linux box:
String content = Files.readString(file1Path(), StandardCharsets.UTF_8);
String result = content.replace(System.getProperty("line.separator"), "");
assertEquals("A, B, C, D, E, F", result);
We use the Files class’s readString() method (available since Java 11) to load the file content into a string. Then, we apply the replacement with replace().
However, the same approach won’t remove all line breaks from file2 on Linux. There, line.separator is “\n”, so the “\r” characters in file2 stay in the result:
String content = Files.readString(file2Path(), StandardCharsets.UTF_8);
String result = content.replace(System.getProperty("line.separator"), "");
assertNotEquals("A, B, C, D, E, F", result); // <-- NOT equals assertion!
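The platform dependence can also be reproduced without any files. The sketch below uses System.lineSeparator(), which returns the same value as the line.separator property read at JVM startup. The SystemSeparatorDemo class and the sample strings are hypothetical and only serve as an illustration.

```java
// Hypothetical helper for illustration only.
class SystemSeparatorDemo {

    // Removes only the line separator of the platform the JVM is running on.
    static String removeSystemSeparator(String text) {
        return text.replace(System.lineSeparator(), "");
    }

    public static void main(String[] args) {
        String sep = System.lineSeparator();

        // Text built with the current platform's separator is flattened completely.
        String nativeText = "A," + sep + " B," + sep + " C";
        System.out.println(removeSystemSeparator(nativeText)); // A, B, C

        // Text with lone CR breaks contains neither "\n" nor "\r\n",
        // so it comes back unchanged on Linux, macOS, and Windows.
        String foreignText = "A,\r B,\r C";
        System.out.println(foreignText.equals(removeSystemSeparator(foreignText)));
    }
}
```

This makes the limitation explicit: the result depends on where the code runs, not on what the file actually contains.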
Next, let’s see if we can remove all line breaks system-independently.
4. Replacing “\n” and “\r” With Empty Strings
We’ve learned that all three kinds of line breaks consist only of “\n” and “\r” characters. Therefore, if we want to remove all line breaks system-independently, we can replace “\n” and “\r” with empty strings:
String content1 = Files.readString(file1Path(), StandardCharsets.UTF_8);
// file contains CRLF
String content2 = Files.readString(file2Path(), StandardCharsets.UTF_8);
String result1 = content1.replace("\r", "").replace("\n", "");
String result2 = content2.replace("\r", "").replace("\n", "");
assertEquals("A, B, C, D, E, F", result1);
assertEquals("A, B, C, D, E, F", result2);
Of course, we can also use the regex-based replaceAll() method to achieve the same goal. Let’s take file2 as an example to see how it works:
String resultReplaceAll = content2.replaceAll("[\\n\\r]", "");
assertEquals("A, B, C, D, E, F", resultReplaceAll);
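Since Java 8, the regex engine also offers the \R linebreak matcher, which could serve as a variation of the character class above. Note that \R is broader than [\n\r]: besides CR, LF, and CRLF, it also matches characters such as the vertical tab, form feed, next line (U+0085), and the Unicode line and paragraph separators. The following is a sketch with a hypothetical LinebreakMatcherDemo class and a made-up sample string.

```java
// Hypothetical helper for illustration only.
class LinebreakMatcherDemo {

    // Removes every sequence matched by the \R linebreak matcher.
    static String removeLineBreaks(String text) {
        return text.replaceAll("\\R", "");
    }

    public static void main(String[] args) {
        // Made-up sample mixing CRLF, LF, and CR line breaks.
        String mixed = "A,\r\n B,\n C,\r D";
        System.out.println(removeLineBreaks(mixed)); // A, B, C, D
    }
}
```

If the text may legitimately contain form feeds or vertical tabs that should be kept, the narrower [\n\r] character class is the safer choice.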
5. Using readAllLines() and Then join()
Let’s recall the two approaches we’ve learned so far. We first read the entire content of a file, then replace either the value of the line.separator system property or the “\n” and “\r” characters with empty strings. What these approaches have in common is that we handle the line breaks ourselves.
The Files class offers readAllLines() to read the file content line by line and return a list of strings. It’s worth noting that readAllLines() recognizes all three line breaks mentioned above as line terminators and doesn’t include them in the returned lines. In other words, this method removes all line breaks from the input. All we need to do is join the elements of the returned list.
The String.join() method is a convenient way to join a list or an array of strings:
List<String> lines1 = Files.readAllLines(file1Path(), StandardCharsets.UTF_8);
// file contains CRLF
List<String> lines2 = Files.readAllLines(file2Path(), StandardCharsets.UTF_8);
String result1 = String.join("", lines1);
String result2 = String.join("", lines2);
assertEquals("A, B, C, D, E, F", result1);
assertEquals("A, B, C, D, E, F", result2);
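A closely related variation is Files.lines(), which recognizes the same three line terminators but returns a lazily populated Stream instead of a list. The sketch below joins that stream with Collectors.joining(). The JoinLinesDemo class, its method names, and the temporary file are assumptions made for illustration, so the example runs without the two resource files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only.
class JoinLinesDemo {

    // Reads the file line by line and concatenates the lines without a delimiter.
    static String joinLines(Path path) throws IOException {
        // The stream holds an open file, so we close it with try-with-resources.
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.collect(Collectors.joining());
        }
    }

    // Writes a made-up sample with CRLF, LF, and CR breaks to a temp file and joins its lines.
    static String demo() {
        try {
            Path temp = Files.createTempFile("mixed-breaks", ".txt");
            try {
                Files.write(temp, "A,\r\n B,\n C,\r D".getBytes(StandardCharsets.UTF_8));
                return joinLines(temp);
            } finally {
                Files.deleteIfExists(temp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // A, B, C, D
    }
}
```

This avoids the intermediate list of lines, although the joined result itself still ends up in memory as a single string.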
6. Conclusion
In this article, we first discussed the different kinds of line breaks. Then, we explored various approaches to removing line breaks from a file.
As always, the complete source code for the examples is available over on GitHub.