1. Overview
In this tutorial, we’ll learn how to split a large file in Java. First, we’ll compare reading files in memory with reading files using streams. Later, we’ll learn to split files based on their size and number.
2. Read File In-Memory vs. Stream
Whenever we read a file into memory, the JVM keeps all of its lines in memory. This is a good choice for small files. For large files, however, it frequently results in an OutOfMemoryError.
Streaming through a file is another way to read it, and there are many ways to stream and read large files. Because the whole file is never held in memory at once, this approach consumes less memory and works well with large files without exhausting the heap.
For our examples, we’ll use streams to read the large files.
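To make the contrast concrete, here is a minimal sketch that counts the lines of a text file both ways. It assumes a UTF-8 text file; the class name ReadComparison and its method names are illustrative and not part of the article's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Illustrative names; not part of the article's code.
class ReadComparison {

    // In-memory: the whole file is materialized as a List<String> before we count.
    static long countLinesInMemory(Path file) throws IOException {
        List<String> allLines = Files.readAllLines(file, StandardCharsets.UTF_8);
        return allLines.size();
    }

    // Streaming: only the current line (plus the reader's internal buffer) is held at a time.
    static long countLinesStreaming(Path file) throws IOException {
        long count = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("read-comparison", ".txt");
        try {
            Files.write(file, List.of("first", "second", "third"), StandardCharsets.UTF_8);
            System.out.println(countLinesInMemory(file));
            System.out.println(countLinesStreaming(file));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Both methods return the same count; the difference is that the memory needed by the first grows with the file size, while the second stays roughly constant.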
3. File Split by File Size
So far, we've looked at how to read large files, but sometimes we need to split them into smaller files or send them over the network in smaller pieces.
We'll begin by splitting the large file into smaller files, each with a specific maximum size.
For our example, we'll take one 4.3MB file, largeFile.txt, in our project src/main/resource folder, split it into files of 1MB each, and store them under the /target/split directory.
Let's first get the large file and open an input stream on it:
File largeFile = new File("LARGE_FILE_PATH");
InputStream inputstream = Files.newInputStream(largeFile.toPath());
Here, we're only loading the file metadata; the content of the large file isn't loaded into memory yet.
For our example, we use a constant, fixed size. In practical use cases, the maxSizeOfSplitFiles value can be read dynamically and changed as the application requires.
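As one possible way to make this value configurable, the sketch below reads it from a JVM system property and falls back to a default. The property name split.maxSizeBytes and the SplitConfig class are assumptions made for illustration; a real application might read the value from a configuration file or environment variable instead.

```java
// Hypothetical helper; the class and the property name "split.maxSizeBytes" are assumptions.
class SplitConfig {

    // Default used when the property is absent or not a valid integer.
    static final int DEFAULT_MAX_SIZE_BYTES = 1_024_000;

    static int maxSizeOfSplitFiles() {
        // Integer.getInteger reads a system property and returns the default
        // if the property is missing or cannot be parsed as an integer.
        int size = Integer.getInteger("split.maxSizeBytes", DEFAULT_MAX_SIZE_BYTES);
        if (size <= 0) {
            throw new IllegalArgumentException("split.maxSizeBytes must be positive, but was " + size);
        }
        return size;
    }

    public static void main(String[] args) {
        // Run with, for example: java -Dsplit.maxSizeBytes=2048000 SplitConfig
        System.out.println(maxSizeOfSplitFiles());
    }
}
```

Validating the value up front matters here because it is later used as an array length for the read buffer.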
Now, let's define a method that takes the largeFile object, the maxSizeOfSplitFiles for each split file, and the directory to write the split files to:
public List<File> splitByFileSize(File largeFile, int maxSizeOfSplitFiles, String splitFileDirPath)
  throws IOException {
    // ...
}
Next, let's create the SplitLargeFile class and implement the splitByFileSize() method:
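import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.FilenameUtils; // from Apache Commons IO

class SplitLargeFile {

    public List<File> splitByFileSize(File largeFile, int maxSizeOfSplitFiles, String splitFileDirPath)
      throws IOException {
        List<File> listOfSplitFiles = new ArrayList<>();
        try (InputStream in = Files.newInputStream(largeFile.toPath())) {
            // Reused for every chunk, so at most maxSizeOfSplitFiles bytes are buffered at a time
            final byte[] buffer = new byte[maxSizeOfSplitFiles];
            // readNBytes (Java 9+) fills the buffer unless the stream ends first and returns 0
            // at end of stream; a plain read(buffer) may return fewer bytes than requested
            int dataRead = in.readNBytes(buffer, 0, buffer.length);
            while (dataRead > 0) {
                File splitFile = getSplitFile(FilenameUtils.removeExtension(largeFile.getName()),
                  buffer, dataRead, splitFileDirPath);
                listOfSplitFiles.add(splitFile);
                dataRead = in.readNBytes(buffer, 0, buffer.length);
            }
        }
        return listOfSplitFiles;
    }

    // Writes the first 'length' bytes of the buffer to a new, uniquely named file in splitFileDirPath
    private File getSplitFile(String largeFileName, byte[] buffer, int length, String splitFileDirPath)
      throws IOException {
        File splitFile = File.createTempFile(largeFileName + "-", "-split", new File(splitFileDirPath));
        try (FileOutputStream fos = new FileOutputStream(splitFile)) {
            fos.write(buffer, 0, length);
        }
        return splitFile;
    }
}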
Using maxSizeOfSplitFiles, we can specify the maximum number of bytes in each smaller chunk file.
At most maxSizeOfSplitFiles bytes are loaded into memory at a time and written out as a small file. The buffer is then reused for the next maxSizeOfSplitFiles bytes. Because memory use is bounded by the chunk size rather than by the size of the large file, this avoids an OutOfMemoryError as long as a single chunk fits in the heap.
As a final step, the method returns a list of split files stored under the splitFileDirPath.
We can store the split files in any temporary directory or any custom directory.
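One consequence of using File.createTempFile() is that the generated file names contain random characters, so the order of the chunks is only preserved in the returned list, not in the names. As a variation, here is a JDK-only sketch that numbers the chunks sequentially. The SequentialSplitter class, its split() method, and the ".partNNNN" naming pattern are assumptions made for illustration and are not the article's code; readNBytes requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; class, method, and file-name pattern are assumptions.
class SequentialSplitter {

    static List<Path> split(Path source, int maxChunkBytes, Path targetDir) throws IOException {
        if (maxChunkBytes <= 0) {
            throw new IllegalArgumentException("maxChunkBytes must be positive");
        }
        Files.createDirectories(targetDir);
        List<Path> chunks = new ArrayList<>();
        byte[] buffer = new byte[maxChunkBytes]; // reused for every chunk
        try (InputStream in = Files.newInputStream(source)) {
            int read;
            // readNBytes fills the buffer unless the stream ends first; it returns 0 at end of stream
            while ((read = in.readNBytes(buffer, 0, buffer.length)) > 0) {
                String name = String.format("%s.part%04d", source.getFileName(), chunks.size() + 1);
                Path chunk = targetDir.resolve(name);
                try (OutputStream out = Files.newOutputStream(chunk)) {
                    out.write(buffer, 0, read);
                }
                chunks.add(chunk);
            }
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("split-demo");
        Path source = Files.write(dir.resolve("sample.bin"), new byte[2500]);
        for (Path chunk : split(source, 1000, dir.resolve("parts"))) {
            System.out.println(chunk.getFileName() + " " + Files.size(chunk));
        }
    }
}
```

Running main on the 2,500-byte sample produces three parts of 1,000, 1,000, and 500 bytes, and sorting the part names restores the original order.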
Now, let’s test this:
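import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

public class SplitLargeFileUnitTest {

    // Assuming JUnit 5 (Jupiter), as the package-private test methods suggest;
    // @BeforeAll is its equivalent of JUnit 4's @BeforeClass
    @BeforeAll
    static void prepareData() throws IOException {
        Files.createDirectories(Paths.get("target/split"));
    }

    private String splitFileDirPath() throws Exception {
        return Paths.get("target").toString() + "/split";
    }

    private Path largeFilePath() throws Exception {
        return Paths.get(this.getClass().getClassLoader().getResource("largeFile.txt").toURI());
    }

    @Test
    void givenLargeFile_whenSplitLargeFile_thenSplitBySize() throws Exception {
        File input = largeFilePath().toFile();
        SplitLargeFile slf = new SplitLargeFile();
        slf.splitByFileSize(input, 1024_000, splitFileDirPath());
    }
}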
Finally, once we test, we can see that the program splits the large file into four files of 1MB and one file of 240KB and puts them under the project target/split directory.
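The number of resulting files follows directly from the file size and the chunk size. The sketch below works through that arithmetic. The input size of 4,336,000 bytes is a hypothetical value chosen to match the result described above (the exact size of largeFile.txt isn't stated), and the ChunkMath class is illustrative.

```java
// Illustrative arithmetic helper; not part of the article's code.
class ChunkMath {

    // Number of chunks needed: fileSize / maxChunkSize, rounded up.
    static long chunkCount(long fileSize, int maxChunkSize) {
        return fileSize / maxChunkSize + (fileSize % maxChunkSize == 0 ? 0 : 1);
    }

    // Size of the final chunk: the remainder, or a full chunk if the size divides evenly.
    static long lastChunkSize(long fileSize, int maxChunkSize) {
        if (fileSize == 0) {
            return 0;
        }
        long remainder = fileSize % maxChunkSize;
        return remainder == 0 ? maxChunkSize : remainder;
    }

    public static void main(String[] args) {
        long fileSize = 4_336_000L; // hypothetical size of a "4.3MB" file
        int maxChunkSize = 1_024_000;
        System.out.println(chunkCount(fileSize, maxChunkSize));    // 5
        System.out.println(lastChunkSize(fileSize, maxChunkSize)); // 240000
    }
}
```

With these numbers, four chunks hold 1,024,000 bytes each (4,096,000 bytes in total) and the fifth holds the remaining 240,000 bytes.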
4. File Split by File Count
Now, let's split the given large file into a specified number of smaller files. To do this, we first work out the size of each small file from the size of the large file and the requested number of files, rounding up so that all the data fits.
We'll then reuse the splitByFileSize() method from earlier for the actual splitting.
Let’s create a method splitByNumberOfFiles():
class SplitLargeFile {

    public List<File> splitByNumberOfFiles(File largeFile, int noOfFiles, String splitFileDirPath)
      throws IOException {
        return splitByFileSize(largeFile, getSizeInBytes(largeFile.length(), noOfFiles), splitFileDirPath);
    }

    // Computes the size of each split file: the total size divided by the number of files, rounded up
    private int getSizeInBytes(long largefileSizeInBytes, int numberOfFilesforSplit) {
        if (largefileSizeInBytes % numberOfFilesforSplit != 0) {
            // Round the total up to the next multiple of numberOfFilesforSplit so the division is exact
            largefileSizeInBytes = ((largefileSizeInBytes / numberOfFilesforSplit) + 1) * numberOfFilesforSplit;
        }
        long x = largefileSizeInBytes / numberOfFilesforSplit;
        // The result is used as a byte[] length, so it must fit in an int
        if (x > Integer.MAX_VALUE) {
            throw new NumberFormatException("size too large");
        }
        return (int) x;
    }
}
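The size calculation above is a ceiling division. The following sketch isolates that step and also shows an edge case worth knowing about: when the file is small relative to the requested count, rounding up can produce fewer files than requested. The SplitSizeMath class and its method names are illustrative, and the sample sizes are hypothetical.

```java
// Illustrative helper mirroring the rounding logic; not part of the article's code.
class SplitSizeMath {

    // Size of each split file: fileSize / numberOfFiles, rounded up.
    static int chunkSizeFor(long fileSize, int numberOfFiles) {
        if (numberOfFiles <= 0) {
            throw new IllegalArgumentException("numberOfFiles must be positive");
        }
        long size = fileSize / numberOfFiles + (fileSize % numberOfFiles == 0 ? 0 : 1);
        // Throws ArithmeticException if the size does not fit in an int
        return Math.toIntExact(size);
    }

    // Number of files actually produced when splitting fileSize bytes into chunks of chunkSize.
    static long resultingFileCount(long fileSize, int chunkSize) {
        if (fileSize == 0) {
            return 0;
        }
        return fileSize / chunkSize + (fileSize % chunkSize == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        // 10 bytes into 3 files: chunks of 4 bytes, giving files of 4, 4, and 2 bytes
        int sizeForThree = chunkSizeFor(10, 3);
        System.out.println(sizeForThree + " -> " + resultingFileCount(10, sizeForThree));

        // 10 bytes into 6 files: chunks of 2 bytes, which yields only 5 files
        int sizeForSix = chunkSizeFor(10, 6);
        System.out.println(sizeForSix + " -> " + resultingFileCount(10, sizeForSix));
    }
}
```

For files much larger than the requested count, as in our example, the number of files produced matches the requested number and only the last file is slightly smaller.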
Now, let’s test this:
@Test
void givenLargeFile_whenSplitLargeFile_thenSplitByNumberOfFiles() throws Exception {
    File input = largeFilePath().toFile();
    SplitLargeFile slf = new SplitLargeFile();
    slf.splitByNumberOfFiles(input, 3, splitFileDirPath());
}
Finally, once we run the test, we can see that the program splits the large file into three files of about 1.4MB each and puts them under the project target/split directory.
5. Conclusion
In this article, we saw the differences between reading files in memory and via streams, which helps us choose the appropriate approach for a given use case. We then discussed how to split large files into smaller ones, both by file size and by number of files.
As always, the example code used in this article is available over on GitHub.