How to Read a Large File Efficiently with Java

Last modified: December 26, 2013

1. Overview

This tutorial will show how to read all the lines from a large file in Java in an efficient manner.

This article is part of the “Java – Back to Basic” tutorial here on Baeldung.

2. Reading in Memory

The standard way of reading the lines of a file is to load them all into memory. Both Guava and Apache Commons IO provide a quick way to do just that:

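// Guava: com.google.common.io.Files
Files.readLines(new File(path), Charsets.UTF_8);
// Apache Commons IO: org.apache.commons.io.FileUtils (charset given explicitly)
FileUtils.readLines(new File(path), "UTF-8");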

The problem with this approach is that all the lines of the file are kept in memory, which will quickly lead to an OutOfMemoryError if the file is large enough.

For example, reading a ~1 GB file:

@Test
public void givenUsingGuava_whenIteratingAFile_thenWorks() throws IOException {
    String path = ...
    Files.readLines(new File(path), Charsets.UTF_8);
}

This starts off with only a small amount of memory being consumed (about 12 MB):

[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 128 Mb
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 116 Mb

However, after the full file has been processed, we end up with (~2 GB consumed):

[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 2666 Mb
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 490 Mb

This means that about 2.1 GB of memory is consumed by the process. The reason is simple: the lines of the file are all stored in memory now.
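
The Total Memory and Free Memory figures in these logs can be read from java.lang.Runtime, and the consumed amount is simply their difference. The following is a minimal sketch of that measurement; the MemoryReport class and its method names are hypothetical and are not the logging code used by the original test.

```java
// Hypothetical helper for illustration; not the article's original logging code.
final class MemoryReport {
    private static final long MB = 1024L * 1024L;

    /** Heap currently reserved by the JVM, in megabytes. */
    static long totalMb() {
        return Runtime.getRuntime().totalMemory() / MB;
    }

    /** Unused part of the reserved heap, in megabytes. */
    static long freeMb() {
        return Runtime.getRuntime().freeMemory() / MB;
    }

    /** Consumed heap is total minus free, e.g. 2666 - 490 = 2176 for the log above. */
    static long usedMb(long totalMb, long freeMb) {
        return totalMb - freeMb;
    }

    public static void main(String[] args) {
        long total = totalMb();
        long free = freeMb();
        System.out.println("Total Memory: " + total + " Mb");
        System.out.println("Free Memory: " + free + " Mb");
        System.out.println("Used Memory: " + usedMb(total, free) + " Mb");
    }
}
```

Calling such a helper before and after reading the file gives the two snapshots being compared. The numbers depend on the JVM heap settings and on when garbage collection last ran, so they are best treated as rough indicators.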

It should be obvious by this point that keeping the contents of the file in memory will quickly exhaust the available memory, regardless of how much that actually is.

What’s more, we usually don’t need all of the lines in the file in memory at once – instead, we just need to be able to iterate through each one, do some processing and throw it away. So, this is exactly what we’re going to do – iterate through the lines without holding all of them in memory.
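
To make the "iterate, process, throw away" idea concrete, here is a minimal sketch that uses only the JDK's BufferedReader. This is an illustration of the pattern, not one of the approaches this article goes on to use; the LineByLine class and the countCharacters method are hypothetical names, and counting characters stands in for whatever per-line processing is actually needed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical example class; the per-line "processing" is just a character count.
final class LineByLine {

    /** Reads the file one line at a time and returns the total number of characters,
     *  excluding line terminators. Only the current line is referenced at any time. */
    static long countCharacters(Path file) throws IOException {
        long characters = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                characters += line.length(); // process the line
                // the reference is overwritten on the next iteration,
                // so the previous line becomes eligible for garbage collection
            }
        }
        return characters;
    }

    public static void main(String[] args) throws IOException {
        // expects the path of the file to read as the first argument
        System.out.println(countCharacters(Paths.get(args[0])));
    }
}
```

Because no collection of lines is ever built, memory use is bounded by the reader's buffer and the longest single line rather than by the size of the file.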

3. Streaming Through the File

Let’s now look at a solution – we’re going to use a java.util.Scanner to run through the contents of the file and retrieve lines serially, one by one:

// try-with-resources closes the Scanner and the stream, even on failure
try (FileInputStream inputStream = new FileInputStream(path);
     Scanner sc = new Scanner(inputStream, "UTF-8")) {
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        // System.out.println(line);
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
}

This solution iterates through all the lines in the file, allowing each line to be processed without keeping a reference to it and, consequently, without keeping the lines in memory (~150 MB consumed):

[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Total Memory: 763 Mb
[main] INFO  org.baeldung.java.CoreJavaIoUnitTest - Free Memory: 605 Mb
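
For reuse, the Scanner loop can be wrapped in a method that hands each line to a callback. The sketch below is one possible shape, assuming Java 8 or later for java.util.function.Consumer; the ScannerLines class and the forEachLine method are hypothetical names introduced here for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Scanner;
import java.util.function.Consumer;

// Hypothetical wrapper around the Scanner-based loop.
final class ScannerLines {

    /** Passes every line of the file to the handler, one at a time,
     *  and returns the number of lines read. */
    static long forEachLine(String path, Consumer<String> handler) throws IOException {
        long count = 0;
        try (FileInputStream inputStream = new FileInputStream(path);
             Scanner sc = new Scanner(inputStream, "UTF-8")) {
            while (sc.hasNextLine()) {
                handler.accept(sc.nextLine()); // the line is not stored anywhere here
                count++;
            }
            // Scanner swallows I/O errors, so surface them explicitly
            if (sc.ioException() != null) {
                throw sc.ioException();
            }
        }
        return count;
    }
}
```

A caller might track the longest line with something like `long[] max = {0}; ScannerLines.forEachLine(path, l -> max[0] = Math.max(max[0], l.length()));`. The memory benefit only holds as long as the handler itself does not accumulate the lines.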

4. Streaming With Apache Commons IO

The same can be achieved with the Commons IO library, using the custom LineIterator it provides:

LineIterator it = FileUtils.lineIterator(theFile, "UTF-8");
try {
    while (it.hasNext()) {
        String line = it.nextLine();
        // do something with line
    }
} finally {
    LineIterator.closeQuietly(it);
}

Since the file is never fully loaded into memory, this also results in fairly conservative memory consumption (~190 MB consumed):

[main] INFO  o.b.java.CoreJavaIoIntegrationTest - Total Memory: 752 Mb
[main] INFO  o.b.java.CoreJavaIoIntegrationTest - Free Memory: 564 Mb

5. Conclusion

This quick article shows how to process the lines of a large file iteratively, without exhausting the available memory, which proves quite useful when working with large files.

The implementation of all these examples and code snippets can be found in our GitHub project – this is a Maven-based project, so it should be easy to import and run as it is.
