Download a File From an URL in Java

Last modified: May 30, 2018

1. Overview

In this tutorial, we’ll see several methods that we can use to download a file.

We’ll cover examples ranging from the basic usage of Java IO to the NIO package as well as some common libraries like AsyncHttpClient and Apache Commons IO.

Finally, we’ll talk about how we can resume a download if our connection fails before the whole file is read.

2. Using Java IO

The most basic API we can use to download a file is Java IO. We can use the URL class to open a connection to the file we want to download.

To effectively read the file, we’ll use the openStream() method to obtain an InputStream:

BufferedInputStream in = new BufferedInputStream(new URL(FILE_URL).openStream());

When reading from an InputStream, it’s recommended to wrap it in a BufferedInputStream to increase the performance.

The performance increase comes from buffering. When reading one byte at a time using the read() method, each method call can result in a system call to the underlying operating system. When the JVM invokes the read() system call, the program execution context switches from user mode to kernel mode and back.

This context switch is expensive from a performance perspective. When we read a large number of bytes, the application performance will be poor, due to a large number of context switches involved.

For writing the bytes read from the URL to our local file, we’ll use the write() method from the FileOutputStream class:

try (BufferedInputStream in = new BufferedInputStream(new URL(FILE_URL).openStream());
  FileOutputStream fileOutputStream = new FileOutputStream(FILE_NAME)) {
    byte[] dataBuffer = new byte[1024];
    int bytesRead;
    while ((bytesRead = in.read(dataBuffer, 0, 1024)) != -1) {
        fileOutputStream.write(dataBuffer, 0, bytesRead);
    }
} catch (IOException e) {
    // handle exception
}

When using a BufferedInputStream, each read from the underlying stream fetches up to a full buffer’s worth of bytes instead of a single byte. In our example, we’re already reading blocks of 1024 bytes at a time, so the BufferedInputStream isn’t strictly necessary.
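
As a rough, self-contained sketch of the same loop without the BufferedInputStream wrapper: the class name StreamDownload, the method signature and the 8 KB block size are assumptions made for illustration, not part of the original example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, named here only for illustration.
final class StreamDownload {

    // Assumed block size; any reasonably large value avoids byte-by-byte reads.
    private static final int BUFFER_SIZE = 8 * 1024;

    /** Copies everything the URL serves into target and returns the number of bytes written. */
    static long download(URL source, Path target) throws IOException {
        try (InputStream in = source.openStream();
             OutputStream out = Files.newOutputStream(target)) {
            byte[] buffer = new byte[BUFFER_SIZE];
            long total = 0;
            int bytesRead;
            // read() may return fewer bytes than the buffer holds,
            // so only the bytes actually read are written out.
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
                total += bytesRead;
            }
            return total;
        }
    }
}
```

Because the loop already asks for a whole block per call, the number of reads against the underlying stream stays low even without an extra buffering layer.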

The example above is very verbose, but luckily, as of Java 7, we have the Files class that contains helper methods for handling IO operations.

We can use the Files.copy() method to read all the bytes from an InputStream and copy them to a local file:

try (InputStream in = new URL(FILE_URL).openStream()) {
    Files.copy(in, Paths.get(FILE_NAME), StandardCopyOption.REPLACE_EXISTING);
}

Our code works well but can be improved. Its main drawback is the fact that the bytes are buffered into memory.

Fortunately, Java offers us the NIO package that has methods to transfer bytes directly between two Channels without buffering.

We’ll go into detail in the next section.

3. Using NIO

The Java NIO package offers the possibility to transfer bytes between two Channels without buffering them into the application memory.

To read the file from our URL, we’ll create a new ReadableByteChannel from the URL stream:

ReadableByteChannel readableByteChannel = Channels.newChannel(url.openStream());

The bytes read from the ReadableByteChannel will be transferred to a FileChannel corresponding to the file that will be downloaded:

FileOutputStream fileOutputStream = new FileOutputStream(FILE_NAME);
FileChannel fileChannel = fileOutputStream.getChannel();

We’ll use the transferFrom() method of the FileChannel class to download the bytes from the given URL into our FileChannel:

fileOutputStream.getChannel()
  .transferFrom(readableByteChannel, 0, Long.MAX_VALUE);

The transferTo() and transferFrom() methods are more efficient than simply reading from a stream using a buffer. Depending on the underlying operating system, the data can be transferred directly from the filesystem cache to our file without copying any bytes into the application memory.

On Linux and UNIX systems, these methods use the zero-copy technique that reduces the number of context switches between the kernel mode and user mode.
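
Putting the NIO pieces together, a minimal sketch might look like the following. The class name NioDownload, the method signature and the per-call CHUNK limit are assumptions for illustration. Since transferFrom() is allowed to move fewer bytes than requested, this version loops until a call transfers nothing more.

```java
import java.io.IOException;
import java.net.URL;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.ReadableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, named here only for illustration.
final class NioDownload {

    // Assumed upper bound on the bytes requested per transferFrom() call.
    private static final long CHUNK = 16L * 1024 * 1024;

    /** Transfers the URL content into target and returns the number of bytes written. */
    static long download(URL source, Path target) throws IOException {
        try (ReadableByteChannel in = Channels.newChannel(source.openStream());
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long position = 0;
            long transferred;
            // Keep transferring from the current end of the file
            // until the source channel has nothing left to give.
            while ((transferred = out.transferFrom(in, position, CHUNK)) > 0) {
                position += transferred;
            }
            return position;
        }
    }
}
```

The try-with-resources block closes both channels even when the transfer fails halfway, which the shorter snippets above leave to the caller.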

4. Using Libraries

We’ve seen in the examples above how to download content from an URL just by using the Java core functionality.

We also can leverage the functionality of existing libraries to ease our work, when performance tweaks aren’t needed.

For example, in a real-world scenario, we’d need our download code to be asynchronous.

We could wrap all the logic into a Callable, or we could use an existing library for this.
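
To give an idea of the Callable route, here is a minimal sketch. The class name CallableDownload and the method task() are hypothetical, and the body simply reuses Files.copy() from Section 2.

```java
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.Callable;

// Hypothetical helper, named here only for illustration.
final class CallableDownload {

    /** Builds a task that downloads source into target and yields the number of bytes copied. */
    static Callable<Long> task(URL source, Path target) {
        return () -> {
            // Callable.call() may throw checked exceptions, so no wrapping is needed here.
            try (InputStream in = source.openStream()) {
                return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
        };
    }
}
```

Submitting the task to an ExecutorService returns a Future<Long>, so the caller can carry on with other work and call get() only when the result is needed.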

4.1. AsyncHttpClient

AsyncHttpClient is a popular library for executing asynchronous HTTP requests using the Netty framework. We can use it to execute a GET request to the file URL and get the file content.

First, we need to create an HTTP client:

AsyncHttpClient client = Dsl.asyncHttpClient();

The downloaded content will be placed into a FileOutputStream:

FileOutputStream stream = new FileOutputStream(FILE_NAME);

Next, we create an HTTP GET request and register an AsyncCompletionHandler handler to process the downloaded content:

client.prepareGet(FILE_URL).execute(new AsyncCompletionHandler<FileOutputStream>() {

    @Override
    public State onBodyPartReceived(HttpResponseBodyPart bodyPart) 
      throws Exception {
        stream.getChannel().write(bodyPart.getBodyByteBuffer());
        return State.CONTINUE;
    }

    @Override
    public FileOutputStream onCompleted(Response response) 
      throws Exception {
        return stream;
    }
});

Notice that we’ve overridden the onBodyPartReceived() method. The default implementation accumulates the HTTP chunks received into an ArrayList. This could lead to high memory consumption, or an OutOfMemoryError when trying to download a large file.

Instead of accumulating each HttpResponseBodyPart into memory, we use a FileChannel to write the bytes to our local file directly. We’ll use the getBodyByteBuffer() method to access the body part content through a ByteBuffer.

A ByteBuffer can be direct, in which case its memory is allocated outside of the JVM heap, so the downloaded bytes don’t add to our application’s heap usage.
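
The difference between heap and direct buffers can be checked with plain JDK calls: ByteBuffer.allocate() returns a heap buffer, ByteBuffer.allocateDirect() returns a direct one, and isDirect() tells them apart. Whether the buffers handed out by AsyncHttpClient are direct is not something this sketch assumes. The helper below, with the hypothetical name BodyPartWriter, writes either kind to a FileChannel and loops because a single write() call isn't guaranteed to drain the buffer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical helper, named here only for illustration.
final class BodyPartWriter {

    /** Writes all remaining bytes of buffer to channel and returns how many were written. */
    static int writeFully(FileChannel channel, ByteBuffer buffer) throws IOException {
        int written = 0;
        // write() may consume only part of the buffer, so repeat until it is empty.
        while (buffer.hasRemaining()) {
            written += channel.write(buffer);
        }
        return written;
    }
}
```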

4.2. Apache Commons IO

Another widely used library for IO operations is Apache Commons IO. We can see from the Javadoc that there’s a utility class named FileUtils that we use for general file manipulation tasks.

To download a file from an URL, we can use this one-liner:

FileUtils.copyURLToFile(
  new URL(FILE_URL), 
  new File(FILE_NAME), 
  CONNECT_TIMEOUT, 
  READ_TIMEOUT);

From a performance standpoint, this code is the same as the one from Section 2.

The underlying code uses the same concepts of reading in a loop some bytes from an InputStream and writing them to an OutputStream.

One difference is that here the URLConnection class is used to control the connection time-outs so that the download doesn’t block for a large amount of time:

URLConnection connection = source.openConnection();
connection.setConnectTimeout(connectionTimeout);
connection.setReadTimeout(readTimeout);
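
The same time-out handling is available without the library. The sketch below is not the Commons IO implementation; it only combines the URLConnection setters shown above with Files.copy(). The class name TimedDownload and the parameter names are assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, named here only for illustration.
final class TimedDownload {

    /**
     * Downloads source into target with the given time-outs in milliseconds.
     * A value of 0 means "wait indefinitely", as defined by URLConnection.
     */
    static long download(URL source, Path target,
                         int connectTimeoutMillis, int readTimeoutMillis) throws IOException {
        URLConnection connection = source.openConnection();
        connection.setConnectTimeout(connectTimeoutMillis);
        connection.setReadTimeout(readTimeoutMillis);
        // An expired time-out surfaces as a SocketTimeoutException, which is an IOException.
        try (InputStream in = connection.getInputStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```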

5. Resumable Download

Considering internet connections fail from time to time, it’s useful to be able to resume a download, instead of downloading the file again from byte zero.

Let’s rewrite the first example from earlier to add this functionality.

The first thing to know is that we can read the size of a file from a given URL without actually downloading it by using the HTTP HEAD method:

URL url = new URL(FILE_URL);
HttpURLConnection httpConnection = (HttpURLConnection) url.openConnection();
httpConnection.setRequestMethod("HEAD");
long fileLength = httpConnection.getContentLengthLong();
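
Wrapped into a method, the size lookup could be sketched as follows. The class name RemoteSize is hypothetical, and the fallback branch for non-HTTP URLs (such as file: URLs in a local test) is an addition of this sketch, not something the example above does. Note that getContentLengthLong() returns -1 when the server doesn't report a length.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical helper, named here only for illustration.
final class RemoteSize {

    /** Returns the content length reported for url, or -1 if it is unknown. */
    static long of(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        if (connection instanceof HttpURLConnection) {
            HttpURLConnection http = (HttpURLConnection) connection;
            // HEAD asks for the headers only, so no body is transferred.
            http.setRequestMethod("HEAD");
            try {
                return http.getContentLengthLong();
            } finally {
                http.disconnect();
            }
        }
        // Assumed fallback for other protocols: ask the handler directly,
        // then close the stream it may have opened to answer the question.
        long length = connection.getContentLengthLong();
        connection.getInputStream().close();
        return length;
    }
}
```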

Now that we have the total content size of the file, we can check whether our file is partially downloaded.

If so, we’ll resume the download from the last byte recorded on disk:

long existingFileSize = outputFile.length();
if (existingFileSize < fileLength) {
    httpFileConnection.setRequestProperty(
      "Range", 
      "bytes=" + existingFileSize + "-" + fileLength
    );
}

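Here we’ve configured the URLConnection to request the file bytes in a specific range. Byte offsets are zero-based, so the range starts at the first byte we don’t have yet. It ends at the offset equal to the size of the remote file, which is one past the last byte; a server treats an end position at or beyond the end of the file as “up to the last byte”.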
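
As a small variation, the header value can also be built with an open-ended range, which HTTP defines as “from this offset to the end of the file”. This swaps the explicit end offset used above for the open-ended form; the class name ResumeRange and the method headerValue() are hypothetical.

```java
import java.util.Optional;

// Hypothetical helper, named here only for illustration.
final class ResumeRange {

    /**
     * Returns the Range header value needed to resume a download,
     * or an empty Optional when there is nothing to resume.
     */
    static Optional<String> headerValue(long existingFileSize, long remoteFileSize) {
        // Nothing on disk yet, or the local file is already complete (or larger).
        if (existingFileSize <= 0 || existingFileSize >= remoteFileSize) {
            return Optional.empty();
        }
        // Offsets are zero-based, so the first missing byte sits at existingFileSize.
        return Optional.of("bytes=" + existingFileSize + "-");
    }
}
```

A server that honors the header answers with status 206 (HttpURLConnection.HTTP_PARTIAL). If it answers with 200 instead, it ignored the range and is sending the whole file, so the local file should be overwritten rather than appended to.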

Another common way to use the Range header is for downloading a file in chunks by setting different byte ranges. For example, to download a 2 KB file, we can use the ranges 0 – 1023 and 1024 – 2047, since both ends of a byte range are inclusive.
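
Computing the ranges for a chunked download is mostly a matter of getting the inclusive end offsets right. A minimal sketch, with the hypothetical class name ChunkRanges:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, named here only for illustration.
final class ChunkRanges {

    /** Splits a file of totalSize bytes into Range header values of at most chunkSize bytes each. */
    static List<String> split(long totalSize, long chunkSize) {
        if (totalSize < 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("totalSize must be >= 0 and chunkSize > 0");
        }
        List<String> ranges = new ArrayList<>();
        for (long start = 0; start < totalSize; start += chunkSize) {
            // The end offset is inclusive, hence the "- 1".
            long end = Math.min(start + chunkSize, totalSize) - 1;
            ranges.add("bytes=" + start + "-" + end);
        }
        return ranges;
    }
}
```

For a 2048-byte file and 1024-byte chunks, this yields bytes=0-1023 and bytes=1024-2047.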

Another subtle difference from the code in Section 2 is that the FileOutputStream is opened with the append parameter set to true:

OutputStream os = new FileOutputStream(FILE_NAME, true);
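
To see what append mode changes, here is a small sketch that writes the bytes of any InputStream to the end of an existing file. The class name AppendingWriter and the 8 KB buffer are assumptions; in a real resume, the stream would come from the connection configured with the Range header.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, named here only for illustration.
final class AppendingWriter {

    /** Appends everything from in to the end of target and returns the number of bytes added. */
    static long append(InputStream in, File target) throws IOException {
        // The second constructor argument (true) keeps the existing content
        // and positions every write at the end of the file.
        try (OutputStream out = new FileOutputStream(target, true)) {
            byte[] buffer = new byte[8192];
            long total = 0;
            int bytesRead;
            while ((bytesRead = in.read(buffer)) != -1) {
                out.write(buffer, 0, bytesRead);
                total += bytesRead;
            }
            return total;
        }
    }
}
```

Without the append flag, the FileOutputStream would truncate the partial file and the bytes downloaded earlier would be lost.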

After we’ve made this change, the rest of the code is identical to the one from Section 2.

6. Conclusion

We’ve seen in this article several ways to download a file from an URL in Java.

The most common implementation is to buffer the bytes when performing the read/write operations. This implementation is safe to use even for large files because we don’t load the whole file into memory.

We’ve also seen how to implement a zero-copy download using Java NIO Channels. This is useful because it minimizes the number of context switches done when reading and writing bytes, and by using direct buffers, the bytes are not loaded into the application memory.

Also, because downloading a file is usually done over HTTP, we’ve shown how to achieve this using the AsyncHttpClient library.

The source code for the article is available over on GitHub.
