Compare the Content of Two Files in Java

Last modified: August 20, 2021

1. Overview

In this tutorial, we’ll review different approaches to determine if the contents of two files are equal. We’ll be using core Java Stream I/O libraries to read the contents of the files and implement basic comparisons.

To finish, we’ll review the support provided in Apache Commons I/O to check for content equality of two files.

2. Byte by Byte Comparison

Let’s start with a simple approach to reading the bytes from the two files to compare them sequentially.

To speed up reading the files, we’ll use BufferedInputStream. As we’ll see, BufferedInputStream reads large chunks of bytes from the underlying InputStream into an internal buffer. When the client reads all the bytes in the chunk, the buffer reads another block of bytes from the stream.

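Reading through a BufferedInputStream is typically much faster than reading one byte at a time from the underlying stream, because most read() calls are served from the in-memory buffer rather than from the file.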

Let’s write a method that uses BufferedInputStreams to compare two files:

public static long filesCompareByByte(Path path1, Path path2) throws IOException {
    try (BufferedInputStream fis1 = new BufferedInputStream(new FileInputStream(path1.toFile()));
         BufferedInputStream fis2 = new BufferedInputStream(new FileInputStream(path2.toFile()))) {
        
        int ch = 0;
        long pos = 1;
        while ((ch = fis1.read()) != -1) {
            if (ch != fis2.read()) {
                return pos;
            }
            pos++;
        }
        if (fis2.read() == -1) {
            return -1;
        }
        else {
            return pos;
        }
    }
}

We use the try-with-resources statement to ensure that the two BufferedInputStreams are closed at the end of the statement.

With the while loop, we read each byte of the first file and compare it with the corresponding byte of the second file. If we find a discrepancy, we return the position of the mismatch, counting from 1. Otherwise, the files are identical and the method returns -1L.

If the files are of different sizes but the bytes of the smaller file match the corresponding bytes of the larger file, the method returns the size in bytes of the smaller file plus one, since that is the first position at which only one of the files still has data.
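
To make these return values concrete, here is a minimal sketch that applies the same loop to in-memory streams. The class name ByteCompareDemo and the helper firstMismatch are hypothetical names introduced for illustration; they are not part of the original example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class ByteCompareDemo {

    // Same loop as filesCompareByByte, but on arbitrary streams (hypothetical helper).
    static long firstMismatch(InputStream in1, InputStream in2) throws IOException {
        long pos = 1; // positions are counted from 1, as in the method above
        int b;
        while ((b = in1.read()) != -1) {
            if (b != in2.read()) {
                return pos;
            }
            pos++;
        }
        // The first stream is exhausted: equal only if the second one is too
        return in2.read() == -1 ? -1 : pos;
    }

    // Convenience overload for in-memory data (also an assumption of this sketch).
    static long firstMismatch(byte[] a, byte[] b) {
        try {
            return firstMismatch(new ByteArrayInputStream(a), new ByteArrayInputStream(b));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstMismatch("abc".getBytes(), "abc".getBytes()));       // -1
        System.out.println(firstMismatch("abcdef".getBytes(), "abXdef".getBytes())); // 3
        System.out.println(firstMismatch("abc".getBytes(), "abcdef".getBytes()));    // 4
    }
}
```
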

3. Line by Line Comparison

To compare text files, we can write an implementation that reads the files line by line and checks whether the lines are equal.

Let’s work with a BufferedReader, which uses the same strategy as BufferedInputStream: it copies chunks of data from the file to an internal buffer to speed up the reading process.

Let’s review our implementation:

public static long filesCompareByLine(Path path1, Path path2) throws IOException {
    try (BufferedReader bf1 = Files.newBufferedReader(path1);
         BufferedReader bf2 = Files.newBufferedReader(path2)) {
        
        long lineNumber = 1;
        String line1 = "", line2 = "";
        while ((line1 = bf1.readLine()) != null) {
            line2 = bf2.readLine();
            if (line2 == null || !line1.equals(line2)) {
                return lineNumber;
            }
            lineNumber++;
        }
        if (bf2.readLine() == null) {
            return -1;
        }
        else {
            return lineNumber;
        }
    }
}

The code follows a strategy similar to the previous example. In the while loop, instead of reading bytes, we read a line of each file and check for equality. If all the lines are identical for both files, then we return -1L, but if there’s a discrepancy, we return the line number (counting from 1) where the first mismatch is found.

If the files have a different number of lines but the shorter file matches the corresponding lines of the longer file, the method returns the number of lines of the shorter file plus one.
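
One consequence of reading with readLine is worth keeping in mind: it strips line terminators, so files that differ only in their line endings, or in a trailing newline, compare as equal line by line. The sketch below illustrates this on in-memory text; ReadLineDemo and linesOf are hypothetical names used only for this illustration.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;

class ReadLineDemo {

    // Splits text into lines the same way BufferedReader.readLine() does.
    static List<String> linesOf(String text) {
        // An in-memory reader holds no system resources, so it is not closed here
        return new BufferedReader(new StringReader(text))
            .lines()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> unixStyle = linesOf("line 1\nline 2");
        List<String> windowsStyle = linesOf("line 1\r\nline 2\r\n");

        // Both lists contain "line 1" and "line 2", without any terminators
        System.out.println(unixStyle.equals(windowsStyle)); // true
    }
}
```
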

4. Comparing with Files::mismatch

The method Files::mismatch, added in Java 12, compares the contents of two files. It returns -1L if the files are identical, and otherwise, it returns the position in bytes of the first mismatch.

This method internally reads chunks of data from the files’ InputStreams and uses Arrays::mismatch, introduced in Java 9, to compare them.

For files that are of different sizes but for which the contents of the smaller file are identical to the corresponding contents of the larger file, it returns the size (in bytes) of the smaller file. Note that positions here are zero-based, whereas our first example counts positions from 1.

To see examples of how to use this method, please see our article covering the new features of Java 12.
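
As a quick illustration, the following sketch calls Files.mismatch on temporary files and prints the three kinds of results described above. It assumes Java 12 or later; MismatchDemo and mismatchOf are hypothetical names, not part of the JDK or of the original article.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

class MismatchDemo {

    // Hypothetical helper: writes both strings to temporary files and compares the files.
    static long mismatchOf(String content1, String content2) {
        try {
            Path path1 = Files.createTempFile("mismatch1", ".txt");
            Path path2 = Files.createTempFile("mismatch2", ".txt");
            try {
                Files.writeString(path1, content1);
                Files.writeString(path2, content2);
                return Files.mismatch(path1, path2); // available since Java 12
            } finally {
                Files.deleteIfExists(path1);
                Files.deleteIfExists(path2);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(mismatchOf("abcdef", "abcdef")); // identical files: -1
        System.out.println(mismatchOf("abcdef", "abXdef")); // zero-based position: 2
        System.out.println(mismatchOf("abc", "abcdef"));    // size of the smaller file: 3
    }
}
```
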

5. Using Memory Mapped Files

A memory-mapped file is a kernel object that maps the bytes from a disk file to the computer’s memory address space. The heap memory is circumvented, as the Java code manipulates the contents of the memory-mapped files as if we’re directly accessing the memory.

For large files, reading and writing data through memory-mapped files can be considerably faster than using the standard Java I/O library. The computer needs an adequate amount of memory to handle the job; otherwise, thrashing may occur.

Let’s write a very simple example that shows how to compare the contents of two files using memory-mapped files:

public static boolean compareByMemoryMappedFiles(Path path1, Path path2) throws IOException {
    try (RandomAccessFile randomAccessFile1 = new RandomAccessFile(path1.toFile(), "r"); 
         RandomAccessFile randomAccessFile2 = new RandomAccessFile(path2.toFile(), "r")) {
        
        FileChannel ch1 = randomAccessFile1.getChannel();
        FileChannel ch2 = randomAccessFile2.getChannel();
        if (ch1.size() != ch2.size()) {
            return false;
        }
        long size = ch1.size();
        MappedByteBuffer m1 = ch1.map(FileChannel.MapMode.READ_ONLY, 0L, size);
        MappedByteBuffer m2 = ch2.map(FileChannel.MapMode.READ_ONLY, 0L, size);

        return m1.equals(m2);
    }
}

The method returns true if the contents of the files are identical; otherwise, it returns false.

We open the files using the RandomAccessFile class and access their respective FileChannel to get the MappedByteBuffer. This is a direct byte buffer that is a memory-mapped region of the file. In this simple implementation, we use its equals method to compare in memory the bytes of the whole file in one pass.
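
A single call to FileChannel.map can cover at most Integer.MAX_VALUE bytes, so mapping a whole file in one go only works for files of up to roughly 2 GB. One possible way to handle larger files is to map and compare them window by window. The sketch below is an assumption-laden variant, not the article's method: ChunkedMappedCompare, contentEquals, sameContent and the chunkSize parameter are all hypothetical names, and the tiny chunk size in main is chosen only to exercise the loop.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ChunkedMappedCompare {

    // Compares two files by mapping windows of at most chunkSize bytes at a time.
    static boolean contentEquals(Path path1, Path path2, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        try (FileChannel ch1 = FileChannel.open(path1, StandardOpenOption.READ);
             FileChannel ch2 = FileChannel.open(path2, StandardOpenOption.READ)) {
            long size = ch1.size();
            if (size != ch2.size()) {
                return false;
            }
            for (long pos = 0; pos < size; pos += chunkSize) {
                long length = Math.min(chunkSize, size - pos);
                MappedByteBuffer m1 = ch1.map(FileChannel.MapMode.READ_ONLY, pos, length);
                MappedByteBuffer m2 = ch2.map(FileChannel.MapMode.READ_ONLY, pos, length);
                if (!m1.equals(m2)) {
                    return false;
                }
            }
            return true;
        }
    }

    // Demo helper: writes two strings to temporary files and compares them.
    // The files are left in place, since deleting a file that is still mapped
    // can fail on some platforms.
    static boolean sameContent(String content1, String content2, int chunkSize) {
        try {
            Path path1 = Files.createTempFile("mapped1", ".txt");
            Path path2 = Files.createTempFile("mapped2", ".txt");
            Files.writeString(path1, content1);
            Files.writeString(path2, content2);
            return contentEquals(path1, path2, chunkSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sameContent("hello world", "hello world", 4)); // true
        System.out.println(sameContent("hello world", "hello worle", 4)); // false
    }
}
```
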

6. Using Apache Commons I/O

The methods IOUtils::contentEquals and IOUtils::contentEqualsIgnoreEOL compare the contents of two files to determine equality. The difference between them is that contentEqualsIgnoreEOL ignores line feed (\n) and carriage return (\r) characters. This is useful because operating systems use different combinations of these control characters to mark a new line.

Let’s see a simple example to check for equality:

@Test
public void whenFilesIdentical_thenReturnTrue() throws IOException {
    Path path1 = Files.createTempFile("file1Test", ".txt");
    Path path2 = Files.createTempFile("file2Test", ".txt");

    // Write the contents first, then open the streams for reading
    Files.writeString(path1, "testing line 1" + System.lineSeparator() + "line 2");
    Files.writeString(path2, "testing line 1" + System.lineSeparator() + "line 2");

    // try-with-resources closes both streams once the comparison is done
    try (InputStream inputStream1 = new FileInputStream(path1.toFile());
         InputStream inputStream2 = new FileInputStream(path2.toFile())) {
        assertTrue(IOUtils.contentEquals(inputStream1, inputStream2));
    }
}

If we want to ignore newline control characters but otherwise check the contents for equality, we can use contentEqualsIgnoreEOL:

@Test
public void whenFilesIdenticalIgnoreEOL_thenReturnTrue() throws IOException {
    Path path1 = Files.createTempFile("file1Test", ".txt");
    Path path2 = Files.createTempFile("file2Test", ".txt");

    // Same text, but with different line terminators
    Files.writeString(path1, "testing line 1 \n line 2");
    Files.writeString(path2, "testing line 1 \r\n line 2");

    // try-with-resources closes both readers once the comparison is done
    try (Reader reader1 = new BufferedReader(new FileReader(path1.toFile()));
         Reader reader2 = new BufferedReader(new FileReader(path2.toFile()))) {
        assertTrue(IOUtils.contentEqualsIgnoreEOL(reader1, reader2));
    }
}

7. Conclusion

In this article, we’ve covered several ways to implement a comparison of the contents of two files to check for equality.

The source code can be found over on GitHub.
