UTF-8 Validation in Java

Last modified: December 28, 2023

1. Overview

In data transmission, we often need to handle byte data. If the data is an encoded string rather than raw binary data, it's usually encoded in Unicode. Unicode Transformation Format-8 (UTF-8) is a variable-length encoding that can represent every Unicode character.

In this tutorial, we’ll explore the conversion between UTF-8 encoded bytes and string. After that, we’ll dive into the crucial aspects of conducting UTF-8 validation on byte data in Java.

2. UTF-8 Conversion

Before we jump into the validation sections, let’s review how to convert a string into a UTF-8 encoded byte array and vice versa.

We can simply call a string's getBytes() method with the target encoding to convert it into a byte array:

String UTF8_STRING = "Hello 你好";
byte[] UTF8_BYTES = UTF8_STRING.getBytes(StandardCharsets.UTF_8);

For the reverse direction, the String class provides a constructor that creates a String instance from a byte array and its source encoding:

String decodedStr = new String(array, StandardCharsets.UTF_8);
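
Putting the two conversions together, a quick round-trip sketch (the Utf8RoundTrip class name is ours, for illustration only) confirms that encoding and decoding are lossless for valid text:

```java
import java.nio.charset.StandardCharsets;

// Minimal round-trip sketch: encode a string to UTF-8 bytes, then decode it back.
public class Utf8RoundTrip {
    public static void main(String[] args) {
        String original = "Hello 你好";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        String decoded = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals(original)); // true
    }
}
```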

The constructor we used doesn't offer much control over the decoding process. Whenever the byte array contains malformed or unmappable byte sequences, the constructor replaces them with the default replacement character �:

@Test
void whenDecodeInvalidBytes_thenReturnReplacementChars() {
    byte[] invalidUtf8Bytes = {(byte) 0xF0, (byte) 0xC1, (byte) 0x8C, (byte) 0xBC, (byte) 0xD1};
    String decodedStr = new String(invalidUtf8Bytes, StandardCharsets.UTF_8);
    assertEquals("�����", decodedStr);
}

Therefore, we cannot use this method to validate whether a byte array is encoded in UTF-8.

3. Byte Array Validation

Java provides a simple way to validate whether a byte array is UTF-8 encoded using CharsetDecoder:

CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
CharBuffer decodedCharBuffer = charsetDecoder.decode(java.nio.ByteBuffer.wrap(UTF8_BYTES));

If the decoding process succeeds, we consider those bytes as valid UTF-8. Otherwise, the decode() method throws MalformedInputException:

@Test
void whenDecodeInvalidUTF8Bytes_thenThrowsMalformedInputException() {
    CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
    assertThrows(MalformedInputException.class, () -> {
        charsetDecoder.decode(java.nio.ByteBuffer.wrap(INVALID_UTF8_BYTES));
    });
}
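
The throw-based check above can be wrapped into a small boolean helper. Here's a minimal sketch (the Utf8Validator class and isValidUtf8 method are our own names, not part of any library):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class Utf8Validator {
    // Returns true if the given bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            // decode() throws a CharacterCodingException (e.g. MalformedInputException) on invalid input
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8("Hello 你好".getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(isValidUtf8(new byte[] {(byte) 0xF0, (byte) 0xC1, (byte) 0x8C})); // false
    }
}
```

Catching CharacterCodingException rather than MalformedInputException also covers unmappable-character errors.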

4. Byte Stream Validation

When our source data is a byte stream rather than a byte array, we can read the InputStream and put its content into a byte array. Subsequently, we can apply the encoding validation on the byte array.
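
Assuming the stream fits comfortably in memory, this approach can be sketched with InputStream.readAllBytes() (available since Java 9) followed by the CharsetDecoder check from the previous section:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class StreamValidationViaArray {
    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("Hello 你好".getBytes(StandardCharsets.UTF_8));
        // Drain the stream into a byte array, then validate the array
        byte[] content = in.readAllBytes();
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(content));
            System.out.println("valid");
        } catch (CharacterCodingException e) {
            System.out.println("invalid");
        }
    }
}
```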

However, our preference is to directly validate the InputStream. This avoids creating an extra byte array and reduces the memory footprint in our application. It’s particularly important when we process a large stream.

In this section, we’ll define the following constant as our source UTF-8 encoded InputStream:

InputStream UTF8_INPUTSTREAM = new ByteArrayInputStream(UTF8_BYTES);

4.1. Validation Using Apache Tika

Apache Tika is an open-source content analysis library that provides a set of classes for detecting and extracting text content from different file formats.

We need to include the following Apache Tika core and standard parser dependencies in pom.xml:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.9.1</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.9.1</version>
</dependency>

When we conduct a UTF-8 validation in Apache Tika, we instantiate a UniversalEncodingDetector and use it to detect the encoding of the InputStream. The detector returns the encoding as a Charset instance. We simply verify whether the Charset instance is a UTF-8 one:

@Test
void whenDetectEncoding_thenReturnsUtf8() throws IOException {
    EncodingDetector encodingDetector = new UniversalEncodingDetector();
    Charset detectedCharset = encodingDetector.detect(UTF8_INPUTSTREAM, new Metadata());
    assertEquals(StandardCharsets.UTF_8, detectedCharset);
}

It's worth noting that when we detect a stream containing only characters from the first 128 ASCII code points, the detect() method returns ISO-8859-1 instead of UTF-8.

ISO-8859-1 is a single-byte encoding whose first 128 code points are identical to ASCII and, therefore, to the first 128 Unicode characters. Since UTF-8 encodes these characters with the same single bytes, we can still consider the data to be UTF-8 encoded if the method returns ISO-8859-1.
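
We can see why this fallback is safe with a quick stand-alone check: pure-ASCII bytes decode to the same string under both charsets:

```java
import java.nio.charset.StandardCharsets;

// For bytes in the 0x00–0x7F range, ISO-8859-1 and UTF-8 decode identically.
public class AsciiOverlap {
    public static void main(String[] args) {
        byte[] asciiBytes = "Hello World".getBytes(StandardCharsets.US_ASCII);
        String asLatin1 = new String(asciiBytes, StandardCharsets.ISO_8859_1);
        String asUtf8 = new String(asciiBytes, StandardCharsets.UTF_8);
        System.out.println(asLatin1.equals(asUtf8)); // true
    }
}
```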

4.2. Validation Using ICU4J

ICU4J stands for International Components for Unicode for Java and is a Java library published by IBM. It provides Unicode and globalization support for software applications. We need the following ICU4J dependency in our pom.xml:

<dependency>
    <groupId>com.ibm.icu</groupId>
    <artifactId>icu4j</artifactId>
    <version>74.1</version>
</dependency>

In ICU4J, we create a CharsetDetector instance to detect the charset of the InputStream. Similar to the validation using Apache Tika, we verify whether the charset is UTF-8 or not:

@Test
void whenDetectEncoding_thenReturnsUtf8() throws IOException {
    CharsetDetector detector = new CharsetDetector();
    detector.setText(UTF8_INPUTSTREAM);
    CharsetMatch charsetMatch = detector.detect();
    assertEquals(StandardCharsets.UTF_8.name(), charsetMatch.getName());
}

ICU4J exhibits the same behavior when detecting the encoding of a stream: if the data contains only the first 128 ASCII characters, the detection returns ISO-8859-1.

5. Conclusion

In this article, we've explored the conversion between UTF-8 encoded bytes and strings, as well as different ways to validate UTF-8 encoding for both byte arrays and byte streams in Java.

As always, the sample code is available over on GitHub.
