Illegal Character Compilation Error

Last modified: April 27, 2022

1. Overview

The illegal character compilation error is a file encoding error. It occurs when a source file is saved with an encoding, or an encoding marker, that the compiler doesn’t expect. As a result, in languages like Java, we can get this error when we try to compile our project. In this tutorial, we’ll describe the problem in detail, look at some scenarios where we may encounter it, and then present some examples of how to resolve it.

2. Illegal Character Compilation Error

2.1. Byte Order Mark (BOM)

Before we go into the byte order mark, we need to take a quick look at the UCS (Unicode) Transformation Format (UTF). UTF is a family of character encodings, each of which can encode every code point in Unicode. The family includes UTF-8, UTF-16, and UTF-32; of these, UTF-8 is the most widely used.

UTF-8 is a variable-width encoding built from 8-bit code units, which keeps it compatible with ASCII. A file saved in this encoding may begin with a few bytes that encode the code point U+FEFF, the byte order mark (BOM). In UTF-8, these are the three bytes EF BB BF. When handled correctly, this mark is invisible. However, in some cases, it could lead to data errors.
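
To make the mark concrete, the following minimal sketch encodes the single code point U+FEFF with three of the JDK's standard charsets and prints the resulting bytes. The class name BomBytes and the helper hex are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

class BomBytes {
    // Formats bytes as upper-case hex pairs separated by spaces.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bom = "\uFEFF";
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_8)));    // EF BB BF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16BE))); // FE FF
        System.out.println(hex(bom.getBytes(StandardCharsets.UTF_16LE))); // FF FE
    }
}
```

The same code point therefore shows up as a different byte sequence in each encoding, which is why a tool has to know which mark to look for.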

In UTF-8, the BOM isn’t required, because the encoding has no byte-order ambiguity to resolve. Even so, it may still appear in UTF-8 encoded text. It can be added by an encoding conversion or by a text editor that uses it to flag the content as UTF-8.

Text editors like Notepad on Windows can add the BOM when saving a file. As a consequence, when we use such an editor to write a code example and then try to compile it, we could get a compilation error. In contrast, most modern IDEs save new files as UTF-8 without the BOM by default. The next sections will show some examples of this problem.

2.2. Class with Illegal Character Compilation Error

Typically, we work with advanced IDEs, but sometimes we use a plain text editor instead. Unfortunately, as we’ve learned, some text editors save files with a BOM, and that could lead to a compilation error in Java. The “illegal character” error occurs in the compilation phase, so it’s quite easy to detect. The next example shows us how it works.

First, let’s write a simple class in our text editor, such as Notepad. This class is just a placeholder; any code would do for the test. Next, we save our file with the BOM:

public class TestBOM {
    public static void main(String ...args){
        System.out.println("BOM Test");
    }
}

Now, let’s try to compile this file using the javac command:

$ javac ./TestBOM.java

As a result, we get the error message:

public class TestBOM {
 ^
.\TestBOM.java:1: error: illegal character: '\u00bf'
public class TestBOM {
  ^
2 errors
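
The '\u00bf' in the message is not the BOM code point itself. One plausible explanation, offered here as an assumption rather than something the output proves, is that javac decoded the file with a single-byte platform charset instead of UTF-8. The sketch below decodes the three BOM bytes as ISO-8859-1 (windows-1252 maps these three bytes to the same characters) and checks which of the resulting characters Java would accept at the start of an identifier. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MisdecodedBom {
    // Decodes the UTF-8 BOM bytes with a single-byte charset, as a compiler
    // configured for a non-UTF-8 source encoding effectively would.
    static String decodeBomAsLatin1() {
        byte[] bom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF };
        return new String(bom, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        for (char c : decodeBomAsLatin1().toCharArray()) {
            // Reports the code point and whether it may start a Java identifier.
            System.out.printf("U+%04X identifierStart=%b%n",
                (int) c, Character.isJavaIdentifierStart(c));
        }
    }
}
```

Under that assumption, U+00EF is a letter and passes as an identifier start, while U+00BB and U+00BF do not, which would be consistent with the two errors reported above.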

To fix this problem, we only need to save the file as UTF-8 without the BOM. After that, the file compiles. We should always check that our source files are saved without a BOM.

Another way to fix this issue is with a tool like dos2unix. This tool removes the BOM and also takes care of other idiosyncrasies of Windows text files, such as CRLF line endings.
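
Where neither an editor setting nor an external tool is convenient, the mark can also be removed programmatically. The following is a minimal sketch, assuming the files are small enough to load into memory and that only the UTF-8 mark (EF BB BF) matters. BomStripper, stripUtf8Bom and stripInPlace are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

class BomStripper {
    // Returns the input without a leading EF BB BF; any other input is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
            && (data[0] & 0xFF) == 0xEF
            && (data[1] & 0xFF) == 0xBB
            && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    // Rewrites the file only when a BOM was found; returns true if it was rewritten.
    static boolean stripInPlace(Path file) throws IOException {
        byte[] original = Files.readAllBytes(file);
        byte[] cleaned = stripUtf8Bom(original);
        if (cleaned.length == original.length) {
            return false;
        }
        Files.write(file, cleaned);
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Each argument is treated as a path to a source file to clean.
        for (String arg : args) {
            Path file = Paths.get(arg);
            System.out.println(file + (stripInPlace(file) ? ": BOM removed" : ": no BOM found"));
        }
    }
}
```

Unlike dos2unix, this sketch leaves line endings untouched.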

3. Reading Files

Next, let’s analyze some examples of reading files that start with a BOM.

First, we need to create a file with a BOM to use for our tests. This file contains our sample text, “Hello world with BOM.”, which will be our expected string. Next, let’s start testing.
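
If no BOM-producing editor is at hand, such a file can be generated from code. This is a sketch under the assumption that a UTF-8 file is wanted; the class name BomFixture, the method writeBomFile and the file name bom-file.txt are placeholders, not names taken from the article's project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class BomFixture {
    static final String EXPECTED = "Hello world with BOM.";

    // Writes U+FEFF followed by the sample text, encoded as UTF-8.
    static Path writeBomFile(Path target) throws IOException {
        String content = "\uFEFF" + EXPECTED;
        return Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Placeholder file name; any writable location works.
        Path written = writeBomFile(Paths.get("bom-file.txt"));
        System.out.println("Wrote " + Files.size(written) + " bytes to " + written);
    }
}
```

The sample text is 21 ASCII characters, so the file holds 24 bytes: three for the mark and one per character.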

3.1. Reading Files Using BufferedReader

First, we’ll test the file using the BufferedReader class:

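@Test
public void whenInputFileHasBOM_thenUseInputStream() throws IOException {
    String line;
    String actual = "";
    // Decode explicitly as UTF-8 so the result doesn't depend on the platform default charset
    try (BufferedReader br = new BufferedReader(new InputStreamReader(file, StandardCharsets.UTF_8))) {
        while ((line = br.readLine()) != null) {
            actual += line;
        }
    }
    // Fails: actual still starts with the invisible U+FEFF character
    assertEquals(expected, actual);
}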

In this case, when we try to assert that the strings are equal, we get an error:

org.opentest4j.AssertionFailedError: expected: <Hello world with BOM.> but was: <Hello world with BOM.>
Expected :Hello world with BOM.
Actual   :Hello world with BOM.

At first glance, both strings in the test output look identical. However, the actual string begins with the invisible BOM character, so the strings aren’t equal.
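
Because the mark doesn't show up in most consoles, it can help to print the strings with every non-ASCII character escaped. The helper below is a small debugging sketch; VisibleString and escape are hypothetical names.

```java
class VisibleString {
    // Keeps printable ASCII as-is and renders every other char as a backslash-u escape.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String expected = "Hello world with BOM.";
        String actual = "\uFEFF" + expected;
        System.out.println(escape(expected) + " (length " + expected.length() + ")");
        System.out.println(escape(actual) + " (length " + actual.length() + ")");
    }
}
```

Comparing the lengths alone already reveals the problem: the string read from the file is one character longer than the expected one.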

A quick fix is to remove the BOM character:

@Test
public void whenInputFileHasBOM_thenUseInputStreamWithReplace() throws IOException {
    String line;
    String actual = "";
    // Decode explicitly as UTF-8; otherwise the BOM bytes may not become U+FEFF at all
    try (BufferedReader br = new BufferedReader(new InputStreamReader(file, StandardCharsets.UTF_8))) {
        while ((line = br.readLine()) != null) {
            // Drop every U+FEFF found in the line
            actual += line.replace("\uFEFF", "");
        }
    }
    assertEquals(expected, actual);
}

The replace method removes the BOM from our string, so our test passes. We need to use it with care, though: it scans every line, even though the BOM sits only at the very start of the file. When we process a huge number of files, this extra work can lead to performance issues.
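
Since the mark sits only at the start of the stream, an alternative is to inspect the first character once and leave the remaining lines untouched. The sketch below does this with BufferedReader's mark and reset; BomAwareReading and readAllLines are hypothetical names, and, like the tests above, it joins lines without separators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class BomAwareReading {
    // Reads all lines, skipping a single leading U+FEFF if one is present.
    static String readAllLines(Reader source) throws IOException {
        StringBuilder actual = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source)) {
            br.mark(1);               // remember the start of the stream
            if (br.read() != 0xFEFF) {
                br.reset();           // first char was not a BOM, so put it back
            }
            String line;
            while ((line = br.readLine()) != null) {
                actual.append(line);
            }
        }
        return actual.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAllLines(new StringReader("\uFEFFHello world with BOM.")));
    }
}
```

A U+FEFF that appears later in the text is left alone here, which differs from the replace-based version.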

3.2. Reading Files Using Apache Commons IO

In addition, the Apache Commons IO library provides the BOMInputStream class. This class wraps a stream that starts with an encoded ByteOrderMark and can detect and exclude those bytes. Let’s see how it works:

@Test
public void whenInputFileHasBOM_thenUseBOMInputStream() throws IOException {
    String line;
    String actual = "";
    ByteOrderMark[] byteOrderMarks = new ByteOrderMark[] {
      ByteOrderMark.UTF_8, ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
      ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE
    };
    // include = false: a detected BOM is skipped instead of being passed on to the reader
    InputStream inputStream = new BOMInputStream(ioStream, false, byteOrderMarks);
    try (BufferedReader br = new BufferedReader(new InputStreamReader(inputStream))) {
        while ((line = br.readLine()) != null) {
            actual += line;
        }
    }
    assertEquals(expected, actual);
}

The code is similar to the previous examples, but we wrap the input in a BOMInputStream and pass that to the InputStreamReader. The false argument tells BOMInputStream to exclude the BOM from the bytes it returns.
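
For projects without Commons IO on the classpath, the same idea can be sketched with only the JDK. This is not how BOMInputStream is implemented; it is a simplified stand-in that handles the UTF-8 mark only, and BomSkipper and skipUtf8Bom are hypothetical names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

class BomSkipper {
    // Wraps the stream and consumes a leading EF BB BF if present.
    // Any other leading bytes are pushed back so the caller still sees them.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) {
                break; // stream ended before three bytes were available
            }
            n += r;
        }
        boolean bom = n == 3
            && (head[0] & 0xFF) == 0xEF
            && (head[1] & 0xFF) == 0xBB
            && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) {
            pin.unread(head, 0, n);
        }
        return pin;
    }
}
```

The returned stream can be handed to an InputStreamReader with an explicit UTF-8 charset, in the same position where the BOMInputStream is used above.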

3.3. Reading Files Using Google Data (GData)

Another helpful library to handle the BOM is Google Data (GData). This is an older library, but it helps manage the BOM inside the files. It uses XML as its underlying format. Let’s see it in action:

@Test
public void whenInputFileHasBOM_thenUseGoogleGdata() throws IOException {
    char[] actual = new char[21];
    try (Reader r = new UnicodeReader(ioStream, null)) {
        r.read(actual);
    }
    assertEquals(expected, String.valueOf(actual));
}

As we observed in the previous examples, removing the BOM when reading files is important. If we don’t handle it properly, we’ll get unexpected results when the data is read. That’s why we need to be aware of the existence of this mark in our files.

4. Conclusion

In this article, we covered several topics regarding the illegal character compilation error in Java. First, we learned what UTF is and where the BOM fits into it. Second, we showed a sample class created using a text editor, Windows Notepad in this case. Compiling that class failed with the illegal character error. Finally, we presented some code examples of how to read files with a BOM.

As usual, all the code used for this example can be found over on GitHub.
