Guide to Character Encoding

Last modified: December 1, 2018

1. Overview

In this tutorial, we’ll discuss the basics of character encoding and how we handle it in Java.

2. Importance of Character Encoding

We often have to deal with texts belonging to multiple languages with diverse writing scripts like Latin or Arabic. Every character in every language needs to somehow be mapped to a set of ones and zeros. Really, it’s a wonder that computers can process all of our languages correctly.

To do this properly, we need to think about character encoding. Not doing so can often lead to data loss and even security vulnerabilities.

To understand this better, let’s define a method to decode a text in Java:

String decodeText(String input, String encoding) throws IOException {
    return 
      new BufferedReader(
        new InputStreamReader(
          new ByteArrayInputStream(input.getBytes()), 
          Charset.forName(encoding)))
        .readLine();
}

Note that we call input.getBytes() without naming a charset, so the input text is converted to bytes using the default platform encoding.

If we run this method with input as “The façade pattern is a software design pattern.” and encoding as “US-ASCII”, it’ll output:

The fa��ade pattern is a software design pattern.

Well, not exactly what we expected.

What could have gone wrong? We’ll try to understand and correct this in the rest of this tutorial.

3. Fundamentals

Before digging deeper, though, let’s quickly review three terms: encoding, charsets, and code point.

3.1. Encoding

Computers can only understand binary representations like 1 and 0. Processing anything else requires some kind of mapping from the real-world text to its binary representation. This mapping is what we know as character encoding or simply just as encoding.

For example, the first letter in our message, “T”, in US-ASCII encodes to “01010100”.

3.2. Charsets

Encodings differ greatly in which characters they cover. The number of characters included in a mapping can vary from only a few to all the characters in practical use. The set of characters that are included in a mapping definition is formally called a charset.

For example, ASCII has a charset of 128 characters.
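
As a rough check of that figure, the sketch below asks Java's US-ASCII encoder which of the 65,536 possible char values it can encode. The class and method names are made up for this illustration and are not part of the original examples.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the article's code.
class AsciiCharsetSize {

    // Counts how many of the 65,536 char values the US-ASCII encoder accepts.
    static int countEncodableChars() {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        int count = 0;
        // An int loop variable avoids overflowing char at Character.MAX_VALUE.
        for (int c = Character.MIN_VALUE; c <= Character.MAX_VALUE; c++) {
            if (encoder.canEncode((char) c)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countEncodableChars()); // prints 128
    }
}
```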

3.3. Code Point

A code point is an abstraction that separates a character from its actual encoding. A code point is an integer reference to a particular character.

We can represent the integer itself in plain decimal or alternate bases like hexadecimal or octal. We use alternate bases because they make large numbers easier to refer to.

For example, the first letter in our message, T, in Unicode has a code point “U+0054” (or 84 in decimal).
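
The relationship between a character, its code point, and the U+ notation can be sketched with standard String and Character methods. The class and method names here are illustrative only.

```java
// Hypothetical helper class, not part of the article's code.
class CodePointDemo {

    // Formats the first code point of the text in the U+XXXX notation.
    static String toUnicodeNotation(String text) {
        return String.format("U+%04X", text.codePointAt(0));
    }

    // Turns a code point back into the text it stands for.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println("T".codePointAt(0));     // prints 84
        System.out.println(toUnicodeNotation("T")); // prints U+0054
        System.out.println(fromCodePoint(0x54));    // prints T
    }
}
```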

4. Understanding Encoding Schemes

A character encoding can take various forms depending upon the number of characters it encodes.

The number of characters encoded has a direct relationship to the length of each representation which typically is measured as the number of bytes. Having more characters to encode essentially means needing lengthier binary representations.

Let’s go through some of the popular encoding schemes in practice today.

4.1. Single-Byte Encoding

One of the earliest encoding schemes, called ASCII (American Standard Code for Information Interchange), uses a single-byte encoding scheme. This essentially means that each character in ASCII is represented with a seven-bit binary number. This still leaves one bit free in every byte!

ASCII’s 128-character set covers the English alphabet in lower and upper case, digits, and some special and control characters.

Let’s define a simple method in Java to display the binary representation for a character under a particular encoding scheme:

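// Needs java.nio.charset.Charset, java.util.stream.IntStream and java.util.stream.Collectors
String convertToBinary(String input, String encoding) {
    // getBytes(Charset) returns exactly the encoded bytes, with no spare buffer capacity
    byte[] encodedInput = input.getBytes(Charset.forName(encoding));
    return IntStream.range(0, encodedInput.length)
        // mask with 0xFF so that negative byte values map to 0..255
        .mapToObj(i -> Integer.toBinaryString(encodedInput[i] & 0xFF))
        // left-pad each value with zeros to a full eight bits
        .map(e -> String.format("%1$" + Byte.SIZE + "s", e).replace(" ", "0"))
        .collect(Collectors.joining(" "));
}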

Now, character ‘T’ has a code point of 84 in US-ASCII (ASCII is referred to as US-ASCII in Java).

And if we use our utility method, we can see its binary representation:

assertEquals("01010100", convertToBinary("T", "US-ASCII"));

This, as we expected, is the seven-bit binary representation of the character ‘T’, padded with a leading zero to fill the byte.

The original ASCII left the most significant bit of every byte unused. At the same time, ASCII had left quite a lot of characters unrepresented, especially for non-English languages.

This led to an effort to utilize that unused bit and include an additional 128 characters.

Several variations of the ASCII encoding scheme were proposed and adopted over time. These loosely came to be referred to as “ASCII extensions”.

Many of the ASCII extensions had different levels of success, but none of them was good enough for wider adoption, as many characters were still not represented.

One of the more popular ASCII extensions was ISO-8859-1, also referred to as “ISO Latin 1”.
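
To see what the extra bit buys us, the following sketch encodes the accented character ‘ç’ with ISO-8859-1 and with US-ASCII. The class name is hypothetical, and the character is written as the escape \u00E7 so that the result does not depend on how the source file itself is encoded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the article's code.
class Latin1Sketch {

    // Encodes the text and returns each byte as an unsigned value (0..255).
    static int[] unsignedBytes(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        int[] values = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            values[i] = bytes[i] & 0xFF;
        }
        return values;
    }

    public static void main(String[] args) {
        String cCedilla = "\u00E7";
        // ISO-8859-1 uses the eighth bit: one byte with the value 0xE7
        System.out.println(Arrays.toString(
          unsignedBytes(cCedilla, StandardCharsets.ISO_8859_1))); // prints [231]
        // US-ASCII cannot map the character, so getBytes substitutes '?'
        System.out.println(Arrays.toString(
          unsignedBytes(cCedilla, StandardCharsets.US_ASCII)));   // prints [63]
    }
}
```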

4.2. Multi-Byte Encoding

As the need to accommodate more and more characters grew, single-byte encoding schemes like ASCII were not sustainable.

This gave rise to multi-byte encoding schemes which have a much better capacity albeit at the cost of increased space requirements.

BIG5 and SHIFT-JIS are examples of multi-byte character encoding schemes which started to use one as well as two bytes to represent wider charsets. Most of these were created for the need to represent Chinese and similar scripts which have a significantly higher number of characters.

Let’s now call the method convertToBinary with input as ‘語’, a Chinese character, and encoding as “Big5”:

assertEquals("10111011 01111001", convertToBinary("語", "Big5"));

The output above shows that Big5 encoding uses two bytes to represent the character ‘語’.

A comprehensive list of character encodings, along with their aliases, is maintained by the Internet Assigned Numbers Authority (IANA).

5. Unicode

It is not difficult to understand that while encoding is important, decoding is equally vital to make sense of the representations. This is only possible in practice if a consistent or compatible encoding scheme is used widely.

Working with different encoding schemes that had been developed in isolation and used in particular regions became increasingly challenging.

This challenge gave rise to a single encoding standard called Unicode, which has the capacity for every possible character in the world. This includes the characters which are in use and even those which are defunct!

Well, that must require several bytes to store each character? Honestly yes, but Unicode has an ingenious solution.

Unicode as a standard defines code points for every possible character in the world. The code point for character ‘T’ in Unicode is 84 in decimal. We generally refer to this as “U+0054” in Unicode which is nothing but U+ followed by the hexadecimal number.

We use hexadecimal as the base for code points in Unicode as there are 1,114,112 points, which is a pretty large number to communicate conveniently in decimal!
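
The figure of 1,114,112 follows from the range of valid code points, U+0000 to U+10FFFF, which is 17 planes of 65,536 code points each. A minimal sketch using the constants in java.lang.Character (the class name is illustrative):

```java
// Hypothetical helper class, not part of the article's code.
class CodeSpaceSize {

    // Number of code points between Character.MIN_CODE_POINT and MAX_CODE_POINT, inclusive.
    static int codePointCount() {
        return Character.MAX_CODE_POINT - Character.MIN_CODE_POINT + 1;
    }

    public static void main(String[] args) {
        System.out.println(codePointCount()); // prints 1114112
        System.out.println(
          Integer.toHexString(Character.MAX_CODE_POINT).toUpperCase()); // prints 10FFFF
        System.out.println(17 * 65_536); // prints 1114112
    }
}
```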

How these code points are encoded into bits is left to specific encoding schemes within Unicode. We will cover some of these encoding schemes in the sub-sections below.

5.1. UTF-32

UTF-32 is an encoding scheme for Unicode that employs four bytes to represent every code point defined by Unicode. Obviously, it is space inefficient to use four bytes for every character.

Let’s see how a simple character like ‘T’ is represented in UTF-32. We will use the method convertToBinary introduced earlier:

assertEquals("00000000 00000000 00000000 01010100", convertToBinary("T", "UTF-32"));

The output above shows the usage of four bytes to represent the character ‘T’ where the first three bytes are just wasted space.

5.2. UTF-8

UTF-8 is another encoding scheme for Unicode, which uses a variable number of bytes per character. It encodes the ASCII characters in a single byte and uses two, three, or four bytes for all other characters, thus saving space for text that is mostly ASCII.

Let’s again call the method convertToBinary with input as ‘T’ and encoding as “UTF-8”:

assertEquals("01010100", convertToBinary("T", "UTF-8"));

The output is identical to that of ASCII, using just a single byte. In fact, UTF-8 is completely backward compatible with ASCII.

Let’s again call the method convertToBinary with input as ‘語’ and encoding as “UTF-8”:

assertEquals("11101000 10101010 10011110", convertToBinary("語", "UTF-8"));

As we can see here, UTF-8 uses three bytes to represent the character ‘語’. This is known as variable-width encoding.
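
To make the difference between fixed-width and variable-width encoding concrete, this sketch compares the encoded length of four sample characters in UTF-8 and UTF-32. The class name and the choice of samples are ours; the characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the article's code.
class EncodedLengths {

    // Number of bytes the text occupies in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the text occupies in UTF-32.
    static int utf32Length(String text) {
        return text.getBytes(Charset.forName("UTF-32")).length;
    }

    public static void main(String[] args) {
        String[] samples = {
            "T",            // U+0054, an ASCII letter
            "\u00E7",       // U+00E7, c with cedilla
            "\u8A9E",       // U+8A9E, the character used in the examples above
            "\uD83D\uDE00"  // U+1F600, an emoji outside the Basic Multilingual Plane
        };
        for (String sample : samples) {
            // prints "1 vs 4", "2 vs 4", "3 vs 4", "4 vs 4"
            System.out.println(utf8Length(sample) + " vs " + utf32Length(sample));
        }
    }
}
```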

UTF-8, due to its space efficiency, is the most common encoding used on the web.

6. Encoding Support in Java

Java supports a wide array of encodings and their conversions to each other. The class Charset defines a set of standard encodings which every implementation of Java platform is mandated to support.

This includes US-ASCII, ISO-8859-1, UTF-8, and UTF-16 to name a few. A particular implementation of Java may optionally support additional encodings.
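
As an illustration, the sketch below checks the six standard charset names listed in the Charset documentation and counts everything else the running JVM offers. The class name is hypothetical, and the number of available charsets varies between Java implementations, so no particular value is implied.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the article's code.
class StandardCharsetCheck {

    // The names every Java platform implementation is required to support.
    static final List<String> STANDARD_NAMES = Arrays.asList(
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16");

    // True if the running JVM supports every standard charset name.
    static boolean allSupported() {
        return STANDARD_NAMES.stream().allMatch(Charset::isSupported);
    }

    public static void main(String[] args) {
        System.out.println(allSupported()); // prints true
        // Lookup by name returns the same charset as the StandardCharsets constant
        System.out.println(Charset.forName("UTF-8").equals(StandardCharsets.UTF_8)); // prints true
        // Implementation-specific: standard charsets plus any optional ones
        System.out.println(Charset.availableCharsets().size());
    }
}
```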

There are some subtleties in the way Java picks up a charset to work with. Let’s go through them in more detail.

6.1. Default Charset

The Java platform depends heavily on a property called the default charset. The Java Virtual Machine (JVM) determines the default charset during start-up.

This is dependent on the locale and the charset of the underlying operating system on which the JVM is running. For example, on macOS, the default charset is UTF-8.

Let’s see how we can determine the default charset:

Charset.defaultCharset().displayName();

If we run this code snippet on a Windows machine, the output we get is:

windows-1252

Now, “windows-1252” is the default charset of the Windows platform in English, which in this case has determined the default charset of JVM which is running on Windows.

6.2. Who Uses the Default Charset?

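Many of the Java APIs make use of the default charset as determined by the JVM. To name a few: the InputStreamReader and OutputStreamWriter constructors that take no charset, String.getBytes() without arguments, and the String(byte[]) constructor.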

So, this means that if we’d run our example without specifying the charset:

new BufferedReader(new InputStreamReader(new ByteArrayInputStream(input.getBytes()))).readLine();

then it would use the default charset to decode it.

And there are several APIs that make this same choice by default.
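
Two such APIs are the no-argument String.getBytes() and the String(byte[]) constructor. The following sketch, with made-up class and method names, shows that both behave exactly as if Charset.defaultCharset() had been passed explicitly, whatever that charset happens to be on the machine running it.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper class, not part of the article's code.
class DefaultCharsetUsers {

    // True if getBytes() produces the same bytes as getBytes(defaultCharset).
    static boolean getBytesUsesDefault(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(Charset.defaultCharset()));
    }

    // True if new String(bytes) decodes the same way as new String(bytes, defaultCharset).
    static boolean newStringUsesDefault(byte[] bytes) {
        return new String(bytes).equals(new String(bytes, Charset.defaultCharset()));
    }

    public static void main(String[] args) {
        String text = "The fa\u00E7ade pattern";
        System.out.println(Charset.defaultCharset());              // platform dependent
        System.out.println(getBytesUsesDefault(text));             // prints true
        System.out.println(newStringUsesDefault(text.getBytes())); // prints true
    }
}
```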

The default charset hence assumes an importance which we cannot safely ignore.

6.3. Problems With the Default Charset

As we have seen, the default charset in Java is determined dynamically when the JVM starts. This makes the platform less reliable and more error-prone when used across different operating systems.

For example, if we run

new BufferedReader(new InputStreamReader(new ByteArrayInputStream(input.getBytes()))).readLine();

on macOS, it will use UTF-8.

If we try the same snippet on Windows, it will use Windows-1252 to decode the same text.

Or, imagine writing a file on macOS, and then reading that same file on Windows.

It’s not difficult to understand that because of different encoding schemes, this may lead to data loss or corruption.
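
We can simulate that scenario on a single machine by naming both charsets explicitly. In this sketch the writer uses UTF-8 and the reader uses ISO-8859-1, which we use as a stand-in for windows-1252 because every JVM is guaranteed to support it and both map the two bytes involved to the same characters. The class and method names are illustrative.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the article's code.
class MismatchedCharsets {

    // Encodes the text with one charset and decodes the bytes with another.
    static String writeThenRead(String text, Charset writer, Charset reader) {
        return new String(text.getBytes(writer), reader);
    }

    public static void main(String[] args) {
        String original = "fa\u00E7ade";
        // UTF-8 turns the c-cedilla into the bytes 0xC3 0xA7, which ISO-8859-1 reads as two characters
        String garbled = writeThenRead(
          original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled);                  // prints faÃ§ade on a console that can display it
        System.out.println(garbled.equals(original)); // prints false
    }
}
```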

6.4. Can We Override the Default Charset?

The determination of the default charset in Java leads to two system properties:

  • file.encoding: The value of this system property is the name of the default charset
  • sun.jnu.encoding: The value of this system property is the name of the charset used when encoding/decoding file paths

Now, it’s intuitive to override these system properties through command line arguments:

-Dfile.encoding="UTF-8"
-Dsun.jnu.encoding="UTF-8"

However, it is important to note that these properties are read-only in Java. Their usage as above is not present in the documentation. Overriding these system properties may not have desired or predictable behavior.

Hence, we should avoid overriding the default charset in Java.

6.5. Why Is Java Not Solving This?

There is a JDK Enhancement Proposal (JEP) which prescribes using “UTF-8” as the default charset in Java instead of basing it on locale and operating system charset.

When this article was written, the JEP was in a draft state. It has since been delivered as JEP 400 (UTF-8 by Default) in Java 18, so on Java 18 and later most of the issues we discussed earlier no longer arise by default.

Note that the newer APIs like those in java.nio.file.Files do not use the default charset. The methods in these APIs read or write character streams with charset as UTF-8 rather than the default charset.
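
A small sketch of this behavior: we write a line with Files.write without naming a charset and compare the raw file content with the UTF-8 encoding of the same text. The class and method names, as well as the temporary file prefix, are made up for the illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Collections;

// Hypothetical helper class, not part of the article's code.
class FilesCharsetSketch {

    // Writes the line with Files.write (no charset named) and returns the raw file content.
    static byte[] rawBytesWritten(String line) {
        try {
            Path file = Files.createTempFile("charset-sketch", ".txt");
            try {
                Files.write(file, Collections.singletonList(line));
                return Files.readAllBytes(file);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // True if the raw content is the UTF-8 form of the line followed by the line separator.
    static boolean writesUtf8(String line) {
        byte[] expected = (line + System.lineSeparator()).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(expected, rawBytesWritten(line));
    }

    public static void main(String[] args) {
        System.out.println(writesUtf8("The fa\u00E7ade pattern")); // prints true
    }
}
```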

6.6. Solving This Problem in Our Programs

We should normally choose to specify a charset when dealing with text instead of relying on the default settings. We can explicitly declare the encoding we want to use in classes which deal with character-to-byte conversions.

Luckily, our example is already specifying the charset. We just need to select the right one and let Java do the rest.

We should realize by now that accented characters like ‘ç’ are not present in the encoding schema ASCII and hence we need an encoding which includes them. Perhaps, UTF-8?

Let’s try that. We’ll now run the method decodeText with the same input but with the encoding set to “UTF-8”:

The façade pattern is a software design pattern.

Bingo! We can see the output we were hoping to see now.

Here we have set the encoding we think best suits our need in the constructor of InputStreamReader. This is usually the safest method of dealing with characters and byte conversions in Java.

Similarly, OutputStreamWriter and many other APIs support setting an encoding scheme through their constructor.
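
A minimal round trip along these lines might look as follows, with the charset named on both the writing and the reading side. The class and method names are hypothetical, and the in-memory streams merely stand in for files or sockets.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the article's code.
class ExplicitCharsetStreams {

    // Encodes the text through an OutputStreamWriter that is given the charset explicitly.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray(); // the writer is closed and flushed at this point
    }

    // Decodes the first line through an InputStreamReader that is given the charset explicitly.
    static String read(byte[] bytes, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
               new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "The fa\u00E7ade pattern";
        byte[] encoded = write(text, StandardCharsets.UTF_8);
        System.out.println(encoded.length - text.length());                   // prints 1
        System.out.println(read(encoded, StandardCharsets.UTF_8).equals(text)); // prints true
    }
}
```

Because neither side falls back to the default charset, the result is the same on every platform.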

6.7. MalformedInputException

When we decode a byte sequence, it may turn out not to be legal for the given Charset, or the result may not be legal sixteen-bit Unicode. In other words, the given byte sequence has no mapping in the specified Charset.

There are three predefined strategies (or CodingErrorAction) when the input sequence has malformed input:

  • IGNORE will ignore malformed characters and resume coding operation
  • REPLACE will replace the malformed characters in the output buffer and resume the coding operation
  • REPORT will throw a MalformedInputException

The default malformedInputAction for the CharsetDecoder is REPORT, and the default malformedInputAction of the default decoder in InputStreamReader is REPLACE.

Let’s define a decoding function that receives a specified Charset, a CodingErrorAction type, and a string to be decoded:

String decodeText(String input, Charset charset, 
  CodingErrorAction codingErrorAction) throws IOException {
    CharsetDecoder charsetDecoder = charset.newDecoder();
    charsetDecoder.onMalformedInput(codingErrorAction);
    return new BufferedReader(
      new InputStreamReader(
        new ByteArrayInputStream(input.getBytes()), charsetDecoder)).readLine();
}

So, if we decode “The façade pattern is a software design pattern.” with US_ASCII, the output for each strategy would be different. First, we use CodingErrorAction.IGNORE which skips illegal characters:

Assertions.assertEquals(
  "The faade pattern is a software design pattern.",
  CharacterEncodingExamples.decodeText(
    "The façade pattern is a software design pattern.",
    StandardCharsets.US_ASCII,
    CodingErrorAction.IGNORE));

For the second test, we use CodingErrorAction.REPLACE that puts � instead of the illegal characters:

Assertions.assertEquals(
  "The fa��ade pattern is a software design pattern.",
  CharacterEncodingExamples.decodeText(
    "The façade pattern is a software design pattern.",
    StandardCharsets.US_ASCII,
    CodingErrorAction.REPLACE));

For the third test, we use CodingErrorAction.REPORT which leads to throwing MalformedInputException:

Assertions.assertThrows(
  MalformedInputException.class,
    () -> CharacterEncodingExamples.decodeText(
      "The façade pattern is a software design pattern.",
      StandardCharsets.US_ASCII,
      CodingErrorAction.REPORT));
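
The tests above go through input.getBytes(), so their exact output depends on the default charset of the machine that runs them. As a platform-independent variant, the sketch below feeds fixed bytes (the UTF-8 form of “fa” followed by ‘ç’) straight into a US-ASCII CharsetDecoder. The class and method names are ours, and wrapping the checked exception in UncheckedIOException is just a convenience for the example.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the article's code.
class ErrorActionSketch {

    // The letters 'f' and 'a' followed by 0xC3 0xA7, the UTF-8 bytes of c-cedilla.
    static final byte[] INPUT = { 'f', 'a', (byte) 0xC3, (byte) 0xA7 };

    // Decodes the bytes as US-ASCII, applying the given action to malformed input.
    static String decodeAscii(byte[] bytes, CodingErrorAction action) {
        try {
            return StandardCharsets.US_ASCII.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
        } catch (CharacterCodingException e) {
            // MalformedInputException is a CharacterCodingException, which is an IOException
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeAscii(INPUT, CodingErrorAction.IGNORE));  // prints fa
        // REPLACE yields "fa" followed by two U+FFFD replacement characters
        System.out.println(decodeAscii(INPUT, CodingErrorAction.REPLACE));
        try {
            decodeAscii(INPUT, CodingErrorAction.REPORT);
        } catch (UncheckedIOException e) {
            System.out.println(e.getCause().getClass().getSimpleName()); // prints MalformedInputException
        }
    }
}
```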

7. Other Places Where Encoding Is Important

We don’t just need to consider character encoding while programming. Text can become irreversibly corrupted in many other places.

The most common cause of problems in these cases is the conversion of text from one encoding scheme to another, thereby possibly introducing data loss.

Let’s quickly go through a few places where we may encounter issues when encoding or decoding text.

7.1. Text Editors

In most cases, a text editor is where text originates. There are numerous popular text editors, including vi, Notepad, and MS Word. Most of them allow us to select the encoding scheme. Hence, we should always make sure the selected scheme is appropriate for the text we are handling.

7.2. File System

After we create texts in an editor, we need to store them in some file system. The file system depends on the operating system on which it is running. Most operating systems have inherent support for multiple encoding schemes. However, there may still be cases where an encoding conversion leads to data loss.

7.3. Network

Transferring text over a network using a protocol like File Transfer Protocol (FTP) can also involve conversion between character encodings. For anything encoded in Unicode, it’s safest to transfer it as binary to minimize the risk of loss in conversion. However, transferring text over a network is one of the less frequent causes of data corruption.

7.4. Databases

Most of the popular databases like Oracle and MySQL support the choice of the character encoding scheme at the installation or creation of databases. We must choose this in accordance with the texts we expect to store in the database. This is one of the more frequent places where the corruption of text data happens due to encoding conversions.

7.5. Browsers

Finally, in most web applications, we create texts and pass them through different layers with the intention to view them in a user interface, like a browser. Here as well, it is imperative for us to choose the right character encoding, one which can display the characters properly. Most popular browsers, like Chrome and Edge, allow choosing the character encoding through their settings.

8. Conclusion

In this article, we discussed how encoding can be an issue while programming.

We further discussed the fundamentals including encoding and charsets. Moreover, we went through different encoding schemes and their uses.

We also looked at an example of incorrect character encoding usage in Java and saw how to get that right. Finally, we discussed some other common error scenarios related to character encoding.

As always, the code for the examples is available over on GitHub.
