1. Overview
Many alphabets contain accent and diacritical marks. To search or index data reliably, we might want to convert a string with diacritics to a string containing only ASCII characters. Unicode defines a text normalization procedure that helps do this.
In this tutorial, we’ll see what Unicode text normalization is, how we can use it to remove diacritical marks, and the pitfalls to watch out for. Then, we will see some examples using the Java Normalizer class and Apache Commons StringUtils.
2. The Problem at a Glance
Let’s say that we are working with text containing a range of diacritical marks that we want to remove:
āăąēîïĩíĝġńñšŝśûůŷ
After reading this article, we’ll know how to get rid of diacritics and end up with:
aaaeiiiiggnnsssuuy
3. Unicode Fundamentals
Before jumping straight into code, let’s learn some Unicode basics.
To represent a character with a diacritical or accent mark, Unicode can use different sequences of code points. The reason for that is historical compatibility with older character sets.
Unicode normalization is the conversion of text to one of the normalization forms defined by the standard, for example by decomposing characters according to an equivalence form.
3.1. Unicode Equivalence Forms
To compare sequences of code points, Unicode defines two kinds of equivalence: canonical equivalence and compatibility equivalence.
Canonically equivalent sequences have the same appearance and meaning when displayed. For example, the letter “ś” (Latin letter “s” with acute) can be represented with one code point, U+015B, or with two code points, U+0073 (Latin letter “s”) and U+0301 (combining acute accent).
On the other hand, compatible sequences can have distinct appearances but the same meaning in some contexts. For instance, the code point U+013F (Latin capital letter “Ŀ”, an L with a middle dot) is compatible with the sequence U+004C (Latin letter “L”) and U+00B7 (middle dot “·”). Moreover, some fonts show the middle dot inside the L and some show it after the L.
Canonically equivalent sequences are also compatible, but the opposite is not always true.
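As a minimal sketch of the two kinds of equivalence, the following compares strings after canonical (NFD) and compatibility (NFKD) decomposition. The class and method names are made up for this illustration; only java.text.Normalizer comes from the JDK.

```java
import java.text.Normalizer;

class EquivalenceDemo {

    // True when both strings have the same canonical decomposition (NFD).
    static boolean canonicallyEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFD));
    }

    // True when both strings have the same compatibility decomposition (NFKD).
    static boolean compatibilityEquivalent(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFKD)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFKD));
    }

    public static void main(String[] args) {
        String precomposed = "\u015B"; // s with acute as a single code point
        String decomposed = "s\u0301"; // s followed by the combining acute accent
        System.out.println(precomposed.equals(decomposed));                 // false
        System.out.println(canonicallyEquivalent(precomposed, decomposed)); // true

        String lWithDot = "\u013F";    // L with middle dot as a single code point
        String lThenDot = "L\u00B7";   // L followed by a middle dot
        System.out.println(canonicallyEquivalent(lWithDot, lThenDot));      // false
        System.out.println(compatibilityEquivalent(lWithDot, lThenDot));    // true
    }
}
```

A plain String.equals() call compares code units, so the two spellings of “ś” only match once both sides are normalized to the same form.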
3.2. Character Decomposition
Character decomposition replaces a composite character with the code point of its base letter, followed by combining characters (according to the equivalence form). For example, this procedure will decompose the letter “ā” (U+0101) into the character “a” (U+0061) and the combining macron (U+0304).
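To make the decomposition visible, here is a small sketch that lists the code points produced by canonical decomposition. The helper name decomposedCodePoints is hypothetical and exists only for this example.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

class DecompositionDemo {

    // Lists the code points of the canonical decomposition (NFD) in U+XXXX notation.
    static String decomposedCodePoints(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD)
                .codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(decomposedCodePoints("\u0101")); // U+0061 U+0304
        System.out.println(decomposedCodePoints("\u015B")); // U+0073 U+0301
    }
}
```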
3.3. Matching Diacritical and Accent Marks
Once we have separated the base character from the diacritical mark, we must create an expression matching unwanted characters. We can use either a character block or a category.
The most popular Unicode block for this purpose is Combining Diacritical Marks. It is not very large and contains just the 112 most common combining characters. On the other hand, we can also use the Unicode category Mark. It consists of code points that are combining marks and is further divided into three subcategories:
- Nonspacing_Mark: contains 1,839 code points
- Enclosing_Mark: contains 13 code points
- Spacing_Combining_Mark: contains 443 code points
The major difference between a Unicode character block and a category is that a character block is a contiguous range of code points. A category, on the other hand, can span many character blocks. Combining Diacritical Marks is a case in point: all code points belonging to this block are also included in the Nonspacing_Mark category.
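The difference between the two matching strategies can be sketched with Java regular expressions. The class and helper names are assumptions made for this example. U+20DD (combining enclosing circle) is picked as a mark that lies outside the Combining Diacritical Marks block.

```java
import java.util.regex.Pattern;

class MarkMatchingDemo {

    // Matches code points in the Combining Diacritical Marks block (U+0300 to U+036F).
    static final Pattern BLOCK = Pattern.compile("\\p{InCombiningDiacriticalMarks}");
    // Matches any code point in the general category Mark (Mn, Me or Mc).
    static final Pattern CATEGORY = Pattern.compile("\\p{M}");

    static boolean inBlock(int codePoint) {
        return BLOCK.matcher(new String(Character.toChars(codePoint))).matches();
    }

    static boolean inCategory(int codePoint) {
        return CATEGORY.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        int acute = 0x0301;           // combining acute accent, inside the block
        int enclosingCircle = 0x20DD; // combining enclosing circle, outside the block
        System.out.println(inBlock(acute) + " " + inCategory(acute));
        System.out.println(inBlock(enclosingCircle) + " " + inCategory(enclosingCircle));
        // Character.getType() exposes the subcategory of a code point
        System.out.println(Character.getType(acute) == Character.NON_SPACING_MARK);
        System.out.println(Character.getType(enclosingCircle) == Character.ENCLOSING_MARK);
    }
}
```

The block pattern misses the enclosing circle, while the category pattern catches both marks, which is why the category is the broader choice.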
4. Algorithm
Now that we understand the basic Unicode terms, we can plan the algorithm to remove diacritical marks from a String.
First, we will separate base characters from accent and diacritical marks using the Normalizer class. We will perform the compatibility decomposition, represented by the Java enum constant Normalizer.Form.NFKD. We use compatibility decomposition because it decomposes more ligatures than the canonical method (for example, the ligature “ﬁ”, U+FB01).
Second, we will remove all characters matching the Unicode Mark category using the \p{M} regular expression. We pick this category because it offers the broadest range of marks.
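The reason for preferring compatibility decomposition can be illustrated with a short sketch that runs both forms on the ligature U+FB01. The class and method names here are hypothetical.

```java
import java.text.Normalizer;

class LigatureDemo {

    // Canonical decomposition
    static String nfd(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD);
    }

    // Compatibility decomposition
    static String nfkd(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKD);
    }

    public static void main(String[] args) {
        String ligature = "\uFB01"; // the fi ligature as a single code point
        System.out.println(nfd(ligature).equals(ligature)); // true: NFD leaves it untouched
        System.out.println(nfkd(ligature).equals("fi"));    // true: NFKD yields two ASCII letters
    }
}
```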
5. Using Core Java
Let’s start with some examples using core Java.
5.1. Check if a String Is Normalized
Before we perform a normalization, we might want to check that the String isn’t already normalized:
assertFalse(Normalizer.isNormalized("āăąēîïĩíĝġńñšŝśûůŷ", Normalizer.Form.NFKD));
5.2. String Decomposition
If our String is not normalized, we proceed to the next step. To separate base characters from diacritical marks, we will perform Unicode text normalization using compatibility decomposition:
private static String normalize(String input) {
    return input == null ? null : Normalizer.normalize(input, Normalizer.Form.NFKD);
}
After this step, both letters “â” and “ä” will be reduced to “a” followed by respective diacritical marks.
5.3. Removal of Code Points Representing Diacritical and Accent Marks
Once we have decomposed our String, we want to remove unwanted code points. Therefore, we will use the Unicode regular expression \p{M}:
static String removeAccents(String input) {
    // normalize() passes null through, so guard before calling replaceAll()
    String normalized = normalize(input);
    return normalized == null ? null : normalized.replaceAll("\\p{M}", "");
}
5.4. Tests
Let’s see how our decomposition works in practice. First, let’s pick characters that have a decomposition mapping defined by Unicode and expect all diacritical marks to be removed:
@Test
void givenStringWithDecomposableUnicodeCharacters_whenRemoveAccents_thenReturnASCIIString() {
    assertEquals("aaaeiiiiggnnsssuuy", StringNormalizer.removeAccents("āăąēîïĩíĝġńñšŝśûůŷ"));
}
Second, let’s pick a few characters without a decomposition mapping:
@Test
void givenStringWithNondecomposableUnicodeCharacters_whenRemoveAccents_thenReturnOriginalString() {
    assertEquals("łđħœ", StringNormalizer.removeAccents("łđħœ"));
}
As expected, our method was unable to decompose them.
Additionally, we can create a test to validate the hex codes of the decomposed characters:
@Test
void givenStringWithDecomposableUnicodeCharacters_whenUnicodeValueOfNormalizedString_thenReturnUnicodeValue() {
    // "\uFB01" is the single-code-point ligature, which NFKD splits into "f" and "i"
    assertEquals("\\u0066 \\u0069", StringNormalizer.unicodeValueOfNormalizedString("\uFB01"));
    assertEquals("\\u0061 \\u0304", StringNormalizer.unicodeValueOfNormalizedString("ā"));
    assertEquals("\\u0069 \\u0308", StringNormalizer.unicodeValueOfNormalizedString("ï"));
    assertEquals("\\u006e \\u0301", StringNormalizer.unicodeValueOfNormalizedString("ń"));
}
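The article does not show the unicodeValueOfNormalizedString() helper used by this test. A possible implementation, offered only as a sketch that reproduces the expected values above, normalizes with NFKD and prints every UTF-16 code unit as a lowercase four-digit hex escape. The class name StringNormalizerSketch is an assumption.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

class StringNormalizerSketch {

    // Hypothetical helper: NFKD-normalizes the input and renders each UTF-16
    // code unit as a backslash, the letter u and four lowercase hex digits,
    // separated by single spaces.
    static String unicodeValueOfNormalizedString(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFKD)
                .chars()
                .mapToObj(c -> String.format("\\u%04x", c))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // a with macron decomposes into a (0061) and the combining macron (0304)
        System.out.println(unicodeValueOfNormalizedString("\u0101"));
        // the fi ligature decomposes into f (0066) and i (0069)
        System.out.println(unicodeValueOfNormalizedString("\uFB01"));
    }
}
```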
5.5. Compare Strings Including Accents Using Collator
The java.text package includes another interesting class: Collator. It enables us to perform locale-sensitive String comparisons. An important configuration property is the Collator’s strength. This property defines the minimum level of difference considered significant during a comparison.
Java provides four strength values for a Collator:
- PRIMARY: comparison omitting case and accents
- SECONDARY: comparison omitting case but including accents and diacritics
- TERTIARY: comparison including case and accents
- IDENTICAL: all differences are significant
Let’s check some examples, first with primary strength:
Collator collator = Collator.getInstance();
collator.setDecomposition(Collator.FULL_DECOMPOSITION);
collator.setStrength(Collator.PRIMARY);
assertEquals(0, collator.compare("a", "a"));
assertEquals(0, collator.compare("ä", "a"));
assertEquals(0, collator.compare("A", "a"));
assertEquals(1, collator.compare("b", "a"));
Secondary strength turns on accent sensitivity:
collator.setStrength(Collator.SECONDARY);
assertEquals(1, collator.compare("ä", "a"));
assertEquals(1, collator.compare("b", "a"));
assertEquals(0, collator.compare("A", "a"));
assertEquals(0, collator.compare("a", "a"));
Tertiary strength includes case. The control characters in the last assertion still compare as equal at this level:
collator.setStrength(Collator.TERTIARY);
assertEquals(1, collator.compare("A", "a"));
assertEquals(1, collator.compare("ä", "a"));
assertEquals(1, collator.compare("b", "a"));
assertEquals(0, collator.compare("a", "a"));
assertEquals(0, collator.compare(valueOf(toChars(0x0001)), valueOf(toChars(0x0002))));
Identical strength makes all differences significant. The penultimate assertion is interesting: we can detect the difference between the Unicode control code points U+0001 (“Start of Heading”) and U+0002 (“Start of Text”):
collator.setStrength(Collator.IDENTICAL);
assertEquals(1, collator.compare("A", "a"));
assertEquals(1, collator.compare("ä", "a"));
assertEquals(1, collator.compare("b", "a"));
assertEquals(-1, collator.compare(valueOf(toChars(0x0001)), valueOf(toChars(0x0002))));
assertEquals(0, collator.compare("a", "a"));
One last example worth mentioning shows that if a character doesn’t have a defined decomposition rule, it won’t be considered equal to another character with the same base letter. This is because Collator won’t be able to perform the Unicode decomposition:
collator.setStrength(Collator.PRIMARY);
assertEquals(1, collator.compare("ł", "l"));
assertEquals(1, collator.compare("ø", "o"));
6. Using Apache Commons StringUtils
Now that we’ve seen how to use core Java to remove accents, we’ll check what Apache Commons Lang offers. As we’ll soon learn, it’s easier to use, but we have less control over the decomposition process. Under the hood, it uses the Normalizer.normalize() method with the NFD decomposition form and the \p{InCombiningDiacriticalMarks} regular expression:
static String removeAccentsWithApacheCommons(String input) {
    return StringUtils.stripAccents(input);
}
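To make the "under the hood" description concrete without adding the library, here is a rough core-Java approximation. It is not the library’s actual source. It combines NFD decomposition, block-based mark removal, and a hand-written rule for the L with stroke, and the name stripAccentsLike is invented for this sketch.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

class StripAccentsSketch {

    private static final Pattern DIACRITICS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Approximation only: canonical decomposition, a manual rule for the
    // upper- and lowercase L with stroke, then removal of marks from one block.
    static String stripAccentsLike(String input) {
        if (input == null) {
            return null;
        }
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replace('\u0141', 'L')
                .replace('\u0142', 'l');
        return DIACRITICS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripAccentsLike("\u0101\u015B\u00F1")); // accents removed
        System.out.println(stripAccentsLike("\u0142\u0111\u0127\u0153")); // only the l with stroke changes
    }
}
```

Because this sketch uses NFD rather than NFKD, a ligature such as U+FB01 passes through unchanged, which is one way the trade-off in control shows up.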
6.1. Tests
Let’s see this method in practice, first with only decomposable Unicode characters:
@Test
void givenStringWithDecomposableUnicodeCharacters_whenRemoveAccentsWithApacheCommons_thenReturnASCIIString() {
    assertEquals("aaaeiiiiggnnsssuuy", StringNormalizer.removeAccentsWithApacheCommons("āăąēîïĩíĝġńñšŝśûůŷ"));
}
As expected, we got rid of all the accents.
Let’s try a string containing a ligature and letters with a stroke:
@Test
void givenStringWithNondecomposableUnicodeCharacters_whenRemoveAccentsWithApacheCommons_thenReturnModifiedString() {
    assertEquals("lđħœ", StringNormalizer.removeAccentsWithApacheCommons("łđħœ"));
}
As we can see, the StringUtils.stripAccents() method manually defines the translation rule for the Latin ł and Ł characters. Unfortunately, it doesn’t convert the other letters with a stroke or the ligature.
7. Limitations of Character Decomposition in Java
To sum up, we saw that some characters do not have defined decomposition rules. More specifically, Unicode doesn’t define decomposition rules for some ligatures, such as “œ”, or for letters with a stroke, such as “ł”. Because of that, Java won’t be able to normalize them, either. If we want to get rid of these characters, we have to define the transcription mapping manually.
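A manual transcription mapping could look like the following sketch. The table is hypothetical and deliberately incomplete. It covers only a handful of characters and would need extending for the alphabets an application actually handles.

```java
import java.text.Normalizer;
import java.util.Map;

class ManualTransliteration {

    // Hypothetical mapping for characters without a Unicode decomposition:
    // l/L with stroke, d with stroke, h with stroke, o with stroke, oe ligature.
    private static final Map<Character, String> REPLACEMENTS = Map.of(
            '\u0142', "l", '\u0141', "L",
            '\u0111', "d", '\u0127', "h",
            '\u00F8', "o", '\u0153', "oe");

    static String toAscii(String input) {
        if (input == null) {
            return null;
        }
        // Step 1: apply the manual mapping character by character
        StringBuilder mapped = new StringBuilder(input.length());
        for (char c : input.toCharArray()) {
            String replacement = REPLACEMENTS.get(c);
            mapped.append(replacement != null ? replacement : String.valueOf(c));
        }
        // Step 2: decompose what Unicode can decompose and drop the marks
        return Normalizer.normalize(mapped, Normalizer.Form.NFKD).replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u0142\u0111\u0127\u0153")); // ldhoe
        System.out.println(toAscii("\u0141\u00F3d\u017A"));      // Lodz
    }
}
```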
Finally, it’s worth considering whether we need to get rid of accents and diacritics at all. For some languages, a letter stripped of its diacritical marks won’t make much sense. In such cases, a better idea is to use the Collator class and compare two Strings, taking locale information into account.
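As a sketch of that alternative, the helper below compares two strings with a locale-specific Collator at primary strength, leaving the original text untouched. The method name sameBaseLetters and the choice of Locale.ENGLISH are assumptions for illustration.

```java
import java.text.Collator;
import java.util.Locale;

class AccentInsensitiveCompare {

    // Compares two strings while ignoring case and accents, following the
    // collation rules of the given locale.
    static boolean sameBaseLetters(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setDecomposition(Collator.FULL_DECOMPOSITION);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "resume" written with and without acute accents
        System.out.println(sameBaseLetters("r\u00E9sum\u00E9", "resume", Locale.ENGLISH));
        // different base letters are still different
        System.out.println(sameBaseLetters("r\u00E9sum\u00E9", "resumes", Locale.ENGLISH));
    }
}
```

The stored text keeps its diacritics, and only the comparison treats accented and unaccented letters as equal.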
8. Conclusion
In this article, we looked into removing accents and diacritical marks using core Java and the popular Java utility library, Apache Commons. We also saw a few examples and learned how to compare text containing accents, as well as a few things to watch out for when working with text containing accents.
As always, the full source code of the article is available over on GitHub.