1. Introduction
Non-printable Unicode characters are control characters, formatting marks, and other invisible code points that can appear in text but aren’t meant to be displayed. These characters can cause problems when we process, display, or store text, so it’s important to have ways to replace or remove them as required.
In this tutorial, we’ll look at different ways to replace them.
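To make the problem concrete, here is a minimal sketch of two strings that typically render the same on screen but aren’t equal, because one contains a zero-width space (U+200B). The class name InvisibleCharDemo and the helper describe() are hypothetical and only serve this illustration:

```java
final class InvisibleCharDemo {

    // Hypothetical helper: lists every code point of the text as a U+XXXX value.
    static String describe(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X", cp));
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String visible = "Java";
        String withHidden = "Ja\u200Bva"; // contains a zero-width space

        System.out.println(visible.equals(withHidden)); // false
        System.out.println(withHidden.length());        // 5
        System.out.println(describe(withHidden));       // U+004A U+0061 U+200B U+0076 U+0061
    }
}
```

Listing the code points this way is a simple means of spotting hidden characters before deciding how to replace them.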
2. Using Regular Expressions
Java’s String class offers powerful methods for manipulating text, and regular expressions provide a concise way to match and replace patterns in strings. We can use a simple pattern to find and remove non-printable Unicode characters as follows:
// Uses java.util.regex.Pattern and java.util.regex.Matcher
@Test
public void givenTextWithNonPrintableChars_whenUsingRegularExpression_thenGetSanitizedText() {
    String originalText = "\n\nWelcome \n\n\n\tto Baeldung!\n\t";
    String expected = "Welcome to Baeldung!";
    // \p{C} matches the Unicode "Other" category (control, format, and similar characters)
    String regex = "\\p{C}";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(originalText);
    // Replace every match with an empty string
    String sanitizedText = matcher.replaceAll("");
    assertEquals(expected, sanitizedText);
}
In this test method, the regular expression \\p{C} matches any character in the Unicode “Other” general category, which includes control characters such as newlines and tabs as well as invisible format characters. We compile the regular expression into a Pattern using the Pattern.compile(regex) method, and then create a Matcher object by calling pattern.matcher(originalText).
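As a rough illustration of what this category covers, the sketch below compares \\p{C} with the narrower \\p{Cc} subcategory, which matches control characters only. The class UnicodeCategoryDemo, its constants, and the strip() helper are hypothetical names chosen for this example:

```java
import java.util.regex.Pattern;

final class UnicodeCategoryDemo {

    // General category "Other": control, format, and similar code points.
    static final Pattern OTHER = Pattern.compile("\\p{C}");

    // Subcategory "Cc": control characters only.
    static final Pattern CONTROL_ONLY = Pattern.compile("\\p{Cc}");

    // Hypothetical helper: removes everything the given pattern matches.
    static String strip(Pattern pattern, String text) {
        return pattern.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Contains a zero-width space (format), a tab and a bell (both control).
        String text = "Wel\u200Bcome\tto\u0007 Baeldung!";

        System.out.println(text.length());                       // 22
        System.out.println(strip(OTHER, text).length());         // 19
        System.out.println(strip(CONTROL_ONLY, text).length());  // 20, the zero-width space remains
    }
}
```

Which pattern fits depends on the input: \\p{Cc} leaves format characters such as the zero-width space in place, while \\p{C} removes them too.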
Then, we call the Matcher.replaceAll() method to replace every matched character with an empty string, which removes it from the source text. Lastly, we compare sanitizedText with the expected string using the assertEquals() method.
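The same idea can be sketched more compactly. The example below is a minimal sketch, not the article’s own code: it uses String.replaceAll() as a shortcut for removal and a precompiled Pattern to replace each run of matched characters with a single space instead of dropping it. The class ReplaceNonPrintable and its method names are assumptions made for this illustration:

```java
import java.util.regex.Pattern;

final class ReplaceNonPrintable {

    // Compiled once and reused; Pattern instances are immutable and safe to share.
    private static final Pattern NON_PRINTABLE_RUN = Pattern.compile("\\p{C}+");

    // Removes every matched character, like the test above.
    static String remove(String text) {
        return text.replaceAll("\\p{C}", "");
    }

    // Replaces each run of matched characters with one space.
    static String replaceWithSpace(String text) {
        return NON_PRINTABLE_RUN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(remove("\n\nWelcome \n\n\n\tto Baeldung!\n\t")); // Welcome to Baeldung!
        System.out.println(replaceWithSpace("Line1\r\nLine2\tEnd"));        // Line1 Line2 End
    }
}
```

Replacing with a space rather than an empty string keeps words separated when a line break or tab was the only thing between them.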
3. Custom Implementation
Another approach is to iterate over the code points of our text and remove non-printable characters based on their numeric values. Let’s take a simple example:
@Test
public void givenTextWithNonPrintableChars_whenCustomImplementation_thenGetSanitizedText() {
    String originalText = "\n\nWelcome \n\n\n\tto Baeldung!\n\t";
    String expected = "Welcome to Baeldung!";
    StringBuilder strBuilder = new StringBuilder();
    originalText.codePoints().forEach((i) -> {
        // Keep everything except the ASCII control characters (0-31) and DEL (127)
        if (i >= 32 && i != 127) {
            strBuilder.append(Character.toChars(i));
        }
    });
    assertEquals(expected, strBuilder.toString());
}
Here, we use originalText.codePoints() and a forEach loop to iterate over the Unicode code points of the original text. The condition skips code points below 32, which are the ASCII control characters, and code point 127, which is the DEL character. Note that this check covers only the ASCII range, so other invisible characters, such as the zero-width space (U+200B), pass through unchanged.
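To see what this condition keeps and drops, it can be pulled out into a small predicate and applied to individual code points. This is only a sketch; the class CodePointRangeCheck and the method isKept() are hypothetical names:

```java
final class CodePointRangeCheck {

    // Same condition as in the test above: keep everything except 0-31 and 127.
    static boolean isKept(int codePoint) {
        return codePoint >= 32 && codePoint != 127;
    }

    public static void main(String[] args) {
        System.out.println(isKept('\t'));   // false: tab is code point 9
        System.out.println(isKept(127));    // false: DEL
        System.out.println(isKept('A'));    // true
        System.out.println(isKept(0x85));   // true, although U+0085 is a control character
        System.out.println(isKept(0x200B)); // true, although the zero-width space is invisible

        // Character.isISOControl() also covers the range 0x7F to 0x9F.
        System.out.println(Character.isISOControl(0x85)); // true
    }
}
```

If the input may contain control characters outside the ASCII range, Character.isISOControl() is one standard-library check that recognizes them.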
We then append the remaining characters to the StringBuilder object using the strBuilder.append(Character.toChars(i)) method. Character.toChars(i) converts each code point back into one or two char values.
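One possible variation on this approach, shown here only as a sketch under stated assumptions, filters by Unicode general category with Character.getType() instead of fixed numeric ranges, and appends with StringBuilder.appendCodePoint(). The class CodePointSanitizer and its methods are hypothetical, and treating unassigned code points as non-printable is a choice made for this example:

```java
final class CodePointSanitizer {

    // Assumption for this sketch: these general categories count as non-printable.
    static boolean isNonPrintable(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:
            case Character.FORMAT:
            case Character.SURROGATE:
            case Character.PRIVATE_USE:
            case Character.UNASSIGNED:
                return true;
            default:
                return false;
        }
    }

    // Keeps only the code points that are not classified as non-printable.
    static String sanitize(String text) {
        return text.codePoints()
            .filter(cp -> !isNonPrintable(cp))
            .collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append)
            .toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("\n\nWelcome \n\n\n\tto Baeldung!\n\t")); // Welcome to Baeldung!
        System.out.println(sanitize("Ja\u200Bva"));                           // Java
    }
}
```

Because the stream works on code points rather than char values, characters outside the Basic Multilingual Plane stay intact as long as their category isn’t filtered out.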
4. Conclusion
In conclusion, this tutorial looked at the problems caused by non-printable Unicode characters in text. We explored two ways to remove them: using regular expressions with Java’s Pattern and Matcher classes, and implementing a custom solution that filters code points.
As always, the complete code samples for this article can be found over on GitHub.