Check if Letter Is Emoji With Java

Last modified: September 21, 2023

1. Overview

Emojis appear in a lot of text that we may need to process in our code. For example, this could be when we’re working with email or instant messaging services.

In this tutorial, we’ll look at several ways to detect emojis in Java applications.

2. How Does Java Represent Emojis?

Every emoji has a unique Unicode value which represents it. Java encodes Unicode characters in Strings using UTF-16.

UTF-16 can encode all Unicode code points. A code point may consist of either one or two code units. If two are needed because the Unicode value is beyond the range we can store in 16 bits, then we call it a surrogate pair.

A surrogate pair is simply two characters (or code units) which when combined represent a single Unicode character (or code point). There is a reserved range of code units for surrogate pairs.

For example, the Skull and Crossbones emoji has the code point “U+2620”, which is stored in a String as “\u2620”. This requires only a single code unit. However, the Bear Face emoji has the code point “U+1F43B”, which is stored in a String as “\uD83D\uDC3B”. This requires two code units because the value is too large for a single 16-bit unit.
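
To make the difference between code units and code points concrete, here is a minimal sketch using only standard JDK methods. The class name CodeUnitDemo and its helper methods are our own, chosen for illustration:

```java
class CodeUnitDemo {

    // Number of UTF-16 code units (Java chars) in the String
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the String
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String skull = "\u2620";       // U+2620, fits in one code unit
        String bear = "\uD83D\uDC3B";  // U+1F43B, needs a surrogate pair

        System.out.println(codeUnits(skull) + " unit(s), " + codePoints(skull) + " code point(s)");
        System.out.println(codeUnits(bear) + " unit(s), " + codePoints(bear) + " code point(s)");

        // The two chars of the bear form a valid high/low surrogate pair
        System.out.println(Character.isSurrogatePair(bear.charAt(0), bear.charAt(1)));

        // Going the other way: build the String from the code point
        System.out.println(new String(Character.toChars(0x1F43B)).equals(bear));
    }
}
```

Note that length() counts code units, so it reports 2 for the single Bear Face character, while codePointCount() reports 1.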

There are extensions to this we’ll look at later but that explains the basics.

3. emoji-java Library

An off-the-shelf solution is to use emoji-java. To use this library in our project, we’ll need to add it as a dependency in our pom.xml:

<dependency>
    <groupId>com.vdurmont</groupId>
    <artifactId>emoji-java</artifactId>
    <version>5.1.1</version>
</dependency>

The latest version is available in the Maven Repository.

It’s simple to use this library to check if a letter is an emoji. It provides the static isEmoji() method in the EmojiManager utility class.

The method takes a single String argument and returns true if the String is an emoji, or else returns false:

@Test
void givenAWord_whenUsingEmojiJava_thenDetectEmoji(){
    boolean emoji = EmojiManager.isEmoji("\uD83D\uDC3B");
    assertTrue(emoji);

    boolean notEmoji = EmojiManager.isEmoji("w");
    assertFalse(notEmoji);
}

We can see from this test that the library has correctly identified the surrogate pair as an emoji. It has also reported that the single letter “w” is not one.

This library has a whole host of other features. So it’s a strong candidate for dealing with emojis in Java.

4. Using Regex

As we discussed earlier, we know roughly what an emoji will look like within a Java String. We also know the potential range of values that are reserved for surrogate pairs. The first code unit will be between U+D800 and U+DBFF, and the second code unit will be between U+DC00 and U+DFFF.

We can use this insight to write a regex for checking if a given String is one of the emojis represented by a surrogate pair. We need to note here that not all surrogate pairs are emojis, so this may give us false positives:

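@Test
void givenAWord_whenUsingRegex_thenDetectEmoji() {
    // Java's regex engine matches by code point, so we describe the range of
    // supplementary code points (U+10000 to U+10FFFF). These are exactly the
    // code points that UTF-16 stores as a surrogate pair.
    String regexPattern = "[\\x{10000}-\\x{10FFFF}]+";
    String emojiString = "\uD83D\uDC3B";
    boolean emoji = emojiString.matches(regexPattern);
    assertTrue(emoji);

    String notEmojiString = "w";
    boolean notEmoji = notEmojiString.matches(regexPattern);
    assertFalse(notEmoji);
}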

However, it’s not always as simple as checking within the expected range. As we already saw, some emojis only use a single code unit. Also, many have modifiers that are appended to the base code point and change the appearance of the emoji. We can also form more complex emojis by combining several emojis with Zero Width Joiner (ZWJ) characters in between them.
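
The sketch below illustrates both kinds of error for a surrogate-pair check. It assumes our own formulation of the check, a pattern written with \x{...} code point escapes that matches any code point stored as a surrogate pair. The class and method names are hypothetical:

```java
import java.util.regex.Pattern;

class SurrogateRegexLimits {

    // Assumption: "looks like an emoji" means "consists only of supplementary
    // code points (U+10000 to U+10FFFF)", i.e. code points stored as surrogate pairs
    private static final Pattern SUPPLEMENTARY = Pattern.compile("[\\x{10000}-\\x{10FFFF}]+");

    static boolean looksLikeEmoji(String s) {
        return SUPPLEMENTARY.matcher(s).matches();
    }

    public static void main(String[] args) {
        String bear = "\uD83D\uDC3B";   // U+1F43B Bear Face, an emoji
        String gClef = "\uD834\uDD1E";  // U+1D11E Musical Symbol G Clef, not an emoji
        String skull = "\u2620";        // U+2620 Skull and Crossbones, an emoji

        System.out.println(looksLikeEmoji(bear));   // correctly detected
        System.out.println(looksLikeEmoji(gClef));  // false positive
        System.out.println(looksLikeEmoji(skull));  // false negative
    }
}
```

The G Clef is stored as a surrogate pair without being an emoji, while the Skull and Crossbones is an emoji without needing a surrogate pair, so the range check gets both wrong.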

A good example of this is the Pirate Flag emoji which we can build using a Waving Black Flag and a Skull and Crossbones with a ZWJ in the middle. With this in mind, it’s clear the regex we’d need is much more complex to be certain we’re capturing all emojis.
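
As a rough illustration, we can spell out the Pirate Flag sequence by hand and list its code points. This sketch assumes the usual form of the sequence, which also ends with the variation selector U+FE0F that requests emoji presentation. The class and method names are our own:

```java
import java.util.stream.Collectors;

class ZwjSequenceDemo {

    // Waving Black Flag (U+1F3F4) + ZWJ (U+200D) + Skull and Crossbones (U+2620)
    // + Variation Selector-16 (U+FE0F)
    static final String PIRATE_FLAG = "\uD83C\uDFF4\u200D\u2620\uFE0F";

    // Lists the code points of a String in U+XXXX notation
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // True if the String contains a Zero Width Joiner
    static boolean containsZwj(String s) {
        return s.indexOf('\u200D') >= 0;
    }

    public static void main(String[] args) {
        System.out.println(describe(PIRATE_FLAG));
        System.out.println("code units: " + PIRATE_FLAG.length());
        System.out.println("contains ZWJ: " + containsZwj(PIRATE_FLAG));
    }
}
```

A single visible emoji here spans four code points and five code units, and only the first code point is stored as a surrogate pair.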

The Unicode Consortium publishes a document listing all current emoji values. We could either write a parser for this document or extract the ranges into our own configuration files. We could then use the results to build our own reliable emoji finder.
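
A minimal sketch of such a parser might look like the following. It assumes a line format of "codepoint-or-range ; property # comment", similar to Unicode's emoji data files. The sample lines are hand-written for illustration rather than copied from the published document, and the class EmojiRangeTable is hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class EmojiRangeTable {

    // Hand-written sample in the assumed format, not the real data file
    static final List<String> SAMPLE = Arrays.asList(
            "# illustrative sample only",
            "2620          ; Emoji           # skull and crossbones",
            "1F43B         ; Emoji           # bear",
            "1F600..1F64F  ; Emoji           # emoticons",
            "1F3FB..1F3FF  ; Emoji_Modifier  # skin tone modifiers");

    // Each entry is an inclusive {start, end} code point range
    private final List<int[]> ranges = new ArrayList<>();

    static EmojiRangeTable parse(List<String> lines) {
        EmojiRangeTable table = new EmojiRangeTable();
        for (String line : lines) {
            // Drop the trailing comment, then skip blank lines
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue;
            }
            // Keep only lines whose property field is exactly "Emoji"
            String[] fields = data.split(";");
            if (fields.length < 2 || !fields[1].trim().equals("Emoji")) {
                continue;
            }
            // The first field is either "XXXX" or "XXXX..YYYY" in hexadecimal
            String[] bounds = fields[0].trim().split("\\.\\.");
            int start = Integer.parseInt(bounds[0], 16);
            int end = bounds.length > 1 ? Integer.parseInt(bounds[1], 16) : start;
            table.ranges.add(new int[] {start, end});
        }
        return table;
    }

    // True if the code point falls inside any parsed range
    boolean isEmoji(int codePoint) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        EmojiRangeTable table = parse(SAMPLE);
        System.out.println(table.isEmoji("\uD83D\uDC3B".codePointAt(0)));
        System.out.println(table.isEmoji('w'));
    }
}
```

This only classifies single code points. Sequences built with modifiers or ZWJ characters would need extra handling on top of it, and the table would need reloading whenever a new version of the data is published.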

5. Conclusion

In this article, we looked at how Java represents emojis in UTF-16, using a surrogate pair whenever a code point doesn’t fit in a single code unit. We can use the emoji-java library in our code to detect them. It offers a simple method to check if a String is an emoji.

We also have the option of writing our own detection code using regex. However, this is complex and needs to cover a wide and ever-growing range of possible values. To do this successfully, we’d need to be able to accept updates from Unicode into our program.

As always, the full code for the examples is available over on GitHub.
