1. Overview
Checking whether a String complies with business rules is crucial for most applications. Often, we need to check that a name contains only allowed characters, that an email is in the correct format, or that a password meets certain restrictions.
In this tutorial, we’ll learn how to check if a String is alphanumeric, which can be helpful in many cases.
2. Alphanumeric Characters
First, let’s define the term explicitly to avoid any confusion. Alphanumeric characters are a combination of letters and numbers, more specifically, Latin letters and Arabic digits. Thus, we won’t consider any special characters or underscores to be alphanumeric.
3. Checking Approaches
In general, we have two main approaches to this problem. The first uses a regex pattern, and the second checks all the characters individually.
3.1. Using Regex
This is the simplest approach, which requires us to provide the correct regex pattern. In our case, we’ll be using this one:
String REGEX = "^[a-zA-Z0-9]*$";
Technically, we could use the \w shorthand, which matches “word characters,” but unfortunately, it doesn’t meet our requirement: this class also includes the underscore and is equivalent to [a-zA-Z0-9_].
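To see the difference in practice, here is a minimal sketch that runs the same input through both patterns. The class name WordCharacterDemo, the helper methods, and the sample input are made up for illustration:

```java
import java.util.regex.Pattern;

class WordCharacterDemo {
    // Strict pattern from this tutorial: Latin letters and Arabic digits only.
    static final Pattern STRICT = Pattern.compile("^[a-zA-Z0-9]*$");
    // By default, \w is equivalent to [a-zA-Z0-9_], so it also accepts underscores.
    static final Pattern WORD = Pattern.compile("^\\w*$");

    static boolean matchesStrict(String input) {
        return STRICT.matcher(input).matches();
    }

    static boolean matchesWord(String input) {
        return WORD.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "user_name42";
        System.out.println(matchesStrict(input)); // false: the underscore is rejected
        System.out.println(matchesWord(input));   // true: \w allows the underscore
    }
}
```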
After identifying a correct pattern, the next step is to check a given String against it. It can be done directly on the String itself:
boolean result = TEST_STRING.matches(REGEX);
However, it’s not the best way, especially if we need to do such checks regularly. The matches(String) method recompiles the regex on every invocation. Thus, it’s better to compile the pattern once and reuse it as a static Pattern:
Pattern PATTERN = Pattern.compile(REGEX);
Matcher matcher = PATTERN.matcher(TEST_STRING);
boolean result = matcher.matches();
Overall, it’s a straightforward, flexible approach that makes the code simple and understandable.
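As a rough illustration of how the precompiled pattern could be packaged for reuse, consider the sketch below. The class name AlphanumericRegexChecker and its method are hypothetical and not part of the original code:

```java
import java.util.regex.Pattern;

final class AlphanumericRegexChecker {
    // Compiled once when the class is loaded and reused for every check.
    private static final Pattern ALPHANUMERIC = Pattern.compile("^[a-zA-Z0-9]*$");

    private AlphanumericRegexChecker() {
    }

    // Returns true when the input consists solely of Latin letters and Arabic digits.
    // Because of the * quantifier, the empty string is accepted as well.
    static boolean isAlphanumeric(String input) {
        return ALPHANUMERIC.matcher(input).matches();
    }
}
```

If empty input should be rejected, replacing the * quantifier with + in the pattern would be one way to do it.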
3.2. Checking Characters One-by-One
Another approach is to check each character in the String. We can use any approach to iterate over a given String. For demonstration purposes, let’s go with a simple for loop:
boolean result = true;
for (int i = 0; i < TEST_STRING.length(); ++i) {
    int codePoint = TEST_STRING.codePointAt(i);
    if (!isAlphanumeric(codePoint)) {
        result = false;
        break;
    }
}
We can implement isAlphanumeric(int) in several ways, but overall, we must match the character code against the ASCII table. The ASCII table is sufficient here because we limited ourselves earlier to Latin letters and Arabic digits:
boolean isAlphanumeric(final int codePoint) {
    return (codePoint >= 65 && codePoint <= 90) ||  // 'A'..'Z'
      (codePoint >= 97 && codePoint <= 122) ||      // 'a'..'z'
      (codePoint >= 48 && codePoint <= 57);         // '0'..'9'
}
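The numeric codes can be hard to read at a glance. As one possible alternative, the same ranges can be written with character literals, which the compiler treats as the same numeric values. The class name AsciiAlphanumeric is an assumption made for this sketch:

```java
final class AsciiAlphanumeric {
    private AsciiAlphanumeric() {
    }

    // Same ranges as the numeric version: 'A' is 65, 'a' is 97, and '0' is 48.
    static boolean isAlphanumeric(final int codePoint) {
        return (codePoint >= 'A' && codePoint <= 'Z')
          || (codePoint >= 'a' && codePoint <= 'z')
          || (codePoint >= '0' && codePoint <= '9');
    }
}
```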
Additionally, we can use Character.isAlphabetic(int) and Character.isDigit(int). These methods are highly optimized and may boost the performance of the application. Note that they follow the Unicode definitions, so they also accept letters and digits from other scripts:
boolean result = true;
for (int i = 0; i < TEST_STRING.length(); ++i) {
    final int codePoint = TEST_STRING.codePointAt(i);
    // Fail only when the character is neither a letter nor a digit
    if (!Character.isAlphabetic(codePoint) && !Character.isDigit(codePoint)) {
        result = false;
        break;
    }
}
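Because Character.isAlphabetic(int) and Character.isDigit(int) work on the whole Unicode range, the check above is broader than the definition from section 2. The following sketch shows the difference and one possible way to keep the check strict by also requiring the character to be in the ASCII range. The class and method names are hypothetical:

```java
final class UnicodeAwareCheck {
    private UnicodeAwareCheck() {
    }

    // Accepts any Unicode letter or digit, not only Latin letters and Arabic digits.
    static boolean isUnicodeAlphanumeric(String input) {
        for (int i = 0; i < input.length(); ) {
            int codePoint = input.codePointAt(i);
            if (!Character.isAlphabetic(codePoint) && !Character.isDigit(codePoint)) {
                return false;
            }
            // Advance by one or two chars so surrogate pairs are read only once.
            i += Character.charCount(codePoint);
        }
        return true;
    }

    // Same check, but anything outside the ASCII range (code 128 and above) is rejected.
    static boolean isAsciiAlphanumeric(String input) {
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 128 || (!Character.isAlphabetic(c) && !Character.isDigit(c))) {
                return false;
            }
        }
        return true;
    }
}
```

With these two methods, a Cyrillic word passes the Unicode-aware check but fails the ASCII-restricted one, while a plain input such as abc123 passes both.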
This approach requires more code and is highly imperative. At the same time, it gives us the benefit of a transparent implementation. However, some implementations might unintentionally worsen the space complexity of this approach:
boolean result = true;
for (final char c : TEST_STRING.toCharArray()) {
    if (!isAlphanumeric(c)) {
        result = false;
        break;
    }
}
The toCharArray() method creates a separate array to hold the characters from the String, degrading the space complexity from O(1) to O(n). The Stream API approach is also concise, but the stream pipeline comes with overhead of its own:
boolean result = TEST_STRING.chars().allMatch(this::isAlphanumeric);
Please pay attention to these pitfalls, especially if the performance is crucial to the application.
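One way to avoid the extra array is to read the characters in place with charAt(int). The sketch below shows that variant next to a Stream version that uses a static method reference, so it doesn’t need an enclosing instance the way this::isAlphanumeric does. The class name NoCopyChecks and its method names are assumptions for illustration:

```java
final class NoCopyChecks {
    private NoCopyChecks() {
    }

    static boolean isAlphanumeric(final int c) {
        return (c >= 'A' && c <= 'Z')
          || (c >= 'a' && c <= 'z')
          || (c >= '0' && c <= '9');
    }

    // Reads each character directly from the String, so no array is allocated.
    static boolean checkWithCharAt(String input) {
        for (int i = 0; i < input.length(); i++) {
            if (!isAlphanumeric(input.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Stream variant; allMatch stops at the first character that fails the check.
    static boolean checkWithStream(String input) {
        return input.chars().allMatch(NoCopyChecks::isAlphanumeric);
    }
}
```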
4. Pros and Cons
From the previous examples, it’s clear that the first approach is simpler to write and read, while the second requires more code and could potentially contain more bugs. However, let’s compare them from a performance perspective with JMH. The tests are set to run for only a minute, as that’s enough to compare their throughput.
We get the following results. The score shows the number of operations per second. Thus, a higher score indicates a more performant solution:
Benchmark                                                                   Mode  Cnt           Score  Error  Units
AlphanumericPerformanceBenchmark.alphanumericIteration                     thrpt        165036629.641         ops/s
AlphanumericPerformanceBenchmark.alphanumericIterationWithCharacterChecks  thrpt       2350726870.739         ops/s
AlphanumericPerformanceBenchmark.alphanumericIterationWithCopy             thrpt        129884251.890         ops/s
AlphanumericPerformanceBenchmark.alphanumericIterationWithStream           thrpt         40552684.681         ops/s
AlphanumericPerformanceBenchmark.alphanumericRegex                         thrpt         23739293.608         ops/s
AlphanumericPerformanceBenchmark.alphanumericRegexDirectlyOnString         thrpt         10536565.422         ops/s
As we can see, there’s a readability-performance tradeoff. More readable and more declarative solutions tend to be less performant. At the same time, please note that unnecessary optimization may do more harm than good. Thus, for most applications, regex is a good and clean solution that can be easily extended.
However, if an application matches a high volume of text against specific rules, the iterative approach performs better, which ultimately reduces CPU usage and downtime and increases throughput.
5. Conclusion
There are a couple of ways to check if a String is alphanumeric. Both have pros and cons, which should be carefully considered. The choice comes down to extensibility versus performance.
Optimize the code when there’s a real need for performance, as optimized code is often less readable and more prone to hard-to-debug bugs.
As always, the code is available over on GitHub.