1. Overview
According to Wikipedia, an anagram is a word or phrase formed by rearranging the letters of a different word or phrase.
We can generalize this in string processing by saying that an anagram of a string is another string with exactly the same quantity of each character in it, in any order.
In this tutorial, we’re going to look at detecting whole-string anagrams, where the quantity of each character must be equal, including non-alphabetic characters such as spaces and digits. For example, “!low-salt!” and “owls-lat!!” would be considered anagrams as they contain exactly the same characters.
2. Solution
Let’s compare a few solutions that can decide if two strings are anagrams. Each solution will check at the start whether the two strings have the same number of characters. This is a quick way to exit early since inputs with different lengths cannot be anagrams.
For each possible solution, let’s look at the implementation complexity for us as developers. We’ll also look at the time complexity for the CPU, using big O notation.
3. Check by Sorting
We can normalize each string by sorting its characters, which produces two sorted arrays of characters.
If two strings are anagrams, their normalized forms should be the same.
In Java, we can first convert the two strings into char[] arrays. Then we can sort these two arrays and check for equality:
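import java.util.Arrays;

boolean isAnagramSort(String string1, String string2) {
    // Strings of different lengths can never be anagrams
    if (string1.length() != string2.length()) {
        return false;
    }
    // Work on copies so the sort does not touch the inputs
    char[] a1 = string1.toCharArray();
    char[] a2 = string2.toCharArray();
    Arrays.sort(a1);
    Arrays.sort(a2);
    // Anagrams have identical sorted forms
    return Arrays.equals(a1, a2);
}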
This solution is easy to understand and implement. However, the overall running time of this algorithm is O(n log n) because sorting an array of n characters takes O(n log n) time.
For the algorithm to function, it must make a copy of both input strings as character arrays, using a little extra memory.
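To make the normalization step concrete, here is a minimal sketch that sorts the two example strings from the overview and compares the results. The class name SortedFormDemo and the helper normalize are hypothetical names chosen for this illustration, not part of the original code.

```java
import java.util.Arrays;

final class SortedFormDemo {

    // Hypothetical helper: returns the characters of the input in ascending order
    static String normalize(String input) {
        char[] chars = input.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        String first = normalize("!low-salt!");
        String second = normalize("owls-lat!!");
        // Both strings reduce to the same sorted form
        System.out.println(first);
        System.out.println(second);
        System.out.println(first.equals(second));
    }
}
```

Both inputs normalize to the same sequence of characters, which is exactly the condition that isAnagramSort checks with Arrays.equals.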
4. Check by Counting
An alternative strategy is to count the number of occurrences of each character in our inputs. If these histograms are equal between the inputs, then the strings are anagrams.
To save a little memory, let’s build only one histogram. We’ll increment the count for each character in the first string and decrement the count for each character in the second. If the two strings are anagrams, every count will balance out to 0.
The histogram needs a fixed-size table of counts with a size defined by the character set size. For example, if we only use one byte to store each character, then we can use a counting array size of 256 to count the occurrence of each character:
private static final int CHARACTER_RANGE = 256;

public boolean isAnagramCounting(String string1, String string2) {
    if (string1.length() != string2.length()) {
        return false;
    }
    // Assumes every character value is below CHARACTER_RANGE;
    // a larger value throws ArrayIndexOutOfBoundsException
    int[] count = new int[CHARACTER_RANGE];
    for (int i = 0; i < string1.length(); i++) {
        count[string1.charAt(i)]++;
        count[string2.charAt(i)]--;
    }
    // Any non-zero entry means the character counts differ
    for (int i = 0; i < CHARACTER_RANGE; i++) {
        if (count[i] != 0) {
            return false;
        }
    }
    return true;
}
This solution is faster, with a time complexity of O(n). However, it needs extra space for the counting array. At 256 integers, which is enough to cover ASCII, that’s not too bad.
However, Java strings store 16-bit char values, so any character with a value of 256 or above makes the code above throw an ArrayIndexOutOfBoundsException. If we increase CHARACTER_RANGE to cover every possible char value (65,536 entries), or every Unicode code point, the table becomes much more memory hungry. Therefore, this approach is only really practical when the number of possible characters is in a small range.
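One possible way around the fixed table size, sketched here under the assumption that we are willing to trade the array for a hash map, is to count Unicode code points in a Map. The class name MapCountingAnagram and the method isAnagramCountingMap are hypothetical names for this illustration. The sketch keeps the same increment/decrement idea, but only stores entries for characters that actually occur.

```java
import java.util.HashMap;
import java.util.Map;

final class MapCountingAnagram {

    // Hypothetical variant of the counting check that works on code points
    static boolean isAnagramCountingMap(String string1, String string2) {
        // Equal code point counts imply equal lengths, so this early exit is still valid
        if (string1.length() != string2.length()) {
            return false;
        }
        Map<Integer, Integer> counts = new HashMap<>();
        // Increment for the first string, decrement for the second
        string1.codePoints().forEach(cp -> counts.merge(cp, 1, Integer::sum));
        string2.codePoints().forEach(cp -> counts.merge(cp, -1, Integer::sum));
        // The strings are anagrams only if every count balances to zero
        return counts.values().stream().allMatch(count -> count == 0);
    }
}
```

The memory used now grows with the number of distinct characters in the inputs rather than with the size of the character set, at the cost of boxing and hashing overhead on each update.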
From a development point of view, this solution contains more code to maintain and makes less use of Java library functions.
5. Check with Multiset
We can simplify the counting and comparing process by using Multiset, a Guava collection that supports order-independent equality with duplicate elements. For example, the multisets {a, a, b} and {a, b, a} are equal.
To use Multiset, we first need to add the Guava dependency to our project pom.xml file:
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>31.0.1-jre</version>
</dependency>
We’ll convert each of our input strings into a Multiset of characters. Then we’ll check if they’re equal:
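import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;

boolean isAnagramMultiset(String string1, String string2) {
    if (string1.length() != string2.length()) {
        return false;
    }
    Multiset<Character> multiset1 = HashMultiset.create();
    Multiset<Character> multiset2 = HashMultiset.create();
    // Lengths are equal, so one loop can fill both multisets
    for (int i = 0; i < string1.length(); i++) {
        multiset1.add(string1.charAt(i));
        multiset2.add(string2.charAt(i));
    }
    // Equal when every character occurs the same number of times in both
    return multiset1.equals(multiset2);
}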
This algorithm solves the problem in O(n) time without having to declare a big counting array.
It’s similar to the previous counting solution. However, rather than using a fixed-size table to count, we take advantage of the Multiset class to simulate a variable-sized table, with a count for each character.
The code for this solution makes more use of high-level library capabilities than our counting solution.
6. Letter-based Anagram
The examples so far do not strictly adhere to the linguistic definition of an anagram. This is because they consider punctuation characters part of the anagram, and they are case sensitive.
Let’s adapt the algorithms to enable a letter-based anagram. Let’s only consider the rearrangement of case-insensitive letters, irrespective of other characters such as whitespace and punctuation. For example, “A decimal point” and “I’m a dot in place.” would be anagrams of each other.
To solve this problem, we can first preprocess the two input strings to filter out unwanted characters and convert the letters to lowercase. Then we can use one of the above solutions (say, the Multiset solution) to check for anagrams on the processed strings:
String preprocess(String source) {
    return source.replaceAll("[^a-zA-Z]", "").toLowerCase();
}

boolean isLetterBasedAnagramMultiset(String string1, String string2) {
    return isAnagramMultiset(preprocess(string1), preprocess(string2));
}
This preprocessing approach gives us a general way to handle variants of the anagram problem. For example, if we also want to include digits, we just need to adjust the preprocessing filter.
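As a rough sketch of that adjustment, the filter below keeps ASCII digits as well as letters. The names AlphanumericAnagram, preprocessWithDigits, and isAlphanumericAnagram are hypothetical and chosen only for this illustration. To keep the example free of external dependencies, it reuses the sorting check rather than the Multiset one, and it passes Locale.ROOT to toLowerCase so that the result does not depend on the default locale.

```java
import java.util.Arrays;
import java.util.Locale;

final class AlphanumericAnagram {

    // Hypothetical variant of preprocess(): keeps ASCII letters and digits only
    static String preprocessWithDigits(String source) {
        return source.replaceAll("[^a-zA-Z0-9]", "").toLowerCase(Locale.ROOT);
    }

    // Sort-based anagram check applied to the preprocessed strings
    static boolean isAlphanumericAnagram(String string1, String string2) {
        char[] a1 = preprocessWithDigits(string1).toCharArray();
        char[] a2 = preprocessWithDigits(string2).toCharArray();
        Arrays.sort(a1);
        Arrays.sort(a2);
        // Arrays of different lengths are never equal, so no separate length check is needed
        return Arrays.equals(a1, a2);
    }
}
```

With this filter, “Room 101” and “moor-110!” count as anagrams, while “Room 101” and “Room 102” do not, because the digits now take part in the comparison.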
7. Conclusion
In this article, we looked at three algorithms for checking whether a given string is an anagram of another, character for character. For each solution, we discussed the trade-offs between the speed, readability, and size of memory required.
We also looked at how to adapt the algorithms to check for anagrams in the more traditional linguistic sense. We achieved this by preprocessing the inputs into lowercase letters.
As always, the source code for the article is available over on GitHub.