1. Overview
1.概述
In computer science, data compression techniques play an important role in optimizing storage and transmission efficiency. One such technique that has stood the test of time is Run-length Encoding (RLE).
在计算机科学领域,数据压缩技术在优化存储和传输效率方面发挥着重要作用。Run-length Encoding (RLE) 就是这样一种经受住时间考验的技术。
In this tutorial, we’ll understand RLE and explore how to implement encoding and decoding in Java.
在本教程中,我们将了解 RLE 并探讨如何在 Java 中实现编码和解码。
2. Understanding Run-Length Encoding
2.了解运行长度编码
Run-length Encoding is a simple yet effective form of lossless data compression. The basic idea behind RLE is to represent consecutive identical elements, called a “run” in a data stream by a single value and its count rather than as the original run.
运行长度编码是一种简单而有效的无损数据压缩方式。RLE的基本思想是用一个值及其计数来表示数据流中连续的相同元素(称为 “运行”),而不是用原始运行来表示。
This is particularly useful when dealing with repetitive sequences, as it significantly reduces the amount of space needed to store or transmit the data.
这在处理重复序列时特别有用,因为它大大减少了存储或传输数据所需的空间。
RLE is well suited to compress palette-based bitmap images, such as computer icons and animations. A well-known example is the Microsoft Windows 3.x startup screen, which is RLE compressed.
RLE 非常适合压缩基于调色板的位图图像,如计算机图标和动画。一个著名的例子就是微软视窗 3.x 的启动屏幕,它就是用 RLE 压缩的。
Let’s consider the following example:
让我们来看看下面这个例子:
String INPUT = "WWWWWWWWWWWWBAAACCDEEEEE";
If we apply a run-length encoding data compression algorithm to the above string, it can be rendered as follows:
如果我们对上述字符串应用运行长度编码数据压缩算法,它可以呈现如下效果:
String RLE = "12W1B3A2C1D5E";
In the encoded sequence, each character follows the number of times it appears consecutively. This rule allows us to easily reconstruct the original data during decoding.
在编码序列中,每个字符都遵循其连续出现的次数。通过这一规则,我们可以在解码时轻松地重建原始数据。
It’s worth noting that the standard RLE only works with “textual” input. If the input contains numbers, RLE can’t encode them in a nonambiguous way.
值得注意的是,标准 RLE 只适用于 “文本 “输入。如果输入包含数字,RLE 无法以非明确的方式对其进行编码。
In this tutorial, we’ll explore two run-length encoding and decoding approaches.
在本教程中,我们将探讨两种运行长度编码和解码方法。
3. Character Array Based Solution
3.基于字符数组的解决方案
Now that we understand RLE and how it works, a classic approach to implementing run-length encoding and decoding is converting the input strings into a char array and then applying the encoding and decoding rules to the array.
既然我们已经了解了 RLE 及其工作原理,那么实现运行长度编码和解码的经典方法就是将输入字符串转换为 char 数组,然后对数组应用编码和解码规则。
3.1. Creating the Encoding Method
3.1.创建编码方法
The key to implementing run-length encoding is to identify each run and count its length. Let’s first look at the implementation and then understand how it works:
实现运行长度编码的关键是识别每个运行并计算其长度。让我们先看看实现方法,然后了解它是如何工作的:
String runLengthEncode(String input) {
StringBuilder result = new StringBuilder();
int count = 1;
char[] chars = input.toCharArray();
for (int i = 0; i < chars.length; i++) {
char c = chars[i];
if (i + 1 < chars.length && c == chars[i + 1]) {
count++;
} else {
result.append(count).append(c);
count = 1;
}
}
return result.toString();
}
Next, let’s quickly walk through the code above and understand the logic.
接下来,让我们快速浏览上面的代码,了解其中的逻辑。
First, we use StringBuilder to store each step’s result and concatenate them for better performance.
首先,我们使用StringBuilder来存储每个步骤的结果,并将它们连接起来以获得更好的性能。
After initializing a counter and converting the input string to a char array, the function iterates through each character in the input string.
在初始化计数器并将输入字符串转换为 char 数组后,该函数会遍历输入字符串中的每个字符。
For each character:
每个角色
- If the current character is the same as the next character and we are not at the end of the string, the count is incremented.
- If the current character is different from the next character or we are at the end of the string, the count and the current character are appended to the result StringBuilder. The count is then reset to 1 for the next unique character.
Finally, the StringBuilder is converted to a string using toString() and returned as the result of the encoding process.
最后,使用 toString() 将 StringBuilder 转换为字符串,并作为编码过程的结果返回。
When we test our INPUT with this encoding method, we get the expected result:
当我们使用这种编码方法测试 INPUT 时,会得到预期的结果:
assertEquals(RLE, runLengthEncode(INPUT));
3.2. Creating the Decoding Method
3.2.创建解码方法
Identifying each run is still crucial to decode a RLE string. A run includes the character and the number of times it appears, such as “12W” or “2C“.
识别每个运行对于解码 RLE 字符串仍然至关重要。运行包括字符及其出现的次数,如”12W“或”2C“。
Now, let’s create a decoding method:
现在,让我们创建一个解码方法:
String runLengthDecode(String rle) {
StringBuilder result = new StringBuilder();
char[] chars = rle.toCharArray();
int count = 0;
for (char c : chars) {
if (Character.isDigit(c)) {
count = 10 * count + Character.getNumericValue(c);
} else {
result.append(String.join("", Collections.nCopies(count, String.valueOf(c))));
count = 0;
}
}
return result.toString();
}
Next, let’s break down the code and understand step by step how it works.
接下来,让我们分解代码,逐步了解其工作原理。
First, we create a StringBuilder object to hold step results and convert the rle string to a char array for later processing.
首先,我们创建一个 StringBuilder 对象来保存步骤结果,并将 rle 字符串转换为字符数组,以便稍后处理。
Also, we initialized an integer variable count to keep track of the count of consecutive occurrences.
此外,我们还初始化了一个整数变量 count 以记录连续出现的次数。
Then, we iterate through each character in the RLE-encoded string. For each character:
然后,我们遍历 RLE 编码字符串中的每个字符。对于每个字符
- If the character is a digit, it contributes to the count. The count is updated by appending the digit to the existing count using the formula 10 * count + Character.getNumericValue(c). This is done to handle multi-digit counts.
- If the character is not a digit, a new character is encountered. The current character is then appended to the result StringBuilder repeated count times using Collections.nCopies(). The count is then reset to 0 for the next set of consecutive occurrences.
It’s worth noting that if we work with Java 11 or later, String.valueOf(c).repeat(count) is a better alternative to repeat the character c count times.
值得注意的是,如果我们使用 Java 11 或更高版本,String.valueOf(c).repeat(count) 是重复字符 c count 次的更好选择。
When we verify the decoding method using our example, the test passes:
当我们使用示例验证解码方法时,测试通过了:
assertEquals(INPUT, runLengthDecode(RLE));
4. Regex Based Solution
4.基于 Regex 的解决方案
Regex is a powerful tool for dealing with characters and strings. Let’s see if we can perform run-length encoding and decoding using regex.
Regex 是处理字符和字符串的强大工具。让我们看看能否使用 regex 执行运行长度编码和解码。
4.1. Creating the Encoding Method
4.1.创建编码方法
Let’s first look at the input string. If we can split it into a “run array”, then the problem can be easily solved:
让我们先看看输入字符串。如果我们能将其分割成一个 “运行数组”,那么问题就很容易解决了:
Input : "WWWWWWWWWWWWBAAACCDEEEEE"
Run Array : ["WWWWWWWWWWWW", "B", "AAA", "CC", "D", "EEEEE"]
We cannot split the input string by a character to achieve this. Instead, we must split by a zero-width position, such as the position between ‘W‘ and ‘B‘, ‘B‘ and ‘A“, etc.
为此,我们不能按字符分割输入字符串。相反,我们必须按零宽度位置分割,例如”W“和”B“、”B“和”A“之间的位置,等等。
It isn’t hard to discover the rules of these positions: the characters before and after the position are different. Therefore, we can build a regex to match the required positions using look-around and back reference: “(?<=(\\D))(?!\\1)“. Let’s quickly understand what this regex means:
要发现这些位置的规则并不难: 位置前后的字符是不同的。因此,我们可以使用 look-around 和后向引用构建一个 regex 来匹配所需的位置:”(?<=(\D))(?!\1)“。让我们快速了解一下这个 regex 的含义:
- (?<=(\\D)) – Positive look-behind assertion ensures the match is after a non-digit character (\\D represents a non-digit character).
- (?!\\1) – Negative lookahead assertion ensures the matched position isn’t before the same character as in the positive lookbehind. \\1 refers to the previously matched non-digit character.
The combination of these assertions ensures that the split occurs at the boundaries between consecutive runs of the same character.
这些断言的组合确保了分割发生在同一字符的连续运行之间的边界。
Next, let’s create the encoding method:
接下来,让我们创建编码方法:
String runLengthEncodeByRegEx(String input) {
String[] arr = input.split("(?<=(\\D))(?!\\1)");
StringBuilder result = new StringBuilder();
for (String run : arr) {
result.append(run.length()).append(run.charAt(0));
}
return result.toString();
}
As we can see, after we obtain the run array, the rest of the task is simply appending each run’s length and the character to the prepared StringBuilder.
正如我们所见,在获得运行数组后,剩下的任务就是将每个运行的长度和字符追加到准备好的 StringBuilder. 中。
The runLengthEncodeByRegEx() method passes our test:
runLengthEncodeByRegEx()方法通过了我们的测试:
assertEquals(RLE, runLengthEncodeByRegEx(INPUT));
4.2. Creating the Decoding Method
4.2.创建解码方法
We can follow a similar idea to decode a RLE-encoded string. First, we need to split the encoded string and obtain the following array:
我们可以按照类似的思路来解码 RLE 编码字符串。首先,我们需要分割编码后的字符串,得到以下数组:
RLE String: "12W1B3A2C1D5E"
Array : ["12", "W", "1", "B", "3", "A", "2", "C", "1", "D", "5", "E"]
Once we get this array, we can generate the decoded string by easily repeating each character, such as ‘W‘ 12 times, ‘B‘ once, etc.
得到这个数组后,我们就可以通过简单地重复每个字符来生成解码字符串,例如’W‘ 12次,’B‘ 一次等。
We’ll use the look-around technique again to create the regex to split the input string: “(?<=\\D)|(?=\\D+)“.
我们将再次使用查找技术创建 regex 来分割输入字符串:”(?<=\D)|(?=\D+)“.
In this regex:
在这个 regex 中
- (?<=\\D) – Positive look-behind assertion ensures that the split occurs after a non-digit character.
- | – Indicates the “OR” relation
- (?=\\D+) – Positive lookahead assertion ensures the split occurs before one or more non-digit characters.
This combination allows the split to occur at the boundaries between consecutive counts and characters in the RLE-encoded string.
这种组合允许在 RLE 编码字符串中的连续计数和字符之间进行分割。
Next, let’s build the decoding method based on the regex-based split:
接下来,让我们根据基于 regex 的分割建立解码方法:
String runLengthDecodeByRegEx(String rle) {
if (rle.isEmpty()) {
return "";
}
String[] arr = rle.split("(?<=\\D)|(?=\\D+)");
if (arr.length % 2 != 0) {
throw new IllegalArgumentException("Not a RLE string");
}
StringBuilder result = new StringBuilder();
for (int i = 1; i <= arr.length; i += 2) {
int count = Integer.parseInt(arr[i - 1]);
String c = arr[i];
result.append(String.join("", Collections.nCopies(count, c)));
}
return result.toString();
}
As the code shows, we included some simple validations at the beginning of the method. Then, while iterating the split array, we retrieved the count and character from the array and appended the character repeated count times to the result StringBuilder.
如代码所示,我们在方法的开头加入了一些简单的验证。然后,在遍历拆分数组时,我们从数组中获取计数和字符,并将重复计数的字符附加到 result StringBuilder 中。
Finally, this decoding method works with our example:
最后,这种解码方法适用于我们的示例:
assertEquals(INPUT, runLengthDecodeByRegEx(RLE));
5. Conclusion
5.结论
In this article, we first discussed how run-length encoding works and then explored two approaches to implementing run-length encoding and decoding.
在本文中,我们首先讨论了运行长度编码的工作原理,然后探讨了实现运行长度编码和解码的两种方法。
As always, the complete source code for the examples is available over on GitHub.
与往常一样,这些示例的完整源代码可在 GitHub 上获取。