1. Overview
Traditionally, Java has represented a String internally as a char[] containing its characters, and every char takes 2 bytes because Java uses UTF-16 internally.
For instance, if a String contains a word in the English language, the leading 8 bits will all be 0 for every char, as an ASCII character can be represented using a single byte.
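As a minimal sketch of that observation, the hypothetical class HighByteDemo below (a name chosen for illustration, not part of the JDK or of this article's code) shifts each char right by 8 bits to inspect its upper byte:

```java
// Hypothetical demo class, used only to illustrate the point above.
class HighByteDemo {

    // Returns true when the upper 8 bits of every char in s are zero,
    // i.e. when every char would fit into a single byte.
    static boolean allHighBytesZero(String s) {
        for (int i = 0; i < s.length(); i++) {
            if ((s.charAt(i) >>> 8) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A plain ASCII word: the upper byte of every char is 0.
        System.out.println(allHighBytesZero("Java"));      // true
        // U+0416 (a Cyrillic letter) needs more than 8 bits.
        System.out.println(allHighBytesZero("\u0416ava")); // false
    }
}
```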
Many characters require 16 bits to represent them, but statistically most require only 8 bits, which is the LATIN-1 character representation. So, there is scope to improve both memory consumption and performance.
It's also important that Strings typically occupy a large proportion of the JVM heap space. And, because of the way the JVM stores them, in most cases a String instance can take up double the space it actually needs.
In this article, we'll discuss the Compressed String option, introduced in JDK 6, and the newer Compact String, introduced with JDK 9. Both were designed to optimize the memory consumption of Strings on the JVM.
2. Compressed String – Java 6
The JDK 6 update 21 Performance Release introduced a new VM option:
-XX:+UseCompressedStrings
When this option is enabled, Strings are stored as byte[] instead of char[], which saves a lot of memory. However, this option was eventually removed in JDK 7, mainly because it had some unintended performance consequences.
3. Compact String – Java 9
Java 9 has brought the concept of compact Strings back.
This means that whenever we create a String whose characters can all be represented using a single byte each (the LATIN-1 representation), a byte array is used internally, with one byte per character.
Otherwise, if any character requires more than 8 bits, all the characters are stored using two bytes each (the UTF-16 representation).
So basically, whenever possible, it’ll just use a single byte for each character.
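The rule can be sketched roughly as follows. The class CompactEncodingSketch and its methods are hypothetical names used for illustration; this is not the JDK's actual code, and the big-endian byte order for the two-byte case is simply a choice made for the sketch:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the "one byte per character whenever possible" rule.
class CompactEncodingSketch {

    // LATIN-1 covers exactly the code points 0x00 to 0xFF.
    static boolean fitsInLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    // One byte per char if every char fits, otherwise two bytes per char.
    static byte[] encode(String s) {
        return fitsInLatin1(s)
          ? s.getBytes(StandardCharsets.ISO_8859_1)
          : s.getBytes(StandardCharsets.UTF_16BE); // byte order is an assumption of this sketch
    }

    public static void main(String[] args) {
        System.out.println(encode("hello").length);       // 5: ASCII only
        System.out.println(encode("h\u00e9llo").length);  // 5: U+00E9 still fits in LATIN-1
        System.out.println(encode("hello\u20ac").length); // 12: U+20AC forces 2 bytes for all 6 chars
    }
}
```

Note how a single character outside LATIN-1 doubles the storage for the whole String, not just for that character.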
Now, the question is – how will all the String operations work? How will it distinguish between the LATIN-1 and UTF-16 representations?
To tackle this issue, another change was made to the internal implementation of String: a final field named coder preserves this information.
3.1. String Implementation in Java 9
Before Java 9, a String's content was stored as a char[]:
private final char[] value;
From Java 9 onward, it's a byte[]:
private final byte[] value;
Next to it sits the new coder field:
private final byte coder;
The coder can take one of two values:
static final byte LATIN1 = 0;
static final byte UTF16 = 1;
Most of the String operations now check the coder and dispatch to the specific implementation:
public int indexOf(int ch, int fromIndex) {
    return isLatin1()
      ? StringLatin1.indexOf(value, ch, fromIndex)
      : StringUTF16.indexOf(value, ch, fromIndex);
}

private boolean isLatin1() {
    return COMPACT_STRINGS && coder == LATIN1;
}
With all the info the JVM needs ready and available, the CompactStrings VM option is enabled by default. To disable it, we can use:
-XX:-CompactStrings
3.2. How coder Works
In the Java 9 String class implementation, the length is calculated as:
public int length() {
    return value.length >> coder;
}
If the String contains only LATIN-1 characters, the value of coder is 0, so the shift does nothing and the length of the String equals the length of the byte array.
In other cases, if the String is in UTF-16 representation, the value of coder will be 1, and hence the length will be half the size of the actual byte array.
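To make the shift concrete, here is a simplified, hypothetical model of the idea. ToyCompactString is an illustrative class, not the real java.lang.String; in particular, its big-endian byte layout and its charAt logic are assumptions of this sketch:

```java
import java.nio.charset.StandardCharsets;

// Simplified, hypothetical model of a byte[] + coder pair; the real String class differs in detail.
class ToyCompactString {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    private final byte[] value;
    private final byte coder;

    ToyCompactString(String s) {
        // Use one byte per char only if every char is at most 0xFF.
        boolean latin1 = s.chars().allMatch(c -> c <= 0xFF);
        this.coder = latin1 ? LATIN1 : UTF16;
        this.value = s.getBytes(
          latin1 ? StandardCharsets.ISO_8859_1 : StandardCharsets.UTF_16BE);
    }

    int byteCount() {
        return value.length;
    }

    // Same trick as above: shift by 0 for LATIN-1, by 1 (divide by 2) for UTF-16.
    int length() {
        return value.length >> coder;
    }

    char charAt(int index) {
        if (coder == LATIN1) {
            return (char) (value[index] & 0xFF);
        }
        int i = index << 1; // each char occupies two bytes
        return (char) (((value[i] & 0xFF) << 8) | (value[i + 1] & 0xFF));
    }

    public static void main(String[] args) {
        ToyCompactString a = new ToyCompactString("hello");
        ToyCompactString b = new ToyCompactString("hello\u20ac");
        System.out.println(a.byteCount() + " bytes, length " + a.length()); // 5 bytes, length 5
        System.out.println(b.byteCount() + " bytes, length " + b.length()); // 12 bytes, length 6
    }
}
```

Because coder is either 0 or 1, a single shift replaces a branch on the representation when computing the length.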
Note that all the changes made for Compact Strings are in the internal implementation of the String class and are fully transparent to developers using String.
4. Compact Strings vs. Compressed Strings
With JDK 6 Compressed Strings, a major problem was that the String constructor accepted only char[] as an argument. In addition, many String operations depended on the char[] representation rather than a byte array. As a result, a lot of unpacking had to be done, which hurt performance.
Compact Strings are not free either: maintaining the extra coder field adds some overhead. To mitigate the cost of the coder and of unpacking bytes to chars (in the case of the UTF-16 representation), some of the methods are intrinsified, and the assembly code generated by the JIT compiler has also been improved.
This change produced some counter-intuitive results. For a LATIN-1 String, indexOf(String) calls an intrinsic method, whereas indexOf(char) does not. For UTF-16, both of these methods call an intrinsic method. This issue affects only LATIN-1 Strings and is expected to be fixed in future releases.
Thus, Compact Strings perform better than Compressed Strings.
To find out how much memory is saved using the Compact Strings, various Java application heap dumps were analyzed. And, while results were heavily dependent on the specific applications, the overall improvements were almost always considerable.
4.1. Difference in Performance
Let’s see a very simple example of the performance difference between enabling and disabling Compact Strings:
import java.util.List;
import java.util.stream.IntStream;
import static java.util.stream.Collectors.toList;

long startTime = System.currentTimeMillis();
// Create 10 million short numeric Strings
List<String> strings = IntStream.rangeClosed(1, 10_000_000)
  .mapToObj(Integer::toString)
  .collect(toList());
long totalTime = System.currentTimeMillis() - startTime;
System.out.println(
  "Generated " + strings.size() + " strings in " + totalTime + " ms.");

startTime = System.currentTimeMillis();
// Naively concatenate the first 100,000 of them
String appended = strings.stream()
  .limit(100_000)
  .reduce("", (l, r) -> l + r);
totalTime = System.currentTimeMillis() - startTime;
System.out.println("Created string of length " + appended.length()
  + " in " + totalTime + " ms.");
Here, we are creating 10 million Strings and then appending them in a naive manner. When we run this code (Compact Strings are enabled by default), we get the output:
Generated 10000000 strings in 854 ms.
Created string of length 488895 in 5130 ms.
Similarly, if we run it with Compact Strings disabled via the -XX:-CompactStrings option, the output is:
Generated 10000000 strings in 936 ms.
Created string of length 488895 in 9727 ms.
Clearly, this is a surface-level test and can't be considered representative. It's only a snapshot of what the new option may do to improve performance in this particular scenario.
5. Conclusion
In this tutorial, we saw the attempts to optimize performance and memory consumption on the JVM by storing Strings in a memory-efficient way.
As always, the entire code is available over on GitHub.