1. Overview
In this article, we’ll see how sometimes false sharing can turn multithreading against us.
First, we’re going to start with a little bit on the theory of caching and spatial locality. Then we’ll rewrite the LongAdder concurrent utility and benchmark it against the java.util.concurrent implementation. Throughout the article, we’ll use the benchmark results at different levels to investigate the effect of false sharing.
The Java-related part of the article depends heavily on the memory layout of objects. Since these layout details are not part of the JVM specification and are left to the discretion of the implementor, we’ll only focus on one specific JVM implementation: the HotSpot JVM. We may also use the JVM and HotSpot JVM terms interchangeably throughout the article.
2. Cache Line and Coherency
Processors use different levels of caching — when a processor reads a value from the main memory, it may cache that value to improve performance.
As it turns out, most modern processors not only cache the requested value but also cache a few more nearby values. This optimization is based on the idea of spatial locality and can significantly improve the overall performance of applications. Simply put, processor caches are working in terms of cache lines, instead of single cacheable values.
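To get a feel for spatial locality, we can try a quick experiment of our own (a minimal sketch, not part of the original benchmark; the class name and sizes are arbitrary). Summing a 2D array row by row walks memory sequentially, so each fetched cache line serves several subsequent reads, while summing it column by column touches a new cache line on almost every access:
public class SpatialLocality {

    private static final int SIZE = 2048;
    private static final long[][] MATRIX = new long[SIZE][SIZE];

    public static void main(String[] args) {
        long sum = 0;

        long start = System.nanoTime();
        for (int row = 0; row < SIZE; row++) { // cache-friendly traversal
            for (int col = 0; col < SIZE; col++) {
                sum += MATRIX[row][col];
            }
        }
        System.out.println("Row-major took " + (System.nanoTime() - start) / 1_000_000 + " ms");

        start = System.nanoTime();
        for (int col = 0; col < SIZE; col++) { // cache-hostile traversal
            for (int row = 0; row < SIZE; row++) {
                sum += MATRIX[row][col];
            }
        }
        System.out.println("Column-major took " + (System.nanoTime() - start) / 1_000_000 + " ms");

        System.out.println(sum); // keep the work observable
    }
}
Such naive timing is only illustrative, of course; a harness like JMH, which we'll use later, is the right tool for real measurements.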
When multiple processors are operating on the same or nearby memory locations, they may end up sharing the same cache line. In such situations, it’s essential to keep those overlapping caches in different cores consistent with each other. The act of maintaining such consistency is called cache coherency.
There are quite a few protocols to maintain the cache coherency between CPU cores. In this article, we’re going to talk about the MESI protocol.
2.1. The MESI Protocol
In the MESI protocol, each cache line can be in one of these four distinct states: Modified, Exclusive, Shared, or Invalid. The word MESI is the acronym of these states.
To better understand how this protocol works, let’s walk through an example. Suppose two cores are going to read from nearby memory locations:
Core A reads the value of a from the main memory. As shown above, this core fetches a few more values from the memory and stores them into a cache line. Then it marks that cache line as exclusive since core A is the only core operating on this cache line. From now on, when possible, this core will avoid the inefficient memory access by reading from the cache line instead.
After a while, core B also decides to read the value of b from the main memory:
Since a and b are so close to each other and reside in the same cache line, both cores will tag their cache lines as shared.
Now, let’s suppose that core A decides to change the value of a:
Core A stores this change only in its store buffer and marks its cache line as modified. Also, it communicates this change to core B, and that core will, in turn, mark its cache line as invalid.
That’s how different processors make sure that their caches are coherent with each other.
3. False Sharing
Now, let’s see what happens when core B decides to re-read the value of b. As this value didn’t change recently, we might expect a fast read from the cache line. However, the nature of shared multiprocessor architecture invalidates this expectation in reality.
As mentioned earlier, the whole cache line was shared between the two cores. Since the cache line for core B is invalid now, it should read the value b from the main memory again:
As shown above, reading the same b value from the main memory is not the only inefficiency here. This memory access will also force core A to flush its store buffer, as core B needs to get the latest value. After flushing and fetching the values, both cores will end up with the latest cache line version tagged in the shared state again:
So, this imposes a cache miss on one core and an early buffer flush on another, even though the two cores weren't operating on the same memory location. This phenomenon, known as false sharing, can hurt the overall performance, especially when the rate of cache misses is high. More specifically, when this rate is high, processors will constantly reach out to the main memory instead of reading from their caches.
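Before moving on, here's a minimal, self-contained sketch of the phenomenon (the class and field names are ours, purely for illustration): two threads increment two different fields that, without padding, will almost certainly land on the same cache line:
public class FalseSharingDemo {

    static class Counters {
        volatile long a; // adjacent fields: very likely on the same cache line
        volatile long b;
    }

    public static void main(String[] args) throws InterruptedException {
        Counters counters = new Counters();

        // each thread writes to its *own* field, yet every write invalidates
        // the cache line the other core holds
        Thread first = new Thread(() -> {
            for (long i = 0; i < 100_000_000L; i++) {
                counters.a++;
            }
        });
        Thread second = new Thread(() -> {
            for (long i = 0; i < 100_000_000L; i++) {
                counters.b++;
            }
        });

        long start = System.nanoTime();
        first.start();
        second.start();
        first.join();
        second.join();
        System.out.println("Took " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
Padding the two fields apart, as we'll see shortly, typically makes this run considerably faster, even though the total amount of work is unchanged.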
4. Example: Dynamic Striping
To demonstrate how false sharing can affect the throughput or latency of applications, we’re going to cheat in this section. Let’s define two empty classes:
abstract class Striped64 extends Number {}
public class LongAdder extends Striped64 implements Serializable {}
Of course, empty classes aren’t that useful, so let’s copy-paste some logic into them.
For our Striped64 class, we can copy everything from the java.util.concurrent.atomic.Striped64 class and paste it into our class. Please make sure to copy the import statements, too. Also, if using Java 8, we should make sure to replace any call to the sun.misc.Unsafe.getUnsafe() method with a custom one:
private static Unsafe getUnsafe() {
    try {
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);

        return (Unsafe) field.get(null);
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}
We can’t call the sun.misc.Unsafe.getUnsafe() from our application classloader, so we have to cheat again with this static method. As of Java 9, however, the same logic is implemented using VarHandles, so we won’t need to do anything special there, and just a simple copy-paste would suffice.
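For reference, that VarHandle-based pattern looks roughly like this (a sketch with illustrative names, not the actual Striped64 source):
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class VarHandleSketch {

    private volatile long base;

    private static final VarHandle BASE;

    static {
        try {
            BASE = MethodHandles.lookup()
              .findVarHandle(VarHandleSketch.class, "base", long.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    boolean casBase(long expected, long update) {
        // atomic compare-and-set, no Unsafe required
        return BASE.compareAndSet(this, expected, update);
    }
}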
For the LongAdder class, let’s copy everything from the java.util.concurrent.atomic.LongAdder class and paste it into ours. Again, we should copy the import statements, too.
Now, let’s benchmark these two classes against each other: our custom LongAdder and java.util.concurrent.atomic.LongAdder.
4.1. Benchmark
To benchmark these classes against each other, let’s write a simple JMH benchmark:
@State(Scope.Benchmark)
public class FalseSharing {

    private java.util.concurrent.atomic.LongAdder builtin = new java.util.concurrent.atomic.LongAdder();
    private LongAdder custom = new LongAdder();

    @Benchmark
    public void builtin() {
        builtin.increment();
    }

    @Benchmark
    public void custom() {
        custom.increment();
    }
}
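As a side note, instead of the command-line invocation below, we could also launch the benchmark programmatically with an equivalent JMH Runner (a sketch, assuming it can see our FalseSharing class; the options mirror the arguments we pass on the command line):
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class FalseSharingRunner {

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
          .include(FalseSharing.class.getSimpleName())
          .mode(Mode.Throughput) // Mode.AverageTime for the latency numbers
          .forks(2)
          .threads(16)
          .build();

        new Runner(options).run();
    }
}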
If we run this benchmark with two forks and 16 threads in throughput benchmark mode (the equivalent of passing "--bm thrpt -f 2 -t 16" arguments), then JMH will print these stats:
Benchmark Mode Cnt Score Error Units
FalseSharing.builtin thrpt 40 523964013.730 ± 10617539.010 ops/s
FalseSharing.custom thrpt 40 112940117.197 ± 9921707.098 ops/s
The result doesn’t make sense at all: the JDK built-in implementation dwarfs our copy-pasted solution, delivering almost 360% more throughput.
Let’s see the difference between latencies:
Benchmark Mode Cnt Score Error Units
FalseSharing.builtin avgt 40 28.396 ± 0.357 ns/op
FalseSharing.custom avgt 40 51.595 ± 0.663 ns/op
As shown above, the built-in solution also has better latency characteristics.
To better understand what’s so different about these seemingly identical implementations, let’s inspect some low-level performance monitoring counters.
5. Perf Events
To instrument low-level CPU events, such as cycles, stall cycles, instructions per cycle, cache loads/misses, or memory loads/stores, we can program special hardware registers on the processors.
As it turns out, tools like perf or eBPF are already using this approach to expose useful metrics. As of Linux 2.6.31, perf is the standard Linux profiler capable of exposing useful Performance Monitoring Counters or PMCs.
So, we can use perf events to see what’s going on at the CPU level when running each of these two benchmarks. For instance, if we run:
perf stat -d java -jar benchmarks.jar -f 2 -t 16 --bm thrpt custom
Perf will make JMH run the benchmarks against the copy-pasted solution and print the stats:
161657.133662 task-clock (msec) # 3.951 CPUs utilized
9321 context-switches # 0.058 K/sec
185 cpu-migrations # 0.001 K/sec
20514 page-faults # 0.127 K/sec
0 cycles # 0.000 GHz
219476182640 instructions
44787498110 branches # 277.052 M/sec
37831175 branch-misses # 0.08% of all branches
91534635176 L1-dcache-loads # 566.227 M/sec
1036004767 L1-dcache-load-misses # 1.13% of all L1-dcache hits
The L1-dcache-load-misses field represents the number of cache misses for the L1 data cache. As shown above, this solution has encountered around one billion cache misses (1,036,004,767 to be exact). If we gather the same stats for the built-in approach:
161742.243922 task-clock (msec) # 3.955 CPUs utilized
9041 context-switches # 0.056 K/sec
220 cpu-migrations # 0.001 K/sec
21678 page-faults # 0.134 K/sec
0 cycles # 0.000 GHz
692586696913 instructions
138097405127 branches # 853.812 M/sec
39010267 branch-misses # 0.03% of all branches
291832840178 L1-dcache-loads # 1804.308 M/sec
120239626 L1-dcache-load-misses # 0.04% of all L1-dcache hits
We would see that it encounters a lot fewer cache misses (120,239,626 ~ 120 million) compared to the custom approach. Therefore, the high number of cache misses might be the culprit for such a difference in performance.
Let’s dig even deeper into the internal representation of LongAdder to find the actual culprit.
6. Dynamic Striping Revisited
The java.util.concurrent.atomic.LongAdder is an atomic counter implementation with high throughput. Instead of just using one counter, it uses an array of them to distribute the memory contention between them. This way, it will outperform simple atomics such as AtomicLong in highly contended applications.
The Striped64 class is responsible for this distribution of memory contention, and this is how it implements that array of counters:
@jdk.internal.vm.annotation.Contended
static final class Cell {
    volatile long value;
    // omitted
}

transient volatile Cell[] cells;
Each Cell encapsulates the details for each counter. This implementation makes it possible for different threads to update different memory locations. Since we’re using an array (that is, stripes) of states, this idea is called dynamic striping. Interestingly, Striped64 is named after this idea and the fact that it works on 64-bit data types.
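To make the idea concrete, here's a heavily simplified striped counter of our own (the real Striped64 also lazily grows the array, hashes with per-thread probes, and retries on contention; none of that is shown here):
import java.util.concurrent.atomic.AtomicLong;

public class NaiveStripedCounter {

    private final AtomicLong[] cells;

    public NaiveStripedCounter(int stripes) {
        cells = new AtomicLong[stripes];
        for (int i = 0; i < stripes; i++) {
            cells[i] = new AtomicLong();
        }
    }

    public void increment() {
        // map the current thread to one of the stripes so that different
        // threads usually contend on different memory locations
        int index = (int) (Thread.currentThread().getId() % cells.length);
        cells[index].incrementAndGet();
    }

    public long sum() {
        long total = 0;
        for (AtomicLong cell : cells) {
            total += cell.get();
        }
        return total;
    }
}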
Anyway, the JVM may allocate those counters near each other in the heap. That is, a few of those counters will end up in the same cache line. Therefore, updating one counter may invalidate the cache for nearby counters.
The key takeaway here is that a naive implementation of dynamic striping will suffer from false sharing. However, by adding enough padding around each counter, we can make sure that each of them resides on its own cache line, thus preventing false sharing.
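Before the annotation existed, such padding had to be written by hand. A rough sketch of that old-school approach (the field names are arbitrary, and the exact amount of padding depends on the cache line size):
static final class PaddedCell {
    long p0, p1, p2, p3, p4, p5, p6; // padding before the hot field
    volatile long value;
    long q0, q1, q2, q3, q4, q5, q6; // padding after the hot field
}
This is fragile, though: the JVM is free to reorder fields, so hand-rolled padding isn't guaranteed to end up where we expect, which is exactly the problem a dedicated annotation solves.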
As it turns out, the @jdk.internal.vm.annotation.Contended annotation is responsible for adding this padding.
The only question is, why didn’t this annotation work in the copy-pasted implementation?
7. Meet @Contended
Java 8 introduced the sun.misc.Contended annotation (Java 9 repackaged it under the jdk.internal.vm.annotation package) to prevent false sharing.
Basically, when we annotate a field with this annotation, the HotSpot JVM will add some padding around the annotated field. This way, it can make sure that the field resides on its own cache line. Moreover, if we annotate a whole class with this annotation, the HotSpot JVM will add the same padding before all the fields.
The @Contended annotation is meant to be used internally by the JDK itself. So by default, it doesn’t affect the memory layout of non-internal objects. That’s the reason why our copy-pasted adder doesn’t perform as well as the built-in one.
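If we want to see this with our own eyes, the OpenJDK JOL tool can print the actual field layout (an optional experiment; it assumes the jol-core library is on the classpath, and the package name below is a hypothetical placeholder for wherever our copied classes live):
import org.openjdk.jol.info.ClassLayout;

public class LayoutInspector {

    public static void main(String[] args) throws ClassNotFoundException {
        // the nested Cell class is where @Contended applies; it's
        // package-private, so we look it up reflectively
        Class<?> cell = Class.forName("com.baeldung.falsesharing.Striped64$Cell");

        // without -XX:-RestrictContended, the layout should be plain; with
        // the flag, the padding should appear as gaps around the value field
        System.out.println(ClassLayout.parseClass(cell).toPrintable());
    }
}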
To remove this internal-only restriction, we can use the -XX:-RestrictContended tuning flag when rerunning the benchmark:
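For instance, an invocation along these lines should do the trick (as far as we know, JMH forwards the parent JVM's options to its forked JVMs by default, so the flag reaches the measurement forks):
java -XX:-RestrictContended -jar benchmarks.jar -f 2 -t 16 --bm thrpt
With that flag in place, the results look quite different: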
Benchmark Mode Cnt Score Error Units
FalseSharing.builtin thrpt 40 541148225.959 ± 18336783.899 ops/s
FalseSharing.custom thrpt 40 546022431.969 ± 16406252.364 ops/s
As shown above, now the benchmark results are much closer, and the remaining difference is probably just a bit of noise.
7.1. Padding Size
By default, the @Contended annotation adds 128 bytes of padding. That’s mainly because the cache line size in many modern processors is around 64/128 bytes.
This value, however, is configurable through the -XX:ContendedPaddingWidth tuning flag. As of this writing, this flag only accepts values between 0 and 8192.
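For example, on a processor with 64-byte cache lines, we might try halving the default padding (an illustrative invocation, following the same pattern as before):
java -XX:-RestrictContended -XX:ContendedPaddingWidth=64 -jar benchmarks.jar -f 2 -t 16 --bm thrpt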
7.2. Disabling the @Contended
It’s also possible to disable the @Contended effect via the -XX:-EnableContended tuning flag. This may prove useful when memory is at a premium and we can afford to lose a bit (and sometimes a lot) of performance.
7.3. Use Cases
Since its first release, the @Contended annotation has been used quite extensively to prevent false sharing in the JDK’s internal data structures. Here are a few notable examples of such implementations:
- The Striped64 class to implement counters and accumulators with high throughput
- The Thread class to facilitate the implementation of efficient random number generators
- The ForkJoinPool work-stealing queue
- The ConcurrentHashMap implementation
- The dual data structure used in the Exchanger class
8. Conclusion
In this article, we saw how sometimes false sharing might cause counterproductive effects on the performance of multithreaded applications.
To make matters more concrete, we did benchmark the LongAdder implementation in Java against its copy and used its results as a starting point for our performance investigations.
Also, we used the perf tool to gather some stats about the performance metrics of a running application on Linux. To see more examples of perf, it’s highly recommended to read Brendan Gregg‘s blog. Moreover, eBPF, available as of Linux Kernel version 4.4, can also be useful in many tracing and profiling scenarios.
As usual, all the examples are available over on GitHub.