Memory Barriers/Fences

In this article I'll discuss one of the most fundamental techniques in concurrent programming, known as memory barriers, or fences, which make the memory state within a processor visible to other processors.

CPUs have employed many techniques to try and accommodate the fact that CPU execution-unit performance has greatly outpaced main-memory performance.  In my “Write Combining” article I touched on just one of these techniques.  The most common technique employed by CPUs to hide memory latency is to pipeline instructions and then spend significant effort, and resources, trying to re-order these pipelines to minimise stalls related to cache misses.

When a program is executed it does not matter if its instructions are re-ordered provided the same end result is achieved.  For example, within a loop it does not matter when the loop counter is updated if no operation within the loop uses it.  The compiler and CPU are free to re-order the instructions to best utilise the CPU, provided the counter is updated by the time the next iteration is about to commence.  Also, over the execution of a loop this variable may be held in a register and never pushed out to cache or main memory, and thus never be visible to another CPU.
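A minimal sketch of this register effect in Java (an illustrative example, not code from the Disruptor): because running is not volatile, the JIT is free to hoist the read out of the loop into a register, so the worker thread may spin forever without ever observing the update made by the main thread.

public class RegisterHoisting
{
    private static boolean running = true; // deliberately NOT volatile

    public static void main(final String[] args) throws InterruptedException
    {
        final Thread worker = new Thread(() ->
        {
            long iterations = 0;
            while (running) // may be compiled to while (true) once hoisted
            {
                iterations++;
            }
            System.out.println("stopped after " + iterations + " iterations");
        });
        worker.setDaemon(true);
        worker.start();

        Thread.sleep(1000);
        running = false; // this store may never become visible to the worker
        worker.join(2000);
        System.out.println(worker.isAlive()
            ? "worker never saw the update"
            : "worker observed the update");
    }
}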

CPU cores contain multiple execution units.  For example, a modern Intel core contains six execution units which can do some combination of arithmetic, conditional logic, and memory manipulation.  These execution units operate in parallel, allowing instructions to be executed in parallel, which introduces another level of non-determinism in program order as observed from another CPU.

Finally, when a cache miss occurs, a modern CPU can speculate on the result of a memory load and continue executing based on this assumption until the load returns the actual data.

Provided “program order” is preserved, the CPU and compiler are free to do whatever they see fit to improve performance.

Loads and stores to the caches and main memory are buffered and re-ordered using the load, store, and write-combining buffers.  These buffers are associative queues that allow fast lookup.  This lookup is necessary when a later load needs to read the value of a previous store that has not yet reached the cache.  Figure 1 depicts a simplified view of a modern multi-core CPU: the execution units use the local registers and buffers to manage memory while it is in flight to and from the cache sub-system.
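The store buffers in particular produce an effect that can be demonstrated from plain Java.  The following litmus-test sketch is my illustration, not code from the article: each thread stores to its own flag and then loads the other thread's flag.  With no barriers, both stores can still be sitting in the store buffers (or be re-ordered by the JIT) when the loads execute, so the seemingly impossible outcome r1 == 0 && r2 == 0 can occasionally be witnessed; a harness such as jcstress provokes it far more reliably.

public class StoreBufferLitmus
{
    static int x, y, r1, r2;

    public static void main(final String[] args) throws InterruptedException
    {
        int witnessed = 0;
        for (int i = 0; i < 100_000; i++)
        {
            x = 0; y = 0;

            final Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            final Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start(); t2.start();
            t1.join(); t2.join();

            if (r1 == 0 && r2 == 0)
            {
                witnessed++; // neither store was visible to the other load
            }
        }
        System.out.println("(0, 0) witnessed " + witnessed + " times");
    }
}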

In a multi-threaded environment techniques need to be employed to make program results visible in a timely manner.  I will not cover cache coherence in this article; just assume that once memory has been pushed to the cache, a protocol of messages will ensure all caches are coherent for any shared data.  The techniques for making memory visible from a processor core are known as memory barriers or fences.

Memory barriers provide two properties.  Firstly, they preserve externally visible program order by ensuring that all instructions on either side of the barrier appear in the correct program order when observed from another CPU; secondly, they make the memory visible by ensuring the data is propagated to the cache sub-system.
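Both properties can be seen in the standard safe-publication idiom in Java (an illustrative sketch, not code from the article): the volatile write acts as a store barrier, so the payload store cannot drift past it, and the volatile read acts as a load barrier, so a reader that sees the flag set also sees the payload.

public final class MessagePassing
{
    private int payload;            // plain field
    private volatile boolean ready; // the volatile flag carries the barriers

    public void publish(final int value)
    {
        payload = value; // ordinary store; cannot be re-ordered past the next line
        ready = true;    // volatile store: store barrier
    }

    public Integer tryConsume()
    {
        if (ready) // volatile load: load barrier before subsequent loads
        {
            return payload; // guaranteed to observe the published value
        }

        return null;
    }
}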

Memory barriers are a complex subject and they are implemented very differently across CPU architectures.  At one end of the spectrum is the relatively strong memory model of Intel x86 CPUs, which is far simpler than, say, the weak and complex memory model of a DEC Alpha with its partitioned caches in addition to cache layers.  Since x86 CPUs are the most common for multi-threaded programming, I'll try to simplify to this level.

Store Barrier

A store barrier, the “sfence” instruction on x86, forces all store instructions prior to the barrier to happen before the barrier and have the store buffers flushed to cache for the CPU on which it is issued.  This makes the program state visible to other CPUs so they can act on it if necessary.  A good example of this in action is the following simplified code from the BatchEventProcessor in the Disruptor.  When the sequence is updated, other consumers and producers know how far this consumer has progressed and can take appropriate action.  All updates to memory that happened before the barrier are now visible.

private volatile long sequence = RingBuffer.INITIAL_CURSOR_VALUE;

// from inside the run() method

T event = null;
long nextSequence = sequence + 1L;
while (running)
{
    try
    {
        final long availableSequence = barrier.waitFor(nextSequence);

        while (nextSequence <= availableSequence)
        {
            event = ringBuffer.get(nextSequence);
            boolean endOfBatch = nextSequence == availableSequence;
            eventHandler.onEvent(event, nextSequence, endOfBatch);
            nextSequence++;
        }

        sequence = nextSequence - 1L;
        // store barrier inserted here !!!
    }
    catch (final Exception ex)
    {
        exceptionHandler.handle(ex, nextSequence, event);
        sequence = nextSequence;
        // store barrier inserted here !!!
        nextSequence++;
    }
}

Load Barrier

A load barrier, the “lfence” instruction on x86, forces all load instructions after the barrier to happen after the barrier and then waits on the load buffer to drain for that CPU.  This makes program state exposed from other CPUs visible to this CPU before further progress is made.  A good example of this is when the BatchEventProcessor sequence referenced above is read by producers, or consumers, in the corresponding barriers of the Disruptor.
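A hedged sketch of that reading side, with hypothetical names rather than the real Disruptor API: the volatile read of the consumer's sequence issues a load barrier first, so the producer observes the consumer's latest published progress before deciding whether a ring buffer slot can be reused.

final class ProducerSide
{
    private static final int BUFFER_SIZE = 1024;

    private volatile long consumerSequence = -1L; // written by the consumer thread
    private long nextProducerSequence = 0L;

    boolean hasAvailableCapacity()
    {
        final long consumed = consumerSequence; // volatile load: load barrier
        return nextProducerSequence - consumed <= BUFFER_SIZE;
    }
}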

Full Barrier

A full barrier, the “mfence” instruction on x86, is a composite of both load and store barriers happening on a CPU.
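Since Java 9 all three flavours are exposed directly as explicit fences on java.lang.invoke.VarHandle.  A minimal sketch follows; the mapping in the comments is my rough correspondence to the x86 instructions above, not an official one.

import java.lang.invoke.VarHandle;

public class ExplicitFences
{
    static int data;
    static int flag;

    static void writer()
    {
        data = 42;
        VarHandle.releaseFence(); // prior loads/stores ordered before later stores: store-barrier flavour
        flag = 1;
        VarHandle.fullFence();    // orders everything: roughly an mfence
    }

    static void reader()
    {
        final int observedFlag = flag;
        VarHandle.acquireFence(); // prior loads ordered before later loads/stores: load-barrier flavour
        final int observedData = data; // observes 42 whenever observedFlag == 1
        System.out.println(observedFlag + " " + observedData);
    }
}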

Java Memory Model

In the Java Memory Model a volatile field has a store barrier inserted after a write to it and a load barrier inserted before a read of it.  Qualified final fields of a class have a store barrier inserted after their initialisation to ensure these fields are visible once the constructor completes and a reference to the object becomes available.
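A minimal sketch of the final field guarantee (an illustrative class, not from the article): the conceptual store barrier at the end of the constructor ensures that any thread which obtains a reference to the object also sees its final fields fully initialised.

public final class Config
{
    private final String host;
    private final int port;

    public Config(final String host, final int port)
    {
        this.host = host;
        this.port = port;
        // conceptual store barrier here, before the reference can escape
    }

    public String address()
    {
        return host + ":" + port;
    }
}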

Atomic Instructions and Software Locks

Atomic instructions, such as the “lock ...” prefixed instructions on x86, are effectively full barriers in that they lock the memory sub-system to perform an operation and have a guaranteed total order, even across CPUs.  Software locks usually employ memory barriers, or atomic instructions, to achieve visibility and preserve program order.
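A minimal sketch in Java; note that the x86 mapping is a HotSpot implementation detail, so treat it as an assumption rather than a guarantee of the API.  AtomicLong.getAndIncrement() typically compiles to a lock xadd, an atomic read-modify-write that also acts as a full barrier and gives every thread the same total order over the counter.

import java.util.concurrent.atomic.AtomicLong;

public class LockedCounter
{
    private final AtomicLong counter = new AtomicLong();

    public long next()
    {
        return counter.getAndIncrement(); // lock-prefixed read-modify-write: effectively a full barrier
    }
}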

Performance Impact of Memory Barriers

Memory barriers prevent a CPU from employing many of the techniques it uses to hide memory latency, and therefore they have a significant performance cost which must be considered.  To achieve maximum performance it is best to model the problem so the processor can do whole units of work, then have all the necessary memory barriers occur on the boundaries of these work units.  Taking this approach allows the processor to optimise the units of work without restriction.  There is an advantage to grouping the necessary memory barriers in that buffers flushed after the first one will be less costly because no work will be under way to refill them.
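A minimal sketch of that batching advice, with hypothetical names: do the unit of work on plain fields, then pay for a single volatile write, and hence a single store barrier, on the batch boundary rather than once per item.  This is the same shape as the BatchEventProcessor loop above, which publishes its sequence once per batch.

final class BatchedPublisher
{
    private long workingCount;       // plain field: free to live in a register
    private volatile long published; // one store barrier per batch

    void process(final long[] batch)
    {
        for (final long item : batch)
        {
            workingCount += item; // no barriers inside the unit of work
        }

        published = workingCount; // single store barrier at the batch boundary
    }
}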
