An Introduction to Lock-Free Programming

original url: http://preshing.com/20120612/an-introduction-to-lock-free-programming/

What Is It?

People often describe lock-free programming as programming without mutexes, which are also referred to as locks. That’s true, but it’s only part of the story. The generally accepted definition, based on academic literature, is a bit broader. At its essence, lock-free is a property used to describe some code, without saying too much about how that code was actually written.

Basically, if some part of your program satisfies the following conditions, then that part can rightfully be considered lock-free: you are programming with multiple threads (or interrupts, signal handlers, and so on); those threads access shared memory; and the threads cannot block each other – there is no way to schedule them that locks the whole system up indefinitely. Conversely, if a given part of your code doesn’t satisfy these conditions, then that part is not lock-free.

In this sense, the lock in lock-free does not refer directly to mutexes, but rather to the possibility of “locking up” the entire application in some way, whether it’s deadlock, livelock – or even due to hypothetical thread scheduling decisions made by your worst enemy. That last point sounds funny, but it’s key. Shared mutexes are ruled out trivially, because as soon as one thread obtains the mutex, your worst enemy could simply never schedule that thread again. Of course, real operating systems don’t work that way – we’re merely defining terms.

Here’s a simple example of an operation which contains no mutexes, but is still not lock-free. Initially, X = 0. As an exercise for the reader, consider how two threads could be scheduled in a way such that neither thread exits the loop.

while (X == 0)
{
    X = 1 - X;
}

Nobody expects a large application to be entirely lock-free. Typically, we identify a specific set of lock-free operations out of the whole codebase. For example, in a lock-free queue, there might be a handful of lock-free operations such as push, pop, perhaps isEmpty, and so on.

Herlihy & Shavit, authors of The Art of Multiprocessor Programming, tend to express such operations as class methods, and offer the following succinct definition of lock-free (see slide 150): “In an infinite execution, infinitely often some method call finishes.” In other words, as long as the program is able to keep calling those lock-free operations, the number of completed calls keeps increasing, no matter what. It is algorithmically impossible for the system to lock up during those operations.

One important consequence of lock-free programming is that if you suspend a single thread, it will never prevent other threads from making progress, as a group, through their own lock-free operations. This hints at the value of lock-free programming when writing interrupt handlers and real-time systems, where certain tasks must complete within a certain time limit, no matter what state the rest of the program is in.

One final clarification: operations that are designed to block do not disqualify the algorithm. For example, a queue’s pop operation may intentionally block when the queue is empty. The remaining codepaths can still be considered lock-free.

Lock-Free Programming Techniques

It turns out that when you attempt to satisfy the non-blocking condition of lock-free programming, a whole family of techniques falls out: atomic operations, memory barriers, avoiding the ABA problem, to name a few. This is where things quickly become diabolical.

So how do these techniques relate to one another? To illustrate, I’ve put together the following flowchart. I’ll elaborate on each one below.

[Flowchart: a decision map relating the techniques – atomic read-modify-write operations, with compare-and-swap loops when there are multiple writers, on one branch; and memory ordering concerns (sequential consistency, acquire and release semantics, memory fences) when targeting multicore on the other.]

Atomic Read-Modify-Write Operations

Atomic operations are ones which manipulate memory in a way that appears indivisible: No thread can observe the operation half-complete. On modern processors, lots of operations are already atomic. For example, aligned reads and writes of simple types are usually atomic.
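To make that concrete, here’s a minimal C++11 sketch (the names are just illustrative): the hardware will often perform an aligned store of an int atomically anyway, but wrapping the variable in std::atomic makes the guarantee explicit and portable.

#include <atomic>

std::atomic<int> g_SharedValue(0);

void writer()
{
    // An indivisible write: no thread can observe it half-complete.
    g_SharedValue.store(123);
}

void reader()
{
    // An indivisible read: sees either the old value or the new one, never a mix.
    int value = g_SharedValue.load();
    (void) value;
}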

Read-modify-write (RMW) operations go a step further, allowing you to perform more complex transactions atomically. They’re especially useful when a lock-free algorithm must support multiple writers, because when multiple threads attempt an RMW on the same address, they’ll effectively line up in a row and execute those operations one-at-a-time. I’ve already touched upon RMW operations in this blog, such as when implementing a lightweight mutex, a recursive mutex and a lightweight logging system.

Examples of RMW operations include _InterlockedIncrement on Win32, OSAtomicAdd32 on iOS, and std::atomic<int>::fetch_add in C++11. Be aware that the C++11 atomic standard does not guarantee that the implementation will be lock-free on every platform, so it’s best to know the capabilities of your platform and toolchain. You can call std::atomic<>::is_lock_free to make sure.
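For instance, here’s a minimal C++11 sketch of a shared counter (the g_Count name is just illustrative); the runtime check simply illustrates the point above.

#include <atomic>
#include <cassert>

std::atomic<int> g_Count(0);

void incrementCount()
{
    // An atomic RMW: concurrent increments line up and none are lost.
    g_Count.fetch_add(1);
}

void checkPlatform()
{
    // The C++11 standard permits a lock-based fallback, so verify at runtime.
    assert(g_Count.is_lock_free());
}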

Different CPU families support RMW in different ways. Processors such as PowerPC and ARM expose load-link/store-conditional instructions, which effectively allow you to implement your own RMW primitive at a low level, though this is not often done. The common RMW operations are usually sufficient.
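In portable C++11 terms, the closest thing to rolling your own RMW primitive is a compare_exchange_weak loop, which on LL/SC architectures typically compiles down to a load-link/store-conditional pair (and is allowed to fail spuriously for exactly that reason). As a rough sketch, here’s an atomic multiply – an RMW operation the standard doesn’t provide directly:

#include <atomic>

// A sketch of a hand-rolled RMW primitive: atomically multiply a shared
// value, returning the new value.
int atomicMultiply(std::atomic<int>& shared, int multiplier)
{
    int oldValue = shared.load();
    // compare_exchange_weak may fail spuriously on LL/SC machines, which is
    // harmless inside a retry loop; on failure it reloads oldValue for us.
    while (!shared.compare_exchange_weak(oldValue, oldValue * multiplier))
    {
    }
    return oldValue * multiplier;
}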

As illustrated by the flowchart, atomic RMWs are a necessary part of lock-free programming even on single-processor systems. Without atomicity, a thread could be interrupted halfway through the transaction, possibly leading to an inconsistent state.

Compare-And-Swap Loops

Perhaps the most often-discussed RMW operation is compare-and-swap (CAS). On Win32, CAS is provided via a family of intrinsics such as _InterlockedCompareExchange. Often, programmers perform compare-and-swap in a loop to repeatedly attempt a transaction. This pattern typically involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes using CAS:

void LockFreeQueue::push(Node* newHead)
{
    for (;;)
    {
        // Copy a shared variable (m_Head) to a local.
        Node* oldHead = m_Head;

        // Do some speculative work, not yet visible to other threads.
        newHead->next = oldHead;

        // Next, attempt to publish our changes to the shared variable.
        // If the shared variable hasn't changed, the CAS succeeds and we return.
        // Otherwise, repeat. (The pointer variant of the intrinsic is needed
        // here, since m_Head is a Node*.)
        if (_InterlockedCompareExchangePointer((void* volatile*) &m_Head, newHead, oldHead) == oldHead)
            return;
    }
}
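For comparison, here’s a rough sketch of the same push written against C++11 atomics (as free functions, for brevity), which is portable beyond Win32. compare_exchange_weak reloads oldHead with the current head whenever it fails, so the loop doesn’t need to re-read m_Head explicitly:

#include <atomic>

struct Node
{
    Node* next;
};

std::atomic<Node*> m_Head(nullptr);

void push(Node* newHead)
{
    Node* oldHead = m_Head.load();
    do
    {
        // Speculative work, not yet visible to other threads.
        newHead->next = oldHead;
    }
    while (!m_Head.compare_exchange_weak(oldHead, newHead));
}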

Such loops still qualify as lock-free, because if the test fails for one thread, it means it must have succeeded for another – though some architectures offer a weaker variant of CAS where that’s not necessarily true. Whenever implementing a CAS loop, special care must be taken to avoid the ABA problem, where the shared variable changes from A to B and back to A between the initial read and the CAS: the CAS then succeeds even though other state associated with the variable may have changed in the meantime.

Sequential Consistency

Sequential consistency means that all threads agree on the order in which memory operations occurred, and that order is consistent with the order of operations in the program source code. Under sequential consistency, it’s impossible to experience memory reordering shenanigans like the one I demonstrated in a previous post.

A simple (but obviously impractical) way to achieve sequential consistency is to disable compiler optimizations and force all your threads to run on a single processor. A processor never sees its own memory effects out of order, even when threads are pre-empted and scheduled at arbitrary times.

Some programming languages offer sequential consistency even for optimized code running in a multiprocessor environment. In C++11, you can declare all shared variables as C++11 atomic types with default memory ordering constraints. In Java, you can mark all shared variables as volatile. Here’s the example from my previous post, rewritten in C++11 style:

std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1()
{
    X.store(1);
    r1 = Y.load();
}

void thread2()
{
    Y.store(1);
    r2 = X.load();
}

Because the C++11 atomic types guarantee sequential consistency, the outcome r1 = r2 = 0 is impossible. To achieve this, the compiler outputs additional instructions behind the scenes – typically memory fences and/or RMW operations. Those additional instructions may make the implementation less efficient compared to one where the programmer has dealt with memory ordering directly.

Memory Ordering

As the flowchart suggests, any time you do lock-free programming for multicore (or any symmetric multiprocessor), and your environment does not guarantee sequential consistency, you must consider how to prevent memory reordering.

On today’s architectures, the tools to enforce correct memory ordering generally fall into three categories, which prevent both compiler reordering and processor reordering:

  • A lightweight sync or fence instruction, which I’ll talk about in future posts;
  • A full memory fence instruction, which I’ve demonstrated previously;
  • Memory operations which provide acquire or release semantics.

Acquire semantics prevent the memory reordering of an operation with any read or write that follows it in program order, and release semantics prevent reordering with any read or write that precedes it. These semantics are particularly suitable when there’s a producer/consumer relationship, where one thread publishes some information and the other reads it; a short sketch follows below. I’ll also talk about this more in a future post.
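As a minimal sketch of that pattern (the g_Payload and g_Ready names are just illustrative): the producer writes a payload and then publishes a flag with release semantics; once the consumer observes the flag with acquire semantics, it is guaranteed to see the payload.

#include <atomic>

int g_Payload = 0;                 // ordinary, non-atomic data
std::atomic<int> g_Ready(0);

void producer()
{
    g_Payload = 42;                               // write the data
    g_Ready.store(1, std::memory_order_release);  // then publish it
}

void consumer()
{
    if (g_Ready.load(std::memory_order_acquire))  // observe the flag
    {
        // The acquire/release pair guarantees we see g_Payload == 42 here.
        int value = g_Payload;
        (void) value;
    }
}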

Different Processors Have Different Memory Models

Different CPU families have different habits when it comes to memory reordering. The rules are documented by each CPU vendor and followed strictly by the hardware. For instance, PowerPC and ARM processors can change the order of memory stores relative to the instructions themselves, but normally, the x86/64 family of processors from Intel and AMD do not. We say the former processors have a more relaxed memory model.

There’s a temptation to abstract away such platform-specific details, especially with C++11 offering us a standard way to write portable lock-free code. But currently, I think most lock-free programmers have at least some appreciation of platform differences. If there’s one key difference to remember, it’s that at the x86/64 instruction level, every load from memory comes with acquire semantics, and every store to memory provides release semantics – at least for non-SSE instructions and non-write-combined memory. As a result, it’s been common in the past to write lock-free code which works on x86/64, but fails on other processors.

If you’re interested in the hardware details of how and why processors perform memory reordering, I’d recommend Appendix C of Is Parallel Programming Hard, And, If So, What Can You Do About It? In any case, keep in mind that memory reordering can also occur due to compiler reordering of instructions.

In this post, I haven’t said much about the practical side of lock-free programming, such as: When do we do it? How much do we really need? I also haven’t mentioned the importance of validating your lock-free algorithms. Nonetheless, I hope that for some readers, this introduction has provided a basic familiarity with lock-free concepts, so you can proceed into further reading without feeling too bewildered. As usual, if you spot any inaccuracies, let me know in the comments.