[Repost] A Fast General Purpose Lock-Free Queue for C++

From: http://moodycamel.com/blog/2014/a-fast-general-purpose-lock-free-queue-for-c++

So I've been bitten by the lock-free bug! After finishing my single-producer, single-consumer lock-free queue, I decided to design and implement a more general multi-producer, multi-consumer queue and see how it stacked up against existing lock-free C++ queues. After over a year of spare-time development and testing, it's finally time for a public release. TL;DR: You can grab the C++11 implementation from GitHub (or jump to the benchmarks).

The way the queue works is interesting, I think, so that's what this blog post is about. A much more detailed and complete (but also more dry) description is available in a sister blog post, by the way.

Sharing data: Oh, the woes

At first glance, a general purpose lock-free queue seems fairly easy to implement. It isn't. The root of the problem is that the same variables necessarily need to be shared with several threads. For example, take a common linked-list based approach: At a minimum, the head and tail of the list need to be shared, because consumers all need to be able to read and update the head, and the producers all need to be able to update the tail.

This doesn't sound too bad so far, but the real problems arise when a thread needs to update more than one variable to keep the queue in a consistent state -- atomicity is only ensured for single variables, and atomicity for compound variables (structs) is almost certainly going to result in some sort of lock (on most platforms, depending on the size of the variable). For example, what if a consumer read the last item from the queue and updated only the head? The tail should not still point to it, because the object will soon be freed! But the consumer could be interrupted by the OS and suspended for a few milliseconds before it updates the tail, and during that time the tail could be updated by another thread, and then it becomes too late for the first thread to set it to null.
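
To make the compound-update problem concrete, here is a tiny check (not from the queue's code; the `Ends` struct is purely illustrative) showing that an atomic holding both the head and the tail pointer is generally not lock-free:

```cpp
#include <atomic>
#include <cstdio>

// Purely illustrative: wrap both shared pointers in one atomic struct and ask
// whether the platform can update it without a lock. On 64-bit targets this
// needs a 16-byte CAS (e.g. cmpxchg16b); many toolchains fall back to a lock
// (and GCC may require linking with -latomic).
struct Node;
struct Ends {
    Node* head;
    Node* tail;
};

int main() {
    std::atomic<Ends> ends(Ends{nullptr, nullptr});
    std::printf("atomic<Ends> lock-free? %d\n", static_cast<int>(ends.is_lock_free()));
}
```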

The solutions to this fundamental problem of shared data are the crux of lock-free programming. Often the best way is to conceive of an algorithm that doesn't need to update multiple variables to maintain consistency in the first place, or one where incremental updates still leave the data structure in a consistent state. Various tricks can be used, such as never freeing memory once allocated (this helps with reads from threads that aren't up to date), storing extra state in the last two bits of a pointer (this works with 4-byte aligned pointers), and reference counting pointers. But tricks like these only go so far; the real effort goes into developing the algorithms themselves.
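
As a concrete illustration of the spare-low-bits trick, here is a minimal sketch (it assumes at-least-4-byte alignment; none of these names come from the queue's source):

```cpp
#include <cassert>
#include <cstdint>

// With 4-byte (or stricter) alignment, the low two bits of a pointer are always
// zero, so they can carry a small amount of extra state alongside the pointer.
template <typename T>
std::uintptr_t pack(T* p, unsigned tag) {          // tag must fit in two bits
    std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
    assert((bits & 0x3) == 0 && tag <= 0x3);
    return bits | tag;
}

template <typename T>
T* pointer_of(std::uintptr_t v) { return reinterpret_cast<T*>(v & ~std::uintptr_t(0x3)); }

inline unsigned tag_of(std::uintptr_t v) { return static_cast<unsigned>(v & 0x3); }
```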

My queue

The less threads fight over the same data, the better. So, instead of using a single data structure that linearizes all operations, a set of sub-queues is used -- one for each producer thread. This means that different threads can enqueue items completely in parallel, independently of each other.

Of course, this also makes dequeueing slightly more complicated: Now we have to check every sub-queue for items when dequeuing. Interestingly, it turns out that the order that elements are pulled from the sub-queues really doesn't matter. All elements from a given producer thread will necessarily still be seen in that same order relative to each other when dequeued (since the sub-queue preserves that order), albeit with elements from other sub-queues possibly interleaved. Interleaving elements is OK because even in a traditional single-queue model, the order that elements get put in from different producer threads is non-deterministic anyway (because there's a race condition between the different producers). [Edit: This is only true if the producers are independent, which isn't necessarily the case. See the comments.]

The only downside to this approach is that if the queue is empty, every single sub-queue has to be checked in order to determine this (also, by the time one sub-queue is checked, a previously empty one could have become non-empty -- but in practice this doesn't cause problems). However, in the non-empty case, there is much less contention overall because sub-queues can be "paired up" with consumers. This reduces data sharing to the near-optimal level (where every consumer is matched with exactly one producer), without losing the ability to handle the general case. This pairing is done using a heuristic that takes into account the last sub-queue a consumer successfully pulled from (essentially, it gives consumers an affinity). Of course, in order to do this pairing, some state has to be maintained between calls to dequeue -- this is done using consumer-specific "tokens" that the user is in charge of allocating. Note that tokens are completely optional -- without one, the queue merely reverts to searching every sub-queue for an element, which is correct, just slightly slower when many threads are involved.
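
Here is a rough sketch of that high-level layout and of token-assisted dequeueing. It is illustrative only -- the names, the plain structs, and the simplified memory ordering are mine, not the library's:

```cpp
#include <atomic>

template <typename T>
struct SubQueue {               // block-based SPMC sub-queue, sketched in the next section
    bool try_dequeue(T& out);   // implemented with atomic index increments
};

template <typename T>
struct MpmcQueue {
    // The real queue keeps producers in a lock-free linked list so that new
    // producer threads can register concurrently; only the shape matters here.
    struct Producer {
        SubQueue<T> queue;
        std::atomic<Producer*> next{nullptr};
    };
    std::atomic<Producer*> producers{nullptr};   // head of the producer list

    struct ConsumerToken {
        Producer* last_hit = nullptr;            // the consumer's current affinity
    };

    bool try_dequeue(ConsumerToken& tok, T& out) {
        // Fast path: retry the sub-queue that worked last time.
        if (tok.last_hit != nullptr && tok.last_hit->queue.try_dequeue(out))
            return true;
        // Slow path: scan every sub-queue, remembering whichever one succeeds.
        for (Producer* p = producers.load(std::memory_order_acquire); p != nullptr;
             p = p->next.load(std::memory_order_acquire)) {
            if (p->queue.try_dequeue(out)) {
                tok.last_hit = p;
                return true;
            }
        }
        return false;  // every sub-queue looked empty at the moment it was checked
    }
};
```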

So, that's the high-level design. What about the core algorithm used within each sub-queue? Well, instead of being based on a linked list of nodes (which implies constantly allocating and freeing or re-using elements, and typically relies on a compare-and-swap loop which can be slow under heavy contention), I based my queue on an array model. Instead of linking individual elements, I have a "block" of several elements. The logical head and tail indices of the queue are represented using atomically-incremented integers. Between these logical indices and the blocks lies a scheme for mapping each index to its block and sub-index within that block. An enqueue operation simply increments the tail (remember that there's only one producer thread for each sub-queue). A dequeue operation increments the head if it sees that the head is less than the tail, and then it checks to see if it accidentally incremented the head past the tail (this can happen under contention -- there are multiple consumer threads per sub-queue). If it did over-increment the head, a correction counter is incremented (making the queue eventually consistent); if not, it goes ahead and increments another integer which gives it the actual final logical index. The increment of this final index always yields a valid index into the actual queue, regardless of what other threads are doing or have done; this works because the final index is only ever incremented when there's guaranteed to be at least one element to dequeue (which was checked when the first index was incremented).
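
Here is a simplified sketch of that index dance for a single sub-queue. The block mapping, element storage, and memory reclamation are all omitted, and the names and memory orderings are illustrative rather than copied from the implementation:

```cpp
#include <atomic>
#include <cstdint>

struct SubQueueIndices {
    std::atomic<std::uint64_t> tail{0};           // bumped only by the owning producer
    std::atomic<std::uint64_t> head{0};           // optimistically bumped by consumers
    std::atomic<std::uint64_t> overcommit{0};     // corrections for heads bumped past the tail
    std::atomic<std::uint64_t> dequeue_index{0};  // yields the actual final logical index

    // Producer side: after writing the element into its block, publish it.
    void publish_one() { tail.fetch_add(1, std::memory_order_release); }

    // Consumer side: returns true and sets `slot` to the logical index to read,
    // or false if the sub-queue appeared empty (or the claim over-incremented).
    bool claim_slot(std::uint64_t& slot) {
        std::uint64_t overcommit_now = overcommit.load(std::memory_order_acquire);
        if (head.load(std::memory_order_relaxed) - overcommit_now >=
            tail.load(std::memory_order_acquire))
            return false;                                        // looks empty
        // Optimistically claim a slot; several consumers may race past the tail here.
        std::uint64_t my_claim = head.fetch_add(1, std::memory_order_acq_rel);
        if (my_claim - overcommit_now >= tail.load(std::memory_order_acquire)) {
            // Over-incremented: record a correction so the counts stay eventually consistent.
            overcommit.fetch_add(1, std::memory_order_release);
            return false;
        }
        // At least one element is guaranteed now, so this always yields a valid index.
        slot = dequeue_index.fetch_add(1, std::memory_order_acq_rel);
        return true;
    }
};
```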

So there you have it. An enqueue operation is done with a single atomic increment, and a dequeue is done with two atomic increments in the fast path, and one extra otherwise. (Of course, this is discounting all the block allocation/re-use/reference counting/block mapping goop, which, while important, is not very interesting -- in any case, most of those costs are amortized over an entire block's worth of elements.) The really interesting part of this design is that it allows extremely efficient bulk operations -- in terms of atomic instructions (which tend to be a bottleneck), enqueueing X items in a block has exactly the same amount of overhead as enqueueing a single item (ditto for dequeueing), provided they're in the same block. That's where the real performance gains come in :-)
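
Continuing the index sketch above, the bulk saving falls out almost for free: publishing a whole batch is the same single atomic add, just with a larger increment (assuming, as the post says, the items land in the same block):

```cpp
// Inside the SubQueueIndices sketch above: one atomic add publishes `count`
// items at once; the bulk dequeue path bumps its counters by `count` in the
// same way.
void publish_many(std::uint64_t count) {
    tail.fetch_add(count, std::memory_order_release);
}
```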

I heard there was code

Since I thought there was rather a lack of high-quality lock-free queues for C++, I wrote one using this design I came up with. (While there are others, notably the ones in Boost and Intel's TBB, mine has more features, such as having no restrictions on the element type, and is faster to boot.) You can find it over at GitHub. It's all contained in a single header, and available under the simplified BSD license. Just drop it in your project and enjoy!
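
For the impatient, basic usage looks roughly like this (per the repository's README at the time of writing; the tokens are the optional ones described earlier):

```cpp
#include "concurrentqueue.h"   // the single header from the GitHub repository
#include <cstdio>

int main() {
    moodycamel::ConcurrentQueue<int> q;

    q.enqueue(1);                         // safe to call from any thread
    q.enqueue(2);

    int item;
    while (q.try_dequeue(item))           // non-blocking; returns false when empty
        std::printf("got %d\n", item);

    // Optional tokens give the producer/consumer affinity described above.
    moodycamel::ProducerToken ptok(q);
    moodycamel::ConsumerToken ctok(q);
    q.enqueue(ptok, 3);
    if (q.try_dequeue(ctok, item))
        std::printf("got %d (with tokens)\n", item);
}
```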

Benchmarks, yay!

So, the fun part of creating data structures is writing synthetic benchmarks and seeing how fast yours is versus other existing ones. For comparison, I used the Boost 1.55 lock-free queue, Intel's TBB 4.3 concurrent_queue, another linked-list based lock-free queue of my own (a naïve design for reference), a lock-based queue using std::mutex, and a normal std::queue (for reference against a regular data structure that's accessed purely from one thread). Note that the graphs below only show a subset of the results, and omit both the naïve lock-free and single-threaded std::queue implementations.
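
The actual harness is more elaborate than this, but a minimal synthetic throughput test of the kind being described looks something like the following sketch (thread and item counts are arbitrary; this is not the code behind the numbers below):

```cpp
#include "concurrentqueue.h"
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int       threads_per_side   = 4;
    const long long items_per_producer = 1000000;
    const long long total_items        = threads_per_side * items_per_producer;

    moodycamel::ConcurrentQueue<int> q;
    std::atomic<long long> consumed(0);
    std::vector<std::thread> workers;
    auto start = std::chrono::steady_clock::now();

    for (int t = 0; t < threads_per_side; ++t) {
        workers.emplace_back([&q, items_per_producer] {             // producer
            for (long long i = 0; i < items_per_producer; ++i)
                q.enqueue(static_cast<int>(i));
        });
        workers.emplace_back([&q, &consumed, total_items] {          // consumer
            int item;
            while (consumed.load(std::memory_order_relaxed) < total_items)
                if (q.try_dequeue(item))
                    consumed.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers)
        w.join();

    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::printf("%.0f items/sec\n", static_cast<double>(total_items) / elapsed.count());
}
```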

Here are the results! Detailed raw data follows the pretty graphs (note that I had to use a logarithmic scale due to the enormous differences in absolute throughput).
