LevelDB Source Code, Part 1: SkipList

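A SkipList (skip list) supports insertion and deletion in O(log n) expected time. Compared with typical data structures such as map and set, its weakness is that this performance is only probabilistic: it depends on randomness (in LevelDB, the randomly chosen node heights) instead of being guaranteed in the worst case, similar to the contrast between Quick Sort and Merge Sort.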

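As a single-machine storage engine, LevelDB under normal workloads delivers clearly better overall performance (random and sequential reads and writes) than comparable databases such as SQLite, and this is closely tied to its use of a SkipList to hold in-memory data.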

This post records notes on some notable aspects of the design and implementation of the SkipList in LevelDB.

1. Even distribution across SkipList levels: MaxHeight = 12, RandomHeight()

MaxHeight is a key SkipList parameter and directly affects performance.

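In experiments that changed MaxHeight in the code, lowering the value caused a noticeable drop in performance, while raising it, even to 10000, showed no noticeable difference from the default MaxHeight = 12. The same held for memory usage.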

Consider the following code:

	template<typename Key, class Comparator>
	int SkipList<Key, Comparator>::RandomHeight() {
		// Increase height with probability 1 in kBranching
		static const unsigned int kBranching = 4;
		int height = 1;

		while (height < kMaxHeight && ((rnd_.Next() % kBranching) == 0)) {
			height++;
		}
		assert(height > 0);
		assert(height <= kMaxHeight);
		return height;
	}

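The key is kBranching together with the test (rnd_.Next() % kBranching) == 0: each additional level is reached with probability 1/kBranching = 1/4, so each level holds roughly 1/4 as many nodes as the level below it. Therefore, with MaxHeight = 12 and about one node on the top level, the list can evenly accommodate roughly 4^11 = 4,194,304 keys (about 4 million).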
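To see the 1/4 ratio in practice, here is a minimal standalone sketch of the same loop. It assumes std::mt19937 with a fixed seed in place of LevelDB's rnd_ generator, and RandomHeightSketch, CountLevelPopulation and EvenCapacity are hypothetical helpers for illustration, not LevelDB APIs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <random>

constexpr int kMaxHeight = 12;
constexpr unsigned int kBranching = 4;

// Same loop as SkipList::RandomHeight(); std::mt19937 stands in for rnd_ (assumption).
int RandomHeightSketch(std::mt19937& rng) {
  int height = 1;
  while (height < kMaxHeight && (rng() % kBranching) == 0) {
    height++;
  }
  return height;
}

// counts[level] = how many of the sampled nodes appear on that level (height >= level).
std::array<long, kMaxHeight + 1> CountLevelPopulation(int samples, unsigned seed) {
  std::mt19937 rng(seed);
  std::array<long, kMaxHeight + 1> counts{};
  for (int i = 0; i < samples; ++i) {
    int h = RandomHeightSketch(rng);
    for (int level = 1; level <= h; ++level) counts[level]++;
  }
  return counts;
}

// Keys that fit evenly under a single top-level node: kBranching^(kMaxHeight - 1).
constexpr std::uint64_t EvenCapacity() {
  std::uint64_t n = 1;
  for (int i = 1; i < kMaxHeight; ++i) n *= kBranching;
  return n;
}

static_assert(EvenCapacity() == 4194304, "4^11");
```

Sampling a million heights, each level's population should come out at roughly one quarter of the level below it, and EvenCapacity() reproduces the 4^11 figure.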

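Raising MaxHeight on its own does not make the SkipList any taller: since each extra level is reached with probability only 1/4, the tallest of N nodes is typically only about log_4(N) + 1 levels high (around 11 for one million keys), so a larger cap is practically never reached. MaxHeight = 12 is an empirical value that fits data sets on the order of millions of keys particularly well.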
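A rough way to check this claim is to draw many heights with the cap set to 12 and to 10000 and compare the tallest node produced. This is a sketch under assumptions: std::mt19937 stands in for LevelDB's generator, kBranching is fixed at 4, and RandomHeightCapped and TallestOf are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <random>

// RandomHeight() with a configurable cap; kBranching is fixed at 4 here.
int RandomHeightCapped(std::mt19937& rng, int max_height) {
  int height = 1;
  while (height < max_height && (rng() % 4u) == 0) {
    height++;
  }
  return height;
}

// Tallest node produced by n draws; grows like log4(n) regardless of a large cap.
int TallestOf(int n, int max_height, unsigned seed) {
  std::mt19937 rng(seed);
  int tallest = 0;
  for (int i = 0; i < n; ++i) {
    tallest = std::max(tallest, RandomHeightCapped(rng, max_height));
  }
  return tallest;
}
```

With a million draws, both caps should typically report a tallest node of about 10 to 12 levels, so raising the cap to 10000 changes almost nothing in practice.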

2. Read/write concurrency

Reads do not modify the SkipList's structure, so concurrent readers never conflict with one another.

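When reads and writes happen at the same time, the SkipList achieves "lock-free" concurrency through AtomicPointer (an atomic pointer) plus a small trick in the way the structure is modified.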

SkipList<Key, Comparator>::Node

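First, once a node has been added to the SkipList its level structure never changes: the size of its link array, port::AtomicPointer next_[1], stays fixed. (Apart from next_, a Node holds only its key, which is also immutable.)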

port::AtomicPointer next_[1]; is only a placeholder; the actual array length equals the node's height. The node is created as follows:

1     template<typename Key, class Comparator>
2     typename SkipList<Key, Comparator>::Node*
3         SkipList<Key, Comparator>::NewNode(const Key& key, int height) {
4         char* mem = arena_->AllocateAligned(
5             sizeof(Node) + sizeof(port::AtomicPointer) * (height - 1));
6         return new (mem) Node(key);
7     }

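Line 4 allocates memory sized for the node's actual height (sizeof(Node) plus height - 1 extra AtomicPointer slots), and line 6 uses placement new to explicitly run the Node constructor on that memory, completing the node (an uncommon idiom).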
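The over-allocation plus placement-new idiom can be tried outside LevelDB with a sketch like the following. Assumptions: std::malloc replaces Arena::AllocateAligned, std::atomic<Node*> replaces port::AtomicPointer, keys are int, and NewNodeSketch/FreeNodeSketch are hypothetical names. Like the original, it indexes next_ past its declared length of 1 (the classic "struct hack"), which works on mainstream compilers but is not strictly guaranteed by the C++ standard.

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <new>

// Simplified node: int key, std::atomic<Node*> instead of port::AtomicPointer (assumption).
struct Node {
  explicit Node(int k) : key(k) { next_[0].store(nullptr, std::memory_order_relaxed); }

  Node* Next(int n) { return next_[n].load(std::memory_order_acquire); }
  void SetNext(int n, Node* x) { next_[n].store(x, std::memory_order_release); }

  int const key;
  std::atomic<Node*> next_[1];  // placeholder: real length == node height
};

// Same shape as SkipList::NewNode: reserve room for (height - 1) extra links
// after the Node, then construct the Node in that memory with placement new.
Node* NewNodeSketch(int key, int height) {
  assert(height >= 1);
  void* mem = std::malloc(sizeof(Node) + sizeof(std::atomic<Node*>) * (height - 1));
  if (mem == nullptr) throw std::bad_alloc();
  Node* node = new (mem) Node(key);
  // Construct the extra link slots that live past next_[0].
  for (int i = 1; i < height; ++i) {
    new (&node->next_[i]) std::atomic<Node*>(nullptr);
  }
  return node;
}

// Node and its links are trivially destructible: end the lifetime, free the raw memory.
void FreeNodeSketch(Node* node) {
  node->~Node();
  std::free(node);
}
```

LevelDB itself never frees nodes one by one: the Arena releases all of them at once, which is why the original NewNode has no matching delete.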

Next, the four Node member functions for accessing links:

	// Accessors/mutators for links.  Wrapped in methods so we can
	// add the appropriate barriers as necessary.
	Node* Next(int n);
	void SetNext(int n, Node* x);

	// No-barrier variants that can be safely used in a few locations.
	Node* NoBarrier_Next(int n);
	void NoBarrier_SetNext(int n, Node* x);

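The first two functions use memory barriers and are safe under concurrent access; the last two skip the barriers and are safe only in the few places where ordering is already guaranteed by other means. The no-barrier variants show the author relaxing encapsulation in pursuit of maximum performance.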
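One way to picture the difference is to express the four functions with std::atomic memory orders: the barrier versions as an acquire load and a release store, the NoBarrier versions as relaxed operations. Newer LevelDB releases use std::atomic in this way, but treat the code below as an illustration: the fixed-height Node and the PublishAndObserve demo are hypothetical. A writer fills a node with relaxed stores and publishes it with one release store; a reader that acquires the pointer is then guaranteed to see those earlier stores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Fixed-height stand-in for SkipList::Node; std::atomic replaces port::AtomicPointer.
struct Node {
  explicit Node(int k) : key(k) {
    for (auto& p : next_) p.store(nullptr, std::memory_order_relaxed);
  }
  int const key;
  std::atomic<Node*> next_[2];

  // Barrier versions: acquire load / release store.
  Node* Next(int n) { return next_[n].load(std::memory_order_acquire); }
  void SetNext(int n, Node* x) { next_[n].store(x, std::memory_order_release); }

  // No-barrier versions: still atomic, but impose no ordering on other memory.
  Node* NoBarrier_Next(int n) { return next_[n].load(std::memory_order_relaxed); }
  void NoBarrier_SetNext(int n, Node* x) { next_[n].store(x, std::memory_order_relaxed); }
};

// The reader's acquire load pairs with the writer's release store, so once the
// reader sees x it must also see every store the writer made to x beforehand.
bool PublishAndObserve() {
  Node head(0), tail(99);
  Node* x = new Node(42);
  std::thread writer([&] {
    x->NoBarrier_SetNext(0, &tail);  // x is still private: relaxed is enough
    x->NoBarrier_SetNext(1, &tail);
    head.SetNext(0, x);              // publish x
  });
  Node* seen = nullptr;
  while ((seen = head.Next(0)) == nullptr) {
    // spin until the writer publishes x
  }
  bool ok = seen->key == 42 && seen->NoBarrier_Next(0) == &tail &&
            seen->NoBarrier_Next(1) == &tail;
  writer.join();
  delete x;
  return ok;
}
```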

template<typename Key, class Comparator> class SkipList

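On the read side, concurrency is handled by always stepping to the next node through the Next member function, which performs an atomic load with a barrier.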
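A reader's search can be sketched as follows, modeled on the structure of SkipList::FindGreaterOrEqual: start on the highest level, move right while the next key is smaller, otherwise drop one level. Assumptions: int keys, std::atomic in place of port::AtomicPointer, a fixed-size next_ array, a caller-supplied top_level instead of max_height_, and the hypothetical name FindGreaterOrEqualSketch. Every step goes through Next(), and a null link is treated as larger than any key, which is also why a reader that sees a new max_height_ but a still-null link from the head simply drops to the next level.

```cpp
#include <atomic>
#include <cassert>

constexpr int kMaxHeight = 4;  // small cap, enough for the sketch

struct Node {
  explicit Node(int k) : key(k) {
    for (auto& p : next_) p.store(nullptr, std::memory_order_relaxed);
  }
  int const key;
  std::atomic<Node*> next_[kMaxHeight];  // fixed size here for simplicity
  Node* Next(int n) { return next_[n].load(std::memory_order_acquire); }
  void SetNext(int n, Node* x) { next_[n].store(x, std::memory_order_release); }
};

// Returns the first node whose key is >= key, or nullptr if there is none.
Node* FindGreaterOrEqualSketch(Node* head, int top_level, int key) {
  Node* x = head;
  int level = top_level - 1;
  while (true) {
    Node* next = x->Next(level);  // atomic acquire load on every step
    if (next != nullptr && next->key < key) {
      x = next;  // key is further right on this level
    } else if (level == 0) {
      return next;
    } else {
      level--;  // nullptr or a larger key: drop one level
    }
  }
}
```

For example, with level 0 as head -> 10 -> 20 -> 30, level 1 as head -> 10 -> 30 and level 2 as head -> 30, searching for 25 moves to 10 and then 20 before returning the node with key 30.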

Concurrency on the write side is a bit more involved. Here is the Insert code:

 1     template<typename Key, class Comparator>
 2     void SkipList<Key, Comparator>::Insert(const Key& key) {
 3         // TODO(opt): We can use a barrier-free variant of FindGreaterOrEqual()
 4         // here since Insert() is externally synchronized.
 5         Node* prev[kMaxHeight];
 6         Node* x = FindGreaterOrEqual(key, prev);
 7
 8         // Our data structure does not allow duplicate insertion
 9         assert(x == NULL || !Equal(key, x->key));
10
11         int height = RandomHeight();
12         if (height > GetMaxHeight()) {
13             for (int i = GetMaxHeight(); i < height; i++) {
14                 prev[i] = head_;
15             }
16             //fprintf(stderr, "Change height from %d to %d\n", max_height_, height);
17
18             // It is ok to mutate max_height_ without any synchronization
19             // with concurrent readers.  A concurrent reader that observes
20             // the new value of max_height_ will see either the old value of
21             // new level pointers from head_ (NULL), or a new value set in
22             // the loop below.  In the former case the reader will
23             // immediately drop to the next level since NULL sorts after all
24             // keys.  In the latter case the reader will use the new node.
25             max_height_.NoBarrier_Store(reinterpret_cast<void*>(height));
26         }
27
28         x = NewNode(key, height);
29         for (int i = 0; i < height; i++) {
30             // NoBarrier_SetNext() suffices since we will add a barrier when
31             // we publish a pointer to "x" in prev[i].
32             x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));    // deep optimization for performance and concurrency: both calls here are NoBarrier
33             prev[i]->SetNext(i, x);
34         }
35     }

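The code up to line 15 finds the insertion position and chooses the node's height (prev records the predecessor on each level, and any new levels use head_ as their predecessor). Line 25 makes the first state change: storing the new max_height_.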

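The author's comment describes two situations a concurrent reader may run into, but a complete description is:

1. The reader reads the old max_height_, after which the writer updates max_height_ and is in the middle of, or has finished, inserting the node.

2. The reader reads the new max_height_ while the writer is in the middle of, or has finished, inserting the node.

For these two cases (really several, not subdivided here), the author states that there is no concurrency problem. Why?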

The key is the insertion pattern in lines 28-34:

28         x = NewNode(key, height);
29         for (int i = 0; i < height; i++) {
30             // NoBarrier_SetNext() suffices since we will add a barrier when
31             // we publish a pointer to "x" in prev[i].
32             x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));    // deep optimization for performance and concurrency: both calls here are NoBarrier
33             prev[i]->SetNext(i, x);
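34         }

Where exactly is the key? Two points: the order of the for loop on line 29, and the SetNext on line 33.

1. Linking from the bottom level upward guarantees that once the node is linked into a level, every level below it has already been updated, so a reader that finds the node on a higher level can always continue down through it.

2. SetNext is an atomic store with a barrier: a reader calling Next sees either the old pointer or the new node x, and when it sees x it also sees the links set on x before publication. So readers calling Next never run into a concurrency problem.

Note also that on line 32 the author, for maximum performance, uses the non-thread-safe (no-barrier) variants both for x's SetNext and for prev's Next. This is safe because x is not yet reachable by any reader at that point and prev[i]'s links are modified only by the single writer; the barrier in SetNext on line 33 is what publishes x to readers.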
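Putting these points together, here is a minimal single-writer skip list that follows the same publication order as SkipList::Insert: pick a height, bump max_height_ with a relaxed store, set the new node's links with relaxed stores while it is still private, then publish it level by level from the bottom up with release stores. This is a sketch, not LevelDB's code: MiniSkipList is a hypothetical name, keys are int, std::atomic replaces port::AtomicPointer, nodes use a fixed-size link array and plain new instead of an Arena, and std::mt19937 replaces rnd_.

```cpp
#include <atomic>
#include <cassert>
#include <random>

class MiniSkipList {
 public:
  static constexpr int kMaxHeight = 12;

  MiniSkipList() : head_(new Node(0)), max_height_(1), rng_(301) {}
  ~MiniSkipList() {
    Node* x = head_;
    while (x != nullptr) { Node* next = x->NoBarrier_Next(0); delete x; x = next; }
  }
  MiniSkipList(const MiniSkipList&) = delete;
  MiniSkipList& operator=(const MiniSkipList&) = delete;

  // Requires external synchronization: only one thread may call Insert at a time.
  void Insert(int key) {
    Node* prev[kMaxHeight];
    Node* x = FindGreaterOrEqual(key, prev);
    assert(x == nullptr || x->key != key);  // no duplicates, as in LevelDB
    int height = RandomHeight();
    int current = max_height_.load(std::memory_order_relaxed);
    if (height > current) {
      for (int i = current; i < height; i++) prev[i] = head_;
      // Relaxed is enough: a reader that sees the new height but a null
      // link from head_ just drops to the next level.
      max_height_.store(height, std::memory_order_relaxed);
    }
    x = new Node(key);
    for (int i = 0; i < height; i++) {  // bottom level first
      x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));  // x is still private
      prev[i]->SetNext(i, x);  // release store publishes x on level i
    }
  }

  // Safe to call from readers concurrently with a single writer.
  bool Contains(int key) const {
    Node* x = FindGreaterOrEqual(key, nullptr);
    return x != nullptr && x->key == key;
  }

 private:
  struct Node {
    explicit Node(int k) : key(k) {
      for (auto& p : next_) p.store(nullptr, std::memory_order_relaxed);
    }
    int const key;
    std::atomic<Node*> next_[kMaxHeight];
    Node* Next(int n) { return next_[n].load(std::memory_order_acquire); }
    void SetNext(int n, Node* x) { next_[n].store(x, std::memory_order_release); }
    Node* NoBarrier_Next(int n) { return next_[n].load(std::memory_order_relaxed); }
    void NoBarrier_SetNext(int n, Node* x) { next_[n].store(x, std::memory_order_relaxed); }
  };

  int RandomHeight() {
    int height = 1;
    while (height < kMaxHeight && (rng_() % 4u) == 0) height++;
    return height;
  }

  // Returns the first node >= key; fills prev[level] (if non-null) with each level's predecessor.
  Node* FindGreaterOrEqual(int key, Node** prev) const {
    Node* x = head_;
    int level = max_height_.load(std::memory_order_relaxed) - 1;
    while (true) {
      Node* next = x->Next(level);
      if (next != nullptr && next->key < key) {
        x = next;
      } else {
        if (prev != nullptr) prev[level] = x;
        if (level == 0) return next;
        level--;
      }
    }
  }

  Node* const head_;
  std::atomic<int> max_height_;
  std::mt19937 rng_;
};
```

Readers may call Contains() while one thread calls Insert(); two threads calling Insert() at the same time would still need external synchronization.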

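Concurrent writers, however, are not thread-safe on the SkipList itself: as the comment on lines 3-4 notes, Insert() is externally synchronized. LevelDB's MemTable uses a separate technique to handle write concurrency.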
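What "externally synchronized" means can be illustrated with a mutex that admits one writer at a time while readers stay lock-free. This is only an illustration under assumptions, not how LevelDB's MemTable handles it: SerializedWriter, VectorList and ParallelInsertCount are hypothetical names, and VectorList stands in for any structure exposing an Insert(key) method.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Serializes all calls to List::Insert behind one mutex (single-writer rule).
template <typename List>
class SerializedWriter {
 public:
  explicit SerializedWriter(List* list) : list_(list) {}

  template <typename Key>
  void Insert(const Key& key) {
    std::lock_guard<std::mutex> lock(mu_);  // at most one writer inside Insert
    list_->Insert(key);
  }

 private:
  std::mutex mu_;
  List* list_;
};

// Stand-in structure with Insert(key); not safe for concurrent writers on its own.
struct VectorList {
  std::vector<int> keys;
  void Insert(int key) { keys.push_back(key); }
};

// Several threads insert through the wrapper; returns how many keys were stored.
std::size_t ParallelInsertCount(int num_threads, int per_thread) {
  VectorList list;
  SerializedWriter<VectorList> writer(&list);
  std::vector<std::thread> threads;
  for (int t = 0; t < num_threads; ++t) {
    threads.emplace_back([&writer, t, per_thread] {
      for (int i = 0; i < per_thread; ++i) writer.Insert(t * per_thread + i);
    });
  }
  for (auto& th : threads) th.join();
  return list.keys.size();
}
```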

template<typename Key, class Comparator> class SkipList<Key, Comparator>::Iterator 

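The SkipList iterator supports traversal in both directions. Its implementation has nothing special about it; it is simply a thin wrapper over the SkipList, so it is skipped here.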

Insert: 1252072 Contains: 1296074

Time: 2024-10-07 21:19:17
