[转载] 首席工程师揭秘：LinkedIn大数据后台是如何运作的？(一）

本文作者：Jay Kreps，linkedin公司首席工程师；文章来自于他在linkedin上的分享；原文标题：The Log: What every software engineer should know about real-time data’s unifying abstraction。

文章内容非常干货，非常值得学习。文章将以四部分进行阐述，建议大家耐心看完。

第一部分：Log是什么？

第二部分：数据集成

第三部分：日志和实时流处理

第四部分：系统建设

我在六年前的一个令人兴奋的时刻加入到LinkedIn公司。从那个时候开始我们就破解单一的、集中式数据库的限制，并且启动到特殊的分布式系统套件的转换。这是一件令人兴奋的事情：我们构建、部署，而且直到今天仍然在运行的分布式图形数据库、分布式搜索后端、Hadoop安装以及第一代和第二代键值数据存储。

从这一切里我们体会到的最有益的事情是我们构建的许多东西的核心里都包含一个简单的理念：日志。有时候也称作预先写入日志或者提交日志或者事务日志，日志几乎在计算机产生的时候就存在，同时它还是许多分布式数据系统和实时应用结构的核心。

不懂得日志，你就不可能完全懂得数据库，NoSQL存储，键值存储，复制，paxos,Hadoop,版本控制以及几乎所有的软件系统；然而大多数软件工程师对它们不是很熟悉。我愿意改变这种现状。在这篇博客文章里，我将带你浏览你必须了解的有关日志的所有的东西，包括日志是什么，如何在数据集成、实时处理和系统构建中使用日志等。

第一部分：日志是什么？

日志是一种简单的不能再简单的存储抽象。它是一个只能增加的，完全按照时间排序的一系列记录。日志看起来如下：

我们可以给日志的末尾添加记录，并且可以从左到右读取日志记录。每一条记录都指定了一个唯一的有一定顺序的日志记录编号。

日志记录的排序是由“时间”来确定的，这是因为位于左边的日志记录比位于右边的要早些。日志记录编号可以看作是这条日志记录的“时间戳”。在一开始就把这种排序说成是按时间排序显得有点多余，不过，与任何一个具体的物理时钟相比，时间属性是非常便于使用的属性。在我们运行多个分布式系统的时候，这个属性就显得非常重要。

对于这篇讨论的目标而言，日志记录的内容和格式不怎么重要。另外提醒一下，在完全耗尽存储空间的情况下，我们不可能再给日志添加记录。稍后我们将会提到这个问题。

日志并不是完全不同于文件或者数据表的。文件是由一系列字节组成，表是由一系列记录组成，而日志实际上只是按照时间顺序存储记录的一种数据表或者文件。

此时，你可能奇怪为什么要讨论这么简单的事情呢？不同环境下的一个只可增加的有一定顺序的日志记录是怎样与数据系统关联起来的呢？答案是日志有其特定的应用目标：它记录了什么时间发生了什么事情。而对分布式数据系统许多方面而言，这才是问题的真正核心。

不过，在我们进行更加深入的讨论之前，让我先澄清有些让人混淆的概念。每个编程人员都熟悉另一种日志记录-应用使用syslog或者log4j可能写入到本地文件里的没有结构的错误信息或者追踪信息。为了区分开来，我们把这种情形的日志记录称为“应用日志记录”。应用日志记录是我在这儿所说的日志的一种低级的变种。最大的区别是：文本日志意味着主要用来方便人们阅读，而我所说明的“日志”或者“数据日志”的建立是方便程序访问。

（实际上，如果你对它进行深入的思考，那么人们读取某个机器上的日志这种理念有些不顺应时代潮流。当涉及到许多服务和服务器的时候，这种方法很快就变成一个难于管理的方式，而且为了认识多个机器的行为，日志的目标很快就变成查询和图形化这些行为的输入了-对多个机器的某些行为而言，文件里的英文形式的文本同这儿所描述的这种结构化的日志相比几乎就不适合了。）

数据库日志

我不知道日志概念起源于何处-可能它就像二进制搜索一样：发明者认为它太简单而不能当作一项发明。它早在IBM的系统R出现时候就出现了。数据库里的用法是在崩溃的时候用它来同步各种数据结构和索引。为了保证操作的原子性和持久性，在对数据库维护的所有各种数据结构做更改之前，数据库把即将修改的信息誊写到日志里。日志记录了发生了什么，而且其中的每个表或者索引都是一些数据结构或者索引的历史映射。由于日志是即刻永久化的，可以把它当作崩溃发生时用来恢复其他所有永久性结构的可信赖数据源。

随着时间的推移，日志的用途从实现ACID细节成长为数据库间复制数据的一种方法。利用日志的结果就是发生在数据库上的更改顺序与远端复制数据库上的更改顺序需要保持完全同步。

Oracle,MySQL 和PostgreSQL都包括用于给备用的复制数据库传输日志的日志传输协议。Oracle还把日志产品化为一个通用的数据订阅机制，这样非Oracle数据订阅用户就可以使用XStreams和GoldenGate订阅数据了，MySQL和PostgreSQL上的类似的实现则成为许多数据结构的关键组件。
正是由于这样的起源，机器可识别的日志的概念大部分都被局限在数据库内部。日志用做数据订阅的机制似乎是偶然出现的，不过要把这种抽象用于支持所有类型的消息传输、数据流和实时数据处理是不切实际的。

分布式系统日志

日志解决了两个问题：更改动作的排序和数据的分发，这两个问题在分布式数据系统里显得尤为重要。协商出一致的更改动作的顺序（或者说保持各个子系统本身的做法，但可以进行存在副作用的数据拷贝）是分布式系统设计的核心问题之一。

以日志为中心实现分布式系统是受到了一个简单的经验常识的启发，我把这个经验常识称为状态机复制原理：如果两个相同的、确定性的进程从同一状态开始，并且以相同的顺序获得相同的输入，那么这两个进程将会生成相同的输出，并且结束在相同的状态。

这也许有点难以理解，让我们更加深入的探讨，弄懂它的真正含义。

确定性意味着处理过程是与时间无关的，而且任何其他“外部的“输入不会影响到处理结果。例如，如果一个程序的输出会受到线程执行的具体顺序影响，或者受到gettimeofday调用、或者其他一些非重复性事件的影响，那么这样的程序一般最有可能被认为是非确定性的。

进程状态是进程保存在机器上的任何数据，在进程处理结束的时候，这些数据要么保存在内存里，要么保存在磁盘上。

以相同的顺序获得相同输入的地方应当引起注意-这就是引入日志的地方。这儿有一个重要的常识：如果给两段确定性代码相同的日志输入，那么它们就会生成相同的输出。

分布式计算这方面的应用就格外明显。你可以把用多台机器一起执行同一件事情的问题缩减为实现分布式一致性日志为这些进程输入的问题。这儿日志的目的是把所有非确定性的东西排除在输入流之外，来确保每个复制进程能够同步地处理输入。

当你理解了这个以后，状态机复制原理就不再复杂或者说不再深奥了：这或多或少的意味着“确定性的处理过程就是确定性的”。不管怎样，我都认为它是分布式系统设计里较常用的工具之一。

这种方式的一个美妙之处就在于索引日志的时间戳就像时钟状态的一个副本——你可以用一个单独的数字描述每一个副本，这就是经过处理的日志的时间戳。时间戳与日志一一对应着整个副本的状态。

由于写进日志的内容的不同，也就有许多在系统中应用这个原则的不同方式。举个例子，我们记录一个服务的请求，或者服务从请求到响应的状态变化，或者它执行命令的转换。理论上来说，我们甚至可以为每一个副本记录一系列要执行的机器指令或者调用的方法名和参数。只要两个进程用相同的方式处理这些输入，这些进程就会保持副本的一致性。

一千个人眼中有一千种日志的用法。数据库工作者通常区分物理日志和逻辑日志。物理日志就是记录每一行被改变的内容。逻辑日志记录的不是改变的行而是那些引起行的内容被改变的SQL语句（insert，update和delete语句）。

分布式系统通常可以宽泛分为两种方法来处理数据和完成响应。“状态机器模型”通常引用一个主动-主动的模型——也就是我们为之记录请求和响应的对象。对此进行一个细微的更改，称之为“预备份模型”，就是选出一个副本做为leader，并允许它按照请求到达的时间来进行处理并从处理过程中输出记录其状态改变的日志。其他的副本按照leader状态改变的顺序而应用那些改变，这样他们之间达到同步，并能够在leader失败的时候接替leader的工作。

为了理解两种方式的不同，我们来看一个不太严谨的例子。假定有一个算法服务的副本，保持一个独立的数字作为它的状态（初始值为0），并对这个值进行加法和乘法运算。主动-主动方式应该会输出所进行的变换，比如“+1”，“*2”等。每一个副本都会应用这些变换，从而得到同样的解集。主动-被动方式将会有一个独立的主体执行这些变换并输出结果日志，比如“1”，“3”，“6”等。这个例子也清楚的展示了为什么说顺序是保证各副本间一致性的关键：一次加法和乘法的顺序的改变将会导致不同的结果。

分布式日志可以理解为一致性问题模型的数据结构。因为日志代表了后续追加值的一系列决策。你需要重新审视Paxos算法簇，尽管日志模块是他们最常见的应用。在Paxos算法中，它通常通过使用称之为多paxos的协议，这种协议将日志建模为一系列的问题，在日志中每个问题都有对应的部分。在ZAB， RAFT等其它的协议中，日志的作用尤为突出，它直接对维护分布式的、一致性的日志的问题建模。

我怀疑的是，我们就历史发展的观点是有偏差的，可能是由于过去的几十年中，分布式计算的理论远超过了其实际应用。在现实中，共识的问题是有点太简单了。计算机系统很少需要决定单个值，他们几乎总是处理成序列的请求。这样的记录，而不是一个简单的单值寄存器，自然是更加抽象。

此外，专注于算法掩盖了抽象系统需要的底层的日志。我怀疑，我们最终会把日志中更注重作为一个商品化的基石，不论其是否以同样的方式实施的，我们经常谈论一个哈希表而不是纠结我们得到是不是具体某个细节的哈希表，例如线性或者带有什么什么其它变体哈希表。日志将成为一种大众化的接口，为大多数算法和其实现提升提供最好的保证和最佳的性能。

变更日志101：表与事件的二相性。

让我们继续聊数据库。数据库中存在着大量变更日志和表之间的二相性。这些日志有点类似借贷清单和银行的流程，数据库表就是当前的盈余表。如果你有大量的变更日志，你就可以使用这些变更用以创建捕获当前状态的表。这张表将记录每个关键点（日志中一个特别的时间点）的状态信息。这就是为什么日志是非常基本的数据结构的意义所在：日志可用来创建基本表，也可以用来创建各类衍生表。同时意味着可以存储非关系型的对象。

这个流程也是可逆的：如果你正在对一张表进行更新，你可以记录这些变更，并把所有更新的日志发布到表的状态信息中。这些变更日志就是你所需要的支持准实时的克隆。基于此，你就可以清楚的理解表与事件的二相性：表支持了静态数据而日志捕获变更。日志的魅力就在于它是变更的完整记录，它不仅仅捕获了表的最终版本的内容，它还记录了曾经存在过的其它版本的信息。日志实质上是表历史状态的一系列备份。

这可能会引起你对源代码的版本管理。源代码管理和数据库之间有密切关系。版本管理解决了一个大家非常熟悉的问题，那就是什么是分布式数据系统需要解决的— 时时刻刻在变化着的分布式管理。版本管理系统通常以补丁的发布为基础，这实际上可能是一个日志。您可以直接对当前类似于表中的代码做出“快照”互动。你会注意到，与其他分布式状态化系统类似，版本控制系统当你更新时会复制日志，你希望的只是更新补丁并将它们应用到你的当前快照中。

最近，有些人从Datomic –一家销售日志数据库的公司得到了一些想法。这些想法使他们对如何在他们的系统应用这些想法有了开阔的认识。当然这些想法不是只针对这个系统，他们会成为十多年分布式系统和数据库文献的一部分。

这可能似乎有点过于理想化。但是不要悲观！我们会很快把它实现。

接下来的内容

在这篇文章的其余部分，我将试图说明日志除了可用在分布式计算或者抽象分布式计算模型内部之外，还可用在哪些方面。其中包括：

数据集成-让机构的全部存储和处理系统里的所有数据很容易地得到访问。
实时数据处理-计算生成的数据流。
分布式系统设计-实际应用的系统是如何通过使用集中式日志来简化设计的。

所有这些用法都是通过把日志用做单独服务来实现的。

在上面任何一种用法里，日志的用途开始都是使用了日志所能提供的某个简单功能：生成永久的、可重现的历史记录。令人意外的是，问题的核心是可以让多少台机器以特定的方式，按照自身的速度重现历史记录的能力。

英语原文：

I joined LinkedIn about six years ago at a particularly interesting time. We were just beginning to run up against the limits of our monolithic, centralized database and needed to start the transition to a portfolio of specialized distributed systems. This has been an interesting experience: we built, deployed, and run to this day a distributed graph database, a distributed search backend, a Hadoop installation, and a first and second generation key-value store.
One of the most useful things I learned in all this was that many of the things we were building had a very simple concept at their heart: the log. Sometimes called write-ahead logs or commit logs or transaction logs, logs have been around almost as long as computers and are at the heart of many distributed data systems and real-time application architectures.
You can’t fully understand databases, NoSQL stores, key value stores, replication, paxos, hadoop, version control, or almost any software system without understanding logs; and yet, most software engineers are not familiar with them. I’d like to change that. In this post, I’ll walk you through everything you need to know about logs, including what is log and how to use logs for data integration, real time processing, and system building.
Part One: What Is a Log?

A log is perhaps the simplest possible storage abstraction. It is an append-only, totally-ordered sequence of records ordered by time. It looks like this:

Records are appended to the end of the log, and reads proceed left-to-right. Each entry is assigned a unique sequential log entry number.

The ordering of records defines a notion of “time” since entries to the left are defined to be older then entries to the right. The log entry number can be thought of as the “timestamp” of the entry. Describing this ordering as a notion of time seems a bit odd at first, but it has the convenient property that it is decoupled from any particular physical clock. This property will turn out to be essential as we get to distributed systems.

The contents and format of the records aren’t important for the purposes of this discussion. Also, we can’t just keep adding records to the log as we’ll eventually run out of space. I’ll come back to this in a bit.

So, a log is not all that different from a file or a table. A file is an array of bytes, a table is an array of records, and a log is really just a kind of table or file where the records are sorted by time.

At this point you might be wondering why it is worth talking about something so simple? How is a append-only sequence of records in any way related to data systems? The answer is that logs have a specific purpose: they record what happened and when. For distributed data systems this is, in many ways, the very heart of the problem.

But before we get too far let me clarify something that is a bit confusing. Every programmer is familiar with another definition of logging—the unstructured error messages or trace info an application might write out to a local file using syslog or log4j. For clarity I will call this “application logging”. The application log is a degenerative form of the log concept I am describing. The biggest difference is that text logs are meant to be primarily for humans to read and the “journal” or “data logs” I’m describing are built for programmatic access.

(Actually, if you think about it, the idea of humans reading through logs on individual machines is something of an anachronism. This approach quickly becomes an unmanageable strategy when many services and servers are involved and the purpose of logs quickly becomes as an input to queries and graphs to understand behavior across many machines—something for which english text in files is not nearly as appropriate as the kind structured log described here.)

Logs in databases

I don’t know where the log concept originated—probably it is one of those things like binary search that is too simple for the inventor to realize it was an invention. It is present as early as IBM’s System R. The usage in databases has to do with keeping in sync the variety of data structures and indexes in the presence of crashes. To make this atomic and durable, a database uses a log to write out information about the records they will be modifying, before applying the changes to all the various data structures it maintains. The log is the record of what happened, and each table or index is a projection of this history into some useful data structure or index. Since the log is immediately persisted it is used as the authoritative source in restoring all other persistent structures in the event of a crash.

Over-time the usage of the log grew from an implementation detail of ACID to a method for replicating data between databases. It turns out that the sequence of changes that happened on the database is exactly what is needed to keep a remote replica database in sync. Oracle, MySQL, and PostgreSQL include log shipping protocols to transmit portions of log to replica databases which act as slaves. Oracle has productized the log as a general data subscription mechanism for non-oracle data subscribers with their XStreams and GoldenGate and similar facilities in MySQL and PostgreSQL are key components of many data architectures.

Because of this origin, the concept of a machine readable log has largely been confined to database internals. The use of logs as a mechanism for data subscription seems to have arisen almost by chance. But this very abstraction is ideal for supporting all kinds of messaging, data flow, and real-time data processing.

Logs in distributed systems

The two problems a log solves—ordering changes and distributing data—are even more important in distributed data systems. Agreeing upon an ordering for updates (or agreeing to disagree and coping with the side-effects) are among the core design problems for these systems.

The log-centric approach to distributed systems arises from a simple observation that I will call the State Machine Replication Principle:

If two identical, deterministic processes begin in the same state and get the same inputs in the same order, they will produce the same output and end in the same state.

This may seem a bit obtuse, so let’s dive in and understand what it means.

Deterministic means that the processing isn’t timing dependent and doesn’t let any other “out of band” input influence its results. For example a program whose output is influenced by the particular order of execution of threads or by a call to gettimeofday or some other non-repeatable thing is generally best considered as non-deterministic.

The state of the process is whatever data remains on the machine, either in memory or on disk, at the end of the processing.

The bit about getting the same input in the same order should ring a bell—that is where the log comes in. This is a very intuitive notion: if you feed two deterministic pieces of code the same input log, they will produce the same output.
The application to distributed computing is pretty obvious. You can reduce the problem of making multiple machines all do the same thing to the problem of implementing a distributed consistent log to feed these processes input. The purpose of the log here is to squeeze all the non-determinism out of the input stream to ensure that each replica processing this input stays in sync.

When you understand it, there is nothing complicated or deep about this principle: it more or less amounts to saying “deterministic processing is deterministic”. Nonetheless, I think it is one of the more general tools for distributed systems design.

One of the beautiful things about this approach is that the time stamps that index the log now act as the clock for the state of the replicas—you can describe each replica by a single number, the timestamp for the maximum log entry it has processed. This timestamp combined with the log uniquely captures the entire state of the replica.

There are a multitude of ways of applying this principle in systems depending on what is put in the log. For example, we can log the incoming requests to a service, or the state changes the service undergoes in response to request, or the transformation commands it executes. Theoretically, we could even log a series of machine instructions for each replica to execute or the method name and arguments to invoke on each replica. As long as two processes process these inputs in the same way, the processes will remaining consistent across replicas.

Different groups of people seem to describe the uses of logs differently. Database people generally differentiate between physical and logical logging. Physical logging means logging the contents of each row that is changed. Logical logging means logging not the changed rows but the SQL commands that lead to the row changes (the insert, update, and delete statements).

The distributed systems literature commonly distinguishes two broad approaches to processing and replication. The “state machine model” usually refers to an active-active model where we keep a log of the incoming requests and each replica processes each request. A slight modification of this, called the “primary-backup model”, is to elect one replica as the leader and allow this leader to process requests in the order they arrive and log out the changes to its state from processing the requests. The other replicas apply in order the state changes the leader makes so that they will be in sync and ready to take over as leader should the leader fail.

To understand the difference between these two approaches, let’s look at a toy problem. Consider a replicated “arithmetic service” which maintains a single number as its state (initialized to zero) and applies additions and multiplications to this value. The active-active approach might log out the transformations to apply, say “+1″, “*2″, etc. Each replica would apply these transformations and hence go through the same set of values. The “active-passive” approach would have a single master execute the transformations and log out the result, say “1”, “3”, “6”, etc. This example also makes it clear why ordering is key for ensuring consistency between replicas: reordering an addition and multiplication will yield a different result.

The distributed log can be seen as the data structure which models the problem of consensus. A log, after all, represents a series of decisions on the “next” value to append. You have to squint a little to see a log in the Paxos family of algorithms, though log-building is their most common practical application. With Paxos, this is usually done using an extension of the protocol called “multi-paxos”, which models the log as a series of consensus problems, one for each slot in the log. The log is much more prominent in other protocols such as ZAB, RAFT, and Viewstamped Replication, which directly model the problem of maintaining a distributed, consistent log.

My suspicion is that our view of this is a little bit biased by the path of history, perhaps due to the few decades in which the theory of distributed computing outpaced its practical application. In reality, the consensus problem is a bit too simple. Computer systems rarely need to decide a single value, they almost always handle a sequence of requests. So a log, rather than a simple single-value register, is the more natural abstraction.

Furthermore, the focus on the algorithms obscures the underlying log abstraction systems need. I suspect we will end up focusing more on the log as a commoditized building block irrespective of its implementation in the same way we often talk about a hash table without bothering to get in the details of whether we mean the murmur hash with linear probing or some other variant. The log will become something of a commoditized interface, with many algorithms and implementations competing to provide the best guarantees and optimal performance.

Changelog 101: Tables and Events are Dual

Let’s come back to databases for a bit. There is a facinating duality between a log of changes and a table. The log is similar to the list of all credits and debits and bank processes; a table is all the current account balances. If you have a log of changes, you can apply these changes in order to create the table capturing the current state. This table will record the latest state for each key (as of a particular log time). There is a sense in which the log is the more fundamental data structure: in addition to creating the original table you can also transform it to create all kinds of derived tables. (And yes, table can mean keyed data store for the non-relational folks.)

This process works in reverse too: if you have a table taking updates, you can record these changes and publish a “changelog” of all the updates to the state of the table. This changelog is exactly what you need to support near-real-time replicas. So in this sense you can see tables and events as dual: tables support data at rest and logs capture change. The magic of the log is that if it is a complete log of changes, it holds not only the contents of the final version of the table, but also allows recreating all other versions that might have existed. It is, effectively, a sort of backup of every previous state of the table.

This might remind you of source code version control. There is a close relationship between source control and databases. Version control solves a very similar problem to what distributed data systems have to solve—managing distributed, concurrent changes in state. A version control system usually models the sequence of patches, which is in effect a log. You interact directly with a checked out “snapshot” of the current code which is analogous to the table. You will note that in version control systems, as in other distributed stateful systems, replication happens via the log: when you update, you pull down just the patches and apply them to your current snapshot.

Some people have seen some of these ideas recently from Datomic, a company selling a log-centric database. This presentation gives a great overview of how they have applied the idea in their system. These ideas are not unique to this system, of course, as they have been a part of the distributed systems and database literature for well over a decade.

This may all seem a little theoretical. Do not despair! We’ll get to practical stuff pretty quickly.

原文作者：Jay Kreps 译者：LitStone, super0555, 几点人, cmy00cmy, tnjin, 928171481, 黄劼等。来自：开源中国