Let it crash philosophy for distributed systems

This past weekend I read Joe Armstrong’s paper on the history of Erlang. Now, HOPL papers in general are like candy for me, and this one did not disappoint. There’s more in this paper that I can cover in one post, so today I’m going to concentrate on one particular feature of Erlang highlighted by Armstrong.

Although Erlang is designed to encourage/facilitate a massively parallel programming style, its error handling may be even more noteworthy. Like everything else in Erlang, its error handling is designed to be distributed, and for good reason:

Error handling in Erlang is very different from error handling in conventional programming languages. The key observation here is to note that the error-handling mechanisms were designed for building fault-tolerant systems, and not merely for protecting from program exceptions. You cannot build a fault-tolerant system if you only have one computer. The minimal configuration for a fault tolerant system has two computers. These must be configured so that both observe each other. If one of the computers crashes, then the other computer must take over whatever the first computer was doing.

This means that the model for error handling is based on the idea of two computers that observe each other.

Erlang is famous for its features which help programmers to produce stable systems in the real world. Its shared nothing architecture and ability to hot-swap code are well-known. But these features are available in other systems. The "links" feature, on the other hand, seems to be unique. When you create a process in Erlang, you can link it to another process; this link essentially means, "If that process crashes, I’d like to crash, also; and if I crash, that process should die, too." Here is Armstrong’s description:

Links in Erlang are provided to control error propagation paths for errors between processes. An Erlang process will die if it evaluates illegal code, so, for example, if a process tries to divide by zero it will die. The basic model of error handling is to assume that some other process in the system will observe the death of the process and take appropriate corrective actions. But which process in the system should do this? If there are several thousand processes in the system then how do we know which process to inform when an error occurs? The answer is the linked process. If some process A evaluates the primitive link(B) then it becomes linked to A . If A dies then B is informed. If B dies then A is informed.

Using links, we can create sets of processes that are linked together. If these are normal processes, they will die immediately if they are linked to a process that dies with an error. The idea here is to create sets of processes such that if any process in the set dies, then they will all die. This mechanism provides the invariant that either all the processes in the set are alive or none of them are. This is very useful for programming error-recovery strategies in complex situations. As far as I know, no other programming language has anything remotely like this.

In addition to simply killing the linked process, the link can also function as a kind of signal to a system process that a group of processes have died, so that appropriate action can be taken, such as restarting a process group.

Like Armstrong, I cannot think of another system that works quite this way. The closest analogy I can think of is a distributed transaction. But distributed transactions have quite a bit more overhead, because they’re all about providing serializable access to shared data, which Erlang just doesn’t allow.

Armstrong says that the idea of links was inspired by the ‘C wire’ in early telephone exchanges:

The C wire went back to the exchange and through all the electromechanical relays involved in setting up a call. If anything went wrong, or if either partner terminated the call, then the C wire was grounded. Grounding the C wire caused a knock-on effect in the exchange that freed all resources connected to the C line.

Armstrong says that the links feature encourages a worker/supervisor style of programming which is "not possible in a single threaded language."

时间: 2024-08-08 01:06:40

Let it crash philosophy for distributed systems的相关文章

Let it crash philosophy part II

Designing fault tolerant systems is extremely difficult.  You can try to anticipate and reason about all of the things that can go wrong with your software and code defensively for these situations, but in a complex system it is very likely that some

可扩展的Web系统和分布式系统(Scalable Web Architecture and Distributed Systems)

Open source software has become a fundamental building block for some of the biggest websites. And as those websites have grown, best practices and guiding principles around their architectures have emerged. This chapter seeks to cover some of the ke

Scalable, Distributed Systems Using Akka, Spring Boot, DDD, and Java--转

原文地址:https://dzone.com/articles/scalable-distributed-systems-using-akka-spring-boot-ddd-and-java When data that needs to be processed grows large and can’t be contained within a single JVM, AKKA clusters provides features to build such highly scalabl

[分布式系统学习]阅读笔记 Distributed systems for fun and profit 抽象 之二

本文是阅读 http://book.mixu.net/distsys/abstractions.html 的笔记. 第二章的题目是"Up and down the level of abstraction".这一章里面,作者主要介绍了分布式系统里面的一个重要概念:CAP理论. 什么是CAP理论呢?就是说在任何情况下,分布式系统只能满足下面三项中的两个: 一致性(Consistency),这里指的强一致性. 可用性(Availability). 对网络分割容错(Partition tol

[分布式系统学习]阅读笔记 Distributed systems for fun and profit 之三 时间和顺序

这是阅读 http://book.mixu.net/distsys/time.html 的笔记,是该系列的第三章. 为什么时间和顺序很重要呢?为什么我们关系事件A发生在事件B之前? 因为分布式系统要解决的问题是把单机上的问题通过多机来解决.然而传统单机的程序总是假设确定的顺序.对于分布式程序来说,正确性最简单的定义就是,跑起来像一台单机上运行的程序. 全序和偏序 具体的定义大家可以去翻离散书.简单地说,全序就是在集合里任何两个元素都可以比较,分出大小.偏序中,某些元素是没办法比较大小的. 在单节

[分布式系统学习]阅读笔记 Distributed systems for fun and profit 之四 Replication 拷贝

阅读http://book.mixu.net/distsys/replication.html的笔记,是本系列的第四章 拷贝其实是一组通信问题,为一些子问题,例如选举,失灵检测,一致性和原子广播提供了上下文. 同步拷贝 可以看到三个不同阶段,首先client发送请求.然后同步拷贝,同步意味着这时候client还在等待着请求返回.最后,服务器返回. 这就是N-of-N write,只有等所有N个节点成功写,才返回写成功给client.系统不容忍任何服务器下线.从性能上说,最慢的服务器决定了写的速度

《DISTRIBUTED SYSTEMS Concepts and Design》读书笔记 一

p.p1 { margin: 0.0px 0.0px 0.0px 0.0px; font: 16.4px Arial; color: #333333 } p.p2 { margin: 0.0px 0.0px 0.0px 0.0px; font: 14.0px Arial; color: #333333 } p.p3 { margin: 0.0px 0.0px 0.0px 0.0px; font: 11.6px Arial; color: #333333 } li.li2 { margin: 0.

Distributed Systems 分布式系统

先来扯淡,几天是14年12月31日了,茫茫然,2014就剩最后一天了.这两天国大都放假,我给自己安排了四篇博客欠账,这就是其中的第一篇,简单介绍一些分布式系统的一些概念和设计思想吧.后面三篇分别是Network File System(NFS).Andrew File System(AFS)以及The design of naming schemes.废话不多说了,开始吧. 1. 本文讲什么 前面也提到了,本文会简单介绍一下分布式系统的一些概念和设计思想. 2. 分布式系统是什么 分布式系统,听

Distributed Systems: When you should build them, and how to scale. A step-by-step guide.

原文链接 https://medium.com/free-code-camp/distributed-systems-when-you-should-build-them-and-how-to-scale-a-step-by-step-guide-37e76a177218 分布式系统:何时构建它们以及如何扩展,分步指南. 当我开始创建产品时,有多少初级开发人员患有冒名顶替综合症,这总是让我感到震惊. 我明白了,有很多令人兴奋的例子顶尖公司与可以解决极其复杂的分布式系统数十亿请求,优雅地提升数以百