Let it crash philosophy part II

Designing fault-tolerant systems is extremely difficult. You can try to anticipate everything that can go wrong with your software and code defensively for those situations, but in a complex system it is very likely that some combination of events or inputs will eventually conspire against you to cause a failure or bug.

In certain areas of the software community, such as Erlang and Akka, there’s a philosophy that rather than trying to handle and recover from every possible exceptional and failure state, you should instead fail early and let your processes crash, then recycle them back into the pool to serve the next request. This gives the system a kind of self-healing property where it recovers from failure without ceremony, whilst freeing the developer from overly defensive error handling.
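As a rough illustration of the crash-and-recycle idea, here is a minimal sketch in Python rather than Erlang or Akka. The `Supervisor` class and `make_worker` function are hypothetical names invented for this example, not part of any library. The worker contains no defensive checks, and the supervisor does not try to repair a failed worker; it simply replaces it with a fresh one.

```python
from typing import Callable, Optional


class Supervisor:
    """Runs a worker and replaces it with a fresh one whenever it crashes."""

    def __init__(self, make_worker: Callable[[], Callable[[int], int]]) -> None:
        self._make_worker = make_worker
        self._worker = make_worker()
        self.restarts = 0

    def handle(self, request: int) -> Optional[int]:
        try:
            return self._worker(request)
        except Exception:
            # No attempt to repair the broken worker: discard it and start clean.
            self._worker = self._make_worker()
            self.restarts += 1
            return None  # this request failed; the next one gets a fresh worker


def make_worker() -> Callable[[int], int]:
    def worker(request: int) -> int:
        # Deliberately no defensive checks: a request of 0 crashes the worker.
        return 100 // request
    return worker


supervisor = Supervisor(make_worker)
results = [supervisor.handle(r) for r in (5, 0, 20)]
```

The bad request in the middle costs one failed response and one restart, and the request after it is served normally.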

I believe that implementing let it crash semantics and working within this mindset will improve almost any application, not just the real-time telecoms systems where Erlang was born. When you adopt let it crash, redundancy and defence against errors are baked into the architecture, rather than attempted by defensively anticipating scenarios right down in the guts of the code. It also encourages you to build more redundancy throughout your system.

Also ask yourself: if the components or services in your application did crash, how well would your system recover, with or without human intervention? Very few applications are fully automatically recoverable, and yet implementing this feels like relatively low-hanging fruit compared to writing 100% fault-tolerant code.

So how do we start to put this into practice?

At the hardware level, you can look towards the ‘Google model’ of commodity servers, whereby the failure of any given server supporting the system does not lead to a fatal degradation of service. This is easier in the cloud world, where the economics encourage us to use a larger number of small virtualised servers. Just let them crash, and design for the fact that servers can die at a moment’s notice.

Your application might be composed of different logical services, such as a user authentication service or a shopping cart system. Design the system to let entire services crash. Where appropriate, your application should be able to proceed and degrade gracefully whilst the service is not available, or fall back onto another instance of the service whilst the first one is recycling. Nothing should be in the critical code path because it might crash!
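The fallback-then-degrade behaviour described above could be sketched as follows. This is only an illustration under assumed names: `call_with_fallback`, `cart_primary` and `cart_secondary` are hypothetical, and a crashed instance is simulated by raising `ConnectionError`.

```python
from typing import Callable, List, Sequence


def call_with_fallback(
    instances: Sequence[Callable[[str], List[str]]],
    user_id: str,
    degraded: List[str],
) -> List[str]:
    """Try each service instance in turn; return a degraded result if all are down."""
    for instance in instances:
        try:
            return instance(user_id)
        except ConnectionError:
            # This instance is crashed or recycling: move on to the next one.
            continue
    return degraded


def cart_primary(user_id: str) -> List[str]:
    # Simulates an instance that is currently restarting.
    raise ConnectionError("primary cart instance is recycling")


def cart_secondary(user_id: str) -> List[str]:
    return ["book", "kettle"]


with_fallback = call_with_fallback([cart_primary, cart_secondary], "user-1", degraded=[])
all_down = call_with_fallback([cart_primary], "user-1", degraded=[])
```

Here the page still renders with an empty cart when every instance is down, rather than failing outright. Whether an empty result is an acceptable degraded state depends on the service.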

Ideally, your distributed system will be organised to scale horizontally across different server nodes.  The system should load balance or intelligently route between processes in the pool, and different nodes should be able to join or leave the pool without too much ceremony or impact to the application.  When you have this style of horizontal scalability, let nodes within your application crash and rejoin the pool when they’re ready.
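A minimal sketch of such a pool, assuming simple round-robin routing: the `NodePool` class and the node names are hypothetical. A real system would normally drive `join` and `leave` from health checks or a membership protocol instead of calling them by hand.

```python
from typing import List


class NodePool:
    """Round-robin router over a set of nodes that can join or leave at any time."""

    def __init__(self) -> None:
        self._nodes: List[str] = []
        self._next = 0

    def join(self, node: str) -> None:
        if node not in self._nodes:
            self._nodes.append(node)

    def leave(self, node: str) -> None:
        if node in self._nodes:
            self._nodes.remove(node)

    def route(self) -> str:
        if not self._nodes:
            raise RuntimeError("no nodes available")
        node = self._nodes[self._next % len(self._nodes)]
        self._next += 1
        return node


pool = NodePool()
for name in ("node-1", "node-2", "node-3"):
    pool.join(name)

before = [pool.route() for _ in range(3)]
pool.leave("node-2")                       # node-2 crashes
during = [pool.route() for _ in range(4)]  # traffic continues on the survivors
pool.join("node-2")                        # node-2 has restarted and rejoins
after = [pool.route() for _ in range(3)]
```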

What if we go further and implement let it crash semantics for our infrastructure?

For instance, say we have some messaging system or message broker that transports messages between the components of our application. What if we let that crash and come back online later? Could we design the application so that this is not as fatal as it sounds, perhaps by allowing application components to write to, or dynamically switch between, two message brokers?
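One way to picture switching between two brokers is the sketch below. It uses an in-memory stand-in rather than a real broker client: `InMemoryBroker` and `FailoverPublisher` are hypothetical classes, and an outage is simulated with a flag that makes `publish` raise `ConnectionError`.

```python
from typing import List, Sequence


class InMemoryBroker:
    """Stand-in for a message broker; `up` simulates whether it is reachable."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.up = True
        self.messages: List[str] = []

    def publish(self, message: str) -> None:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        self.messages.append(message)


class FailoverPublisher:
    """Publishes to the active broker and switches to the next one on failure."""

    def __init__(self, brokers: Sequence[InMemoryBroker]) -> None:
        self._brokers = list(brokers)
        self._active = 0

    def publish(self, message: str) -> str:
        for _ in range(len(self._brokers)):
            broker = self._brokers[self._active]
            try:
                broker.publish(message)
                return broker.name  # report which broker accepted the message
            except ConnectionError:
                # Active broker has crashed: rotate to the next one and retry.
                self._active = (self._active + 1) % len(self._brokers)
        raise ConnectionError("no broker available")


broker_a = InMemoryBroker("broker-a")
broker_b = InMemoryBroker("broker-b")
publisher = FailoverPublisher([broker_a, broker_b])

used = [publisher.publish("order-1")]
broker_a.up = False                      # broker-a crashes
used.append(publisher.publish("order-2"))
broker_a.up = True                       # broker-a comes back later
used.append(publisher.publish("order-3"))
```

In this sketch the publisher stays on the second broker after failing over, so a recovered broker is only used again if the current one fails. Consumers would also need to read from both brokers, which is not shown here.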

Distributed NoSQL data stores give us let it crash capability at the data persistence level. Data is stored in a distributed grid of nodes and replicated to at least two different hardware nodes. At this point, it’s easier to let database nodes crash than to try to achieve 100% uptime.

At the network level, we can design topologies such that we do not care if routers or network links crash, because there’s always some alternate route through the network. Let them crash, and when they come back the optimal routes will be there, ready for our application to use again.

Let it crash is more than simple redundancy. It’s about implementing self-recoverability in the application. It’s about putting your site reliability efforts into your architecture rather than low-level defensive coding. It’s about decoupling your application and introducing asynchronicity, in recognition that things go wrong in surprising ways. Ironically, sitting back and coolly letting your software crash can lead to better software!

Time: 2024-10-05 23:14:27
