[Repost] Barriers and journaling filesystems

http://lwn.net/Articles/283161/

Journaling filesystems come with a big promise: they free system administrators from the need to worry about disk corruption resulting from system crashes. It is, in fact, not even necessary to run a filesystem integrity checker in such situations. The real world, of course, is a little messier than that. As a recent discussion shows, it may be even messier than many of us thought, with the integrity promises of journaling filesystems being traded off against performance.

A filesystem like ext3 works by maintaining a journal on a dedicated portion of the disk. Whenever a set of filesystem metadata changes are to be made, they are first written to the journal - without changing the rest of the filesystem. Once all of those changes have been journaled, a "commit record" is added to the journal to indicate that everything else there is valid. Only after the journal transaction has been committed in this fashion can the kernel do the real metadata writes at its leisure; should the system crash in the middle, the information needed to safely finish the job can be found in the journal. There will be no filesystem corruption caused by a partial metadata update.
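The role of the commit record can be sketched with a toy replay routine. This is only an illustration of the idea, not ext3's on-disk format: the `COMMIT` marker, the `replay` function, and the metadata keys are all hypothetical names.

```python
# Toy model of metadata journaling; NOT ext3's real journal format.
COMMIT = "COMMIT"  # hypothetical marker standing in for the commit record


def replay(journal, metadata):
    """Apply only those journaled changes that are followed by a commit record."""
    recovered = dict(metadata)
    pending = []
    for record in journal:
        if record == COMMIT:
            # The transaction is complete: its changes are safe to apply.
            for key, value in pending:
                recovered[key] = value
            pending = []
        else:
            pending.append(record)
    # Whatever is still pending belongs to an uncommitted transaction
    # and is simply ignored.
    return recovered


journal = [
    ("inode 12 size", 4096),
    ("bitmap block 3", "0b1011"),
    COMMIT,
    ("inode 13 size", 512),  # the crash happened before this was committed
]
state = replay(journal, {"inode 12 size": 0, "inode 13 size": 0})
print(state)
```

The first transaction is applied in full, while the change to "inode 13 size" is dropped, so the metadata never reflects a half-finished update.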

There is a hitch, though: the filesystem code must, before writing the commit record, be absolutely sure that all of the transaction's information has made it to the journal. Just doing the writes in the proper order is insufficient; contemporary drives maintain large internal caches and will reorder operations for better performance. So the filesystem must explicitly instruct the disk to get all of the journal data onto the media before writing the commit record; if the commit record gets written first, the journal may be corrupted. The kernel's block I/O subsystem makes this capability available through the use of barriers; in essence, a barrier forbids the writing of any blocks after the barrier until all blocks written before the barrier are committed to the media. By using barriers, filesystems can make sure that their on-disk structures remain consistent at all times.
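A small simulation may make the reordering hazard concrete. The `ToyDrive` class below is a made-up model, not a kernel or drive interface: it assumes a volatile write cache that the drive drains newest-first (a deliberately bad case) and a `barrier()` that pushes everything already written onto the media.

```python
class ToyDrive:
    """Hypothetical drive with a volatile write cache (illustration only)."""

    def __init__(self):
        self.cache = []  # writes acknowledged but not yet on the media
        self.media = []  # writes that survive a power loss

    def write(self, block):
        self.cache.append(block)

    def barrier(self):
        # Everything written before the barrier reaches the media
        # before any later write is accepted.
        self.media.extend(self.cache)
        self.cache.clear()

    def crash_after(self, n):
        # Assumed worst case: the cache is drained newest-first and
        # power is lost after n blocks have reached the media.
        self.media.extend(list(reversed(self.cache))[:n])
        self.cache.clear()
        return self.media


def commit(drive, use_barrier):
    drive.write("log-1")
    drive.write("log-2")
    if use_barrier:
        drive.barrier()
    drive.write("commit")
    return drive.crash_after(1)


without = commit(ToyDrive(), use_barrier=False)
with_barrier = commit(ToyDrive(), use_barrier=True)
print(without)       # ['commit']
print(with_barrier)  # ['log-1', 'log-2', 'commit']
```

Without the barrier, the commit record can end up on the media alone, claiming that log data is valid when it was never written. With the barrier, the commit record can only follow the log blocks.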

There is another hitch: the ext3 and ext4 filesystems, by default, do not use barriers. The option is there, but, unless the administrator has explicitly requested the use of barriers, these filesystems operate without them - though some distributions (notably SUSE) change that default. Eric Sandeen recently decided that this was not the best situation, so he submitted a patch changing the default for ext3 and ext4. That's when the discussion started.

Andrew Morton's response tells a lot about why this default is set the way it is:

"Last time this came up lots of workloads slowed down by 30% so I dropped the patches in horror. I just don't think we can quietly go and slow everyone's machines down by this much...

"There are no happy solutions here, and I'm inclined to let this dog remain asleep and continue to leave it up to distributors to decide what their default should be."

So barriers are disabled by default because they have a serious impact on performance. And, beyond that, the fact is that people get away with running their filesystems without using barriers. Reports of ext3 filesystem corruption are few and far between.

It turns out that the "getting away with it" factor is not just luck. Ted Ts'o explains what's going on: the journal on ext3/ext4 filesystems is normally contiguous on the physical media. The filesystem code tries to create it that way, and, since the journal is normally created at the same time as the filesystem itself, contiguous space is easy to come by. Keeping the journal together will be good for performance, but it also helps to prevent reordering. In normal usage, the commit record will land on the block just after the rest of the journal data, so there is no reason for the drive to reorder things. The commit record will naturally be written just after all of the other journal log data has made it to the media.

That said, nobody is foolish enough to claim that things will always happen that way. Disk drives have a certain well-documented tendency to stop cooperating at inopportune times. Beyond that, the journal is essentially a circular buffer; when a transaction wraps off the end, the commit record may be on an earlier block than some of the journal data. And so on. So the potential for corruption is always there; in fact, Chris Mason has a torture-test program which can make it happen fairly reliably. There can be no doubt that running without barriers is less safe than using them.
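The wrap-around case comes down to modular arithmetic. The sketch below assumes a simplified journal in which every block is usable and a transaction's commit record immediately follows its log blocks; `journal_blocks` and its parameters are illustrative names, not anything taken from the ext3 code.

```python
def journal_blocks(start, n_log_blocks, journal_size):
    """Return (log block positions, commit record position) in a circular journal."""
    log = [(start + i) % journal_size for i in range(n_log_blocks)]
    commit = (start + n_log_blocks) % journal_size
    return log, commit


# Usual case: the commit record sits just after the log blocks.
log, commit = journal_blocks(start=10, n_log_blocks=3, journal_size=100)
print(log, commit)      # [10, 11, 12] 13

# Wrapped case: the commit record lands on an earlier block than the log data.
log_w, commit_w = journal_blocks(start=97, n_log_blocks=3, journal_size=100)
print(log_w, commit_w)  # [97, 98, 99] 0
```

In the wrapped case the commit record is no longer the natural "next block" after the log data, so the sequential layout stops discouraging the drive from writing it first.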

Anybody can turn on barriers if they are willing to take the performance hit. Unless, of course, their filesystem is based on an LVM volume (as certain distributions do by default); it turns out that the device mapper code does not pass through or honor barriers. But, for everybody else, it would be nice if that performance cost could be reduced somewhat. And it seems that might be possible.

The current ext3 code - when barriers are enabled - performs a sequence of operations like this for each transaction:

  1. The log blocks are written to the journal.
  2. A barrier operation is performed.
  3. The commit record is written.
  4. Another barrier is executed.
  5. Metadata writes begin at some later point.

On ext4, the first barrier (step 2) can be omitted because the ext4 filesystem supports checksums on the journal. If the journal log data and the commit record are reordered, and if the operation is interrupted by a crash, the journal's checksum will not match the one stored in the commit record and the transaction will be discarded. Chris Mason suggests that it would be "mostly safe" to omit that barrier with ext3 as well, with a possible exception when the journal wraps around.
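The checksum idea can be sketched as follows. The use of CRC32 via Python's `zlib`, the helper names, and the block contents are assumptions made for illustration; the real journal format is not reproduced here.

```python
import zlib


def journal_checksum(log_blocks):
    """Running CRC32 over a transaction's log blocks (illustrative choice of checksum)."""
    crc = 0
    for block in log_blocks:
        crc = zlib.crc32(block, crc)
    return crc


def transaction_valid(blocks_on_media, commit_crc):
    """At recovery time, recompute the checksum and compare it to the commit record."""
    return journal_checksum(blocks_on_media) == commit_crc


log_blocks = [b"inode table update", b"block bitmap update"]
commit_crc = journal_checksum(log_blocks)  # stored in the commit record

# Crash case: the commit record reached the media, but the second log
# block did not, so that slot still holds old contents.
on_media = [b"inode table update", b"stale data from an old transaction"]

intact = transaction_valid(log_blocks, commit_crc)
torn = transaction_valid(on_media, commit_crc)
print(intact, torn)
```

A transaction whose checksum does not match is treated as if it had never been committed, which is why the barrier between the log blocks and the commit record becomes unnecessary.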

Another idea for making things faster is to defer barrier operations when possible. If there is no pressing need to flush things out, a few transactions can be built up in the journal and all shoved out with a single barrier. There is also some potential for improvement by carefully ordering operations so that barriers (which are normally implemented as "flush all outstanding operations to media" requests) do not force the writing of blocks which do not have specific ordering requirements.
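As a rough way to see the saving from deferred barriers, the following sketch counts flushes when several transactions share one barrier. It is a simplification that counts one barrier per flush point; `barriers_needed` and the numbers are hypothetical.

```python
def barriers_needed(n_transactions, batch_size):
    """Barriers issued when every `batch_size` transactions share one barrier."""
    full, rest = divmod(n_transactions, batch_size)
    # A final partial batch still needs its own barrier.
    return full + (1 if rest else 0)


per_transaction = barriers_needed(12, 1)
batched = barriers_needed(12, 4)
print(per_transaction, batched)  # 12 3
```

The trade-off is that none of the transactions in a batch can be considered safely on the media until the shared barrier completes, so this only helps when nothing is waiting on an individual transaction.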

In summary: it looks like the time has come to figure out how to make the cost of barriers palatable. Ted Ts'o seems to feel that way:

"I think we have to enable barriers for ext3/4, and then work to improve the overhead in ext4/jbd2. It's probably true that the vast majority of systems don't run under conditions similar to what Chris used to demonstrate the problem, but the default has to be filesystem safety."
