High availability with the Distributed Replicated Block Device

The 2.6.33 Linux kernel introduced a useful new service called the Distributed Replicated Block Device (DRBD). This service mirrors an entire block device to another networked host at run time, permitting the development of high-availability clusters for block data. Explore the ideas behind the DRBD and its implementation in the Linux kernel.

The Distributed Replicated Block Device (DRBD) provides a networked version of data mirroring, classified under the redundant array of independent disks (RAID) taxonomy as RAID-1. Let's begin with a quick introduction to high availability (HA) and RAID, and then explore the architecture and use of the DRBD.

Introducing high availability

High availability is a system design principle for increased availability. Availability, or the measure of a system's operational continuity, is commonly defined as a percentage of uptime within the span of a year. For example, if a given system is available 99% of the time, then its downtime for a year is measured as 3.65 days. The value 99% is usually called two nines. Compare this to five nines (99.999%), and the maximum downtime falls to 5.26 minutes per year. That's quite a difference and requires careful design and high quality to achieve.
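
If you want to check those numbers yourself, downtime is simply (1 - availability) multiplied by the length of a year. A quick sketch using the standard bc calculator (the figures are the ones quoted above, not new measurements):

$ echo "(1 - 0.99) * 365" | bc -l                 # two nines: 3.65 days per year
$ echo "(1 - 0.99999) * 365 * 24 * 60" | bc -l    # five nines: roughly 5.26 minutes per year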

One of the most common implementations for HA is redundancy with failover. In this model, for example, you can define multiple paths to a given resource, with the available path being used and the redundant path used upon failure. Enterprise-class disk drives illustrate this concept, as they provide two ports of access (compared to one access port for consumer-grade drives).

As I write this, I'm sitting on a Boeing 757. Each wing carries its own jet engine. Although the engines are extremely reliable, one could fail, and the plane could continue to fly safely on the single remaining engine. That's HA (via redundancy), and it applies to many applications and scenarios.

My first job was for a large defense company building geosynchronous communications satellites. At the core of these satellites was a radiation-hardened computing system that was responsible for command and telemetry (a satellite's user interface), power and thermal management, and pointing (otherwise known as keeping telephone conversations and television content flowing). For availability, this computing system was a redundant design, with two sets of processors and buses and the ability to switch between a master and a slave if the master was found to be unresponsive. To make a long story short, redundancy in systems design is a common technique to increase availability at the cost of additional hardware (and software).

Redundancy in storage

Linux kernel inclusion

The process for inclusion into the Linux kernel for DRBD started back in July 2007. At that point, DRBD was at version 8.0. Two-and-a-half years later, in December 2009, DRBD entered the mainline 2.6.33 kernel (DRBD version 8.3.7). Today, the 8.3.8 DRBD release is in the current 2.6.35 Linux kernel.

Not surprisingly, using redundancy in storage systems is also common, particularly in enterprise-class designs. It's so common that a standard approach, RAID, exists with a variety of underlying algorithms, each with different capabilities and characteristics.

RAID was first defined in 1987 at the University of California, Berkeley. Traditional RAID levels include RAID-0, which implements striping across disks for performance (but not redundancy), and RAID-1, which implements mirroring across two disks so that two copies of the information exist. With RAID-1, a disk can fail, and the information can still be retrieved from the other copy. Other RAID levels include RAID-5, which implements block-level striping with distributed parity across disks, and RAID-6, which implements block-level striping with double distributed parity. Whereas RAID-5 can survive the failure of a single drive, RAID-6 can survive two drive failures (though more capacity is consumed by parity information). RAID-1 is simple, but it's wasteful in terms of capacity utilization. RAID-5 and RAID-6 are more frugal with storage capacity, but they typically require additional hardware to avoid burdening the processor with the parity calculations. As usual, trade-offs abound. Figure 1 provides a graphical summary of the RAID-0 and RAID-1 schemes.

Figure 1. Graphical summary of RAID schemes for levels 0 and 1
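
For comparison with the networked mirroring that DRBD provides, the same two schemes can be built locally with the Linux md driver. A minimal sketch using mdadm (the device names are placeholders; adjust them for your system):

# mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
# mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdd1 /dev/sde1
# cat /proc/mdstat

The first command builds a two-disk mirror (RAID-1), the second a two-disk stripe (RAID-0), and /proc/mdstat shows the state of both arrays. DRBD performs the same kind of mirroring as the first command, but with the second copy held on another host.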

RAID technologies continue to evolve, with a number of so-called nonstandard techniques coming into play. These techniques include Oracle's RAID-Z scheme (which solves RAID-5's write-hole problem); NetApp's RAID-DP (for diagonal parity), which extends RAID-6; and IBM's RAID 1E (for enhanced), which implements both striping (RAID-0) and mirroring (RAID-1) over an odd number of disks. Numerous other traditional and nontraditional RAID schemes exist: See the links in Resources for details.

DRBD operation

Now, let's look at the basic operation of the DRBD before digging into the architecture. Figure 2 provides an overview of DRBD in the context of two independent servers that provide independent storage resources. One of the servers is commonly defined as the primary and the other as the secondary (typically as part of a clustering solution). Users access the DRBD block devices as a traditional local block device or as part of a storage area network or network-attached storage solution. The DRBD software synchronizes the primary and secondary servers for user-initiated Read and Write operations as well as for other synchronization operations.

Figure 2. Basic DRBD model of operation

In the active/passive model, the primary node is used for Read and Write operations for all users. The secondary node is promoted to primary if the clustering solution detects that the primary node is down. Write operations arrive through the primary node and are applied to both the local storage and the secondary node's storage simultaneously (see Figure 3). DRBD supports two modes for Write operations, called fully synchronous and asynchronous.

In fully synchronous mode, Write operations must be safely on both nodes' storage before the Write transaction is acknowledged to the writer. In asynchronous mode, the Write transaction is acknowledged after the write data is stored on the local node's storage; the replication of the data to the peer node occurs in the background. Asynchronous mode is less safe, because a window exists for a failure to occur before data is replicated, but it is faster than fully synchronous mode, which is the safest mode for data protection. Although fully synchronous mode is recommended, asynchronous mode is useful in situations where replication occurs over longer distances (such as over the wide area network for geographic disaster recovery scenarios). Read operations are performed using local storage (unless the local disk has failed, at which point the secondary storage is accessed through the secondary node).

Figure 3. Read/Write operations with DRBD
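
The replication mode is selected per resource in drbd.conf through the protocol keyword. A minimal sketch, assuming DRBD 8.3 configuration syntax and a hypothetical resource named r0 (the on-host sections that name the devices and endpoints appear in the architecture discussion later):

resource r0 {
  protocol C;    # fully synchronous: acknowledge a Write only after both nodes have it on stable storage
  # protocol A;  # asynchronous: acknowledge after the local write completes; often used over WAN links
}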

DRBD can also support the active/active model, such that Read and Write operations can occur at both servers simultaneously in what's called the shared-disk mode. This mode relies on a shared-disk file system, such as the Global File System (GFS) or the Oracle Cluster File System version 2 (OCFS2), which includes distributed lock-management capabilities.
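
Dual-primary operation is enabled in the resource's net section, after which both nodes can be promoted and a cluster file system layered on top. A hedged sketch, again assuming DRBD 8.3 syntax and the hypothetical resource r0:

resource r0 {
  net {
    allow-two-primaries;   # permit both nodes to hold the primary role at the same time
  }
}

# drbdadm primary r0       (run on both nodes once the resource is connected)

Without the distributed lock management that GFS or OCFS2 provides, mounting an ordinary file system on both primaries would corrupt it, so this mode is useful only with a shared-disk file system.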

DRBD architecture

DRBD is split into two independent pieces: a kernel module that implements the DRBD behaviors and a set of user-space administration applications used to manage the DRBD disks (see Figure 4). The kernel module implements a driver for a virtual block device (which is replicated between a local disk and a remote disk across the network). As a virtual disk, DRBD provides a flexible model that a wide variety of applications can use (from file systems to applications that rely on a raw disk, such as a database). The DRBD module implements an interface not only to the underlying block driver (as defined by the disk configuration item in drbd.conf) but also to the networking stack (whose endpoint is defined by an IP address and port number, also in drbd.conf).

Figure 4. DRBD in the Linux architecture
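
As noted above, both the backing disk and the network endpoint live in drbd.conf. The following sketch shows what a complete resource definition might look like, assuming DRBD 8.3 syntax; the host names (alpha, bravo), device paths, and addresses are placeholders:

resource r0 {
  protocol C;
  on alpha {
    device    /dev/drbd0;         # the virtual block device DRBD exports
    disk      /dev/sdb1;          # the underlying local block device
    address   192.168.1.10:7789;  # endpoint of the replication link
    meta-disk internal;           # keep DRBD metadata on the backing device
  }
  on bravo {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.1.11:7789;
    meta-disk internal;
  }
}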

In user space, DRBD provides a set of utilities for managing replicated disks. You use the drbdsetup utility to configure the DRBD module in the Linux kernel and drbdmeta to manage DRBD's metadata structures. The drbdadm utility is a high-level front end to both and is the tool most commonly used to manage DRBD; it pulls its details from the DRBD configuration file in /etc/drbd.conf.
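
In practice, bringing a resource online with drbdadm looks something like the following sketch (the resource name r0 is the hypothetical one used earlier; the exact promotion syntax varies slightly between DRBD releases):

# drbdadm create-md r0
# drbdadm up r0
# drbdadm -- --overwrite-data-of-peer primary r0
# cat /proc/drbd

The first two commands, run on both nodes, write the DRBD metadata and then attach and connect the device. The third, run on one node only, promotes it to primary and starts the initial synchronization, and /proc/drbd shows the connection state and resynchronization progress.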

Using the disk model, DRBD exports a special device (/dev/drbdX) that you can use just like a regular disk. Listing 1 illustrates building a file system and mounting the DRBD for use by the host (though it omits other necessary configuration steps, which are referenced in the Resources section).

Listing 1. Building and mounting a file system on a primary DRBD disk
# mkfs.ext3 /dev/drbd0
# mkdir /mnt/drbd
# mount -t ext3 /dev/drbd0 /mnt/drbd

You can use the virtual disk that DRBD provides like any other disk, with the replication occurring transparently underneath. Now, take a look at some of the major features of DRBD, including its ability to self-heal.
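
A clustering solution such as Pacemaker or Heartbeat normally automates promotion, but a manual switchover illustrates the model. A sketch, continuing with the hypothetical resource r0 and the mount point from Listing 1:

(on the current primary)
# umount /mnt/drbd
# drbdadm secondary r0

(on the peer)
# drbdadm primary r0
# mount -t ext3 /dev/drbd0 /mnt/drbd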

DRBD major features

Although the idea of a replicated disk is conceptually simple (and its development relatively straightforward), there are inherent complexities in a robust implementation. For example, replicating blocks to a networked drive is fairly simple, but handling failures and transient outages (and the resulting synchronization of the drives) is where the real solution begins. This section describes the major features that DRBD provides, including the variety of failure models that DRBD supports.

Replication modes

Earlier, this article explored two methods for replicating data between nodes: fully synchronous and asynchronous. DRBD supports a third mode that provides a bit more data protection than asynchronous mode at a slight cost in performance. Memory (or semi-) synchronous mode sits between the two: the Write operation is acknowledged after the data has been stored on the local disk and has reached the peer node's memory. This mode provides more protection, because the data is mirrored to another node, albeit in volatile memory rather than on non-volatile disk. It's still possible to lose data (for example, if both nodes fail), but failure of the primary node alone does not cause data loss, because the data has already been replicated.
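
In the configuration file, memory synchronous replication is selected as protocol B, a sketch in the same hedged DRBD 8.3 syntax used earlier:

resource r0 {
  protocol B;   # memory (semi-) synchronous: acknowledge once the peer holds the data in memory
}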

Online device verification

DRBD permits verification of the local and peer devices in an online fashion (that is, while input/output continues). Verification confirms that the local and remote disks are replicas of one another, which can be a time-consuming operation. To preserve bandwidth between the nodes (likely a constrained resource), DRBD doesn't move the data itself between nodes to validate it but instead moves cryptographic digests (hashes) of the data. A node computes the hash of a block and transfers this much smaller signature to the peer node, which calculates the hash of its own copy and compares the two. If the hashes match, the block is properly replicated. If they differ, the out-of-date block is marked as out of sync, and subsequent synchronization brings it back in line.
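
Online verification is enabled by naming a digest algorithm for the resource and is then triggered from the command line. A sketch, assuming the DRBD 8.3 configuration layout in which verify-alg sits in the syncer section (later releases reorganize these options):

resource r0 {
  syncer {
    verify-alg sha1;   # digest used to compare blocks during online verification
  }
}

# drbdadm verify r0
# cat /proc/drbd

The drbdadm verify command starts the comparison, /proc/drbd reports its progress, and any blocks found to differ are marked out of sync and repaired by the next synchronization.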

Communication integrity

Communicating between nodes has the potential to introduce errors into the replicated data (either from a software or firmware bug or from any other error not detected by TCP/IP's checksum). To provide data integrity, DRBD calculates message integrity codes to accompany data moving between nodes. This allows the receiving node to validate its incoming data and request retransmission when an error is found. DRBD uses the Linux crypto application programming interface and is therefore flexible on the integrity algorithm used.
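
The integrity check is enabled per resource by naming a digest from the kernel crypto API in the net section. A hedged sketch in the same DRBD 8.3 style:

resource r0 {
  net {
    data-integrity-alg sha1;   # checksum every replication packet; any kernel crypto digest (md5, crc32c, ...) can be named
  }
}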

Automatic recovery

DRBD can recover from a wide variety of errors, but one of the most insidious is the so-called "split brain" situation. In this scenario, the communication link between the nodes fails, and both nodes believe that they are the primary. While primary, each node permits Write operations without propagating them to the peer node. This leads to inconsistent storage on the two nodes.

In most cases, split-brain recovery is performed manually, but DRBD provides several automatic methods for recovering from this situation. The recovery algorithm used depends on how the storage is actually used.

The simplest case for synchronizing storage after split brain is when one node saw no changes while the link was down: the node that did change simply synchronizes with the unchanged peer. Another simple approach is to discard the changes on the node that made the smaller number of changes. This permits the node with the larger change set to continue but means that the other host's changes are lost.

The other two approaches discard changes based on the temporal states of the nodes. In one, changes are discarded from the node that switched to primary last (the younger primary); in the other, changes are discarded from the node that switched to primary first (the older primary). You can select among these recovery policies in the DRBD configuration file, but the right choice ultimately depends on the application using the storage and on whether data can be discarded or manual recovery is necessary.
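
These policies map onto the after-split-brain options in the resource's net section. A sketch of one plausible combination, assuming DRBD 8.3 option names (discard-least-changes, discard-younger-primary, and discard-older-primary are the other values corresponding to the approaches above):

resource r0 {
  net {
    after-sb-0pri discard-zero-changes;  # neither node is primary at reconnect: keep the changes if only one side made any
    after-sb-1pri discard-secondary;     # one node is primary: throw away the secondary's changes
    after-sb-2pri disconnect;            # both are primary: stop replication and wait for manual recovery
  }
}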

Optimizing synchronization

A key aspect of a replicated storage device is an efficient method for synchronizing data between nodes. Two of the schemes DRBD uses are the activity log and the quick-sync bitmap. The activity log records blocks that were recently written to and thus defines which blocks need to be synchronized after a failure is resolved. The quick-sync bitmap defines which blocks are in sync (or out of sync) during a period of disconnection. When the nodes reconnect, synchronization uses this bitmap to quickly bring the nodes back to being exact replicas of one another. This time matters, because it represents the window during which the secondary disk is inconsistent.
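
Both mechanisms can be tuned in the resource's syncer section. A sketch with two commonly adjusted knobs, again assuming DRBD 8.3 syntax; the values shown are illustrative, not recommendations:

resource r0 {
  syncer {
    rate 40M;        # cap the bandwidth used for background resynchronization
    al-extents 257;  # number of extents tracked in the activity log (each covers roughly 4MB)
  }
}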

Conclusion

DRBD is a great asset if you're looking to increase the availability of your data, even on commodity hardware. It can be easily installed as a kernel module and configured using the available administration tools and wrappers. Even better, DRBD is open source, allowing you to tailor it to your needs (but check the DRBD road map first to see whether your need is in the works). DRBD supports a large number of useful options, so you can optimize it to uniquely fit your application.

Resources

Learn
  • The DRBD website provides the latest information on DRBD, its current feature list, a road map, and a description of the technology. You can also find a list of DRBD papers and presentations. Although DRBD is part of the mainline kernel (since 2.6.33), you can grab the latest source tarball at LINBIT.

  • High availability is a system property that ensures a degree of operation. This property typically involves redundancy as a way to avoid a single point of failure. Fault-tolerant system design is another important aspect for increasing availability.
  • The concept of RAID was born at the University of California, Berkeley in 1987. RAID is defined by levels, which specify the storage architecture and characteristics of the protection. You can learn more about the original RAID concept in the seminal paper "A Case for Redundant Arrays of Inexpensive Disks (RAID)."
  • Ubuntu provides a useful page for configuring and using DRBD. This page illustrates configuration of DRBD on primary and secondary hosts as well as testing DRBD in a number of failure scenarios.
  • DRBD is most useful in conjunction with clustering applications. Luckily, you can learn more about these applications and others (such as Pacemaker, Heartbeat, Logical Volume Manager, GFS, and OCFS2) and how they integrate with DRBD in the DRBD-enabled applications section of the DRBD manual.
  • This article referenced two shared-disk file systems—namely, the GFS and the OCFS2. Both are cluster file systems that embody high performance and HA.
