Online Backup of HBase Tables with the CopyTable Tool

CopyTable is a simple Apache HBase utility that, unsurprisingly, can be used for copying individual tables within an HBase cluster or from one HBase cluster to another. In this blog post, we’ll talk about what this tool is, why you would want to use it, how
to use it, and some common configuration caveats.

Use cases:

CopyTable is at its core an Apache Hadoop MapReduce job that uses the standard HBase Scan read-path interface to read records from an individual table and writes them to another table (possibly on a separate cluster) using the standard HBase Put write-path
interface. It can be used for many purposes:

  • Internal copy of a table (Poor man’s snapshot)
  • Remote HBase instance backup
  • Incremental HBase table copies
  • Partial HBase table copies and HBase table schema changes
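If you just want to see which options your version of the tool supports, running it with no arguments prints a usage summary (the exact set of flags varies by release):

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable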

Assumptions and limitations:

The CopyTable tool has some basic assumptions and limitations. First, when used across two clusters, both clusters must be online, and the target table must already exist on the destination instance with the same column families defined as on the source table.

Since the tool uses standard scans and puts, the target cluster doesn’t have to have the same number of nodes or regions. In fact, it can have different numbers of tables, different numbers of region servers, and completely different region split boundaries. Since we are copying entire tables, you can use performance optimization settings such as larger scanner caching values for more efficiency. Using the put interface also means that copies can be made between clusters of different minor versions (0.90.4 -> 0.90.6, CDH3u3 -> CDH3u4) or versions that are wire compatible (0.92.1 -> 0.94.0).
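For example, one way to raise scanner caching for a single copy run is to pass it as a -D property on the command line (a sketch; hbase.client.scanner.caching is the standard property name, but whether generic -D options are honored depends on your HBase version, so alternatively set the property in the client’s hbase-site.xml). Here tableName stands in for the table being copied:

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable -Dhbase.client.scanner.caching=100 ... tableName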

Finally, HBase only provides row-level ACID guarantees; this means that while a CopyTable job is running, rows may be inserted or updated concurrently, and each such concurrent edit will either be completely included or completely excluded. While individual rows will be consistent, there are no guarantees about the consistency, causality, or order of puts across rows.

Internal copy of a table (Poor man’s snapshot)

Versions of HBase up to and including the most recent 0.94.x versions do not support table snapshotting. Despite HBase’s ACID limitations, CopyTable can be used as a naive snapshotting mechanism that makes a physical copy of a particular table.

Let’s say that we have a table, tableOrig, with column families cf1 and cf2. We want to copy all its data to tableCopy. We need to first create tableCopy with the same column families:

dstCluster$ echo "create ‘tableOrig‘, ‘cf1‘, ‘cf2‘" | hbase shell

We can then create and copy the table with a new name on the same HBase instance:

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --new.name=tableCopy tableOrig

This starts an MR job that will copy the data.
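Once the job finishes, a quick sanity check is to compare row counts using the RowCounter MapReduce job that ships with HBase (the hbase shell's count command works too, but is slow for large tables). Keep the row-level ACID caveat above in mind: the counts can legitimately differ if writes arrived during the copy.

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter tableOrig
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter tableCopy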

Remote HBase instance backup

Let’s say we want to copy data to another cluster. This could be a one-off backup, a periodic job, or a way of bootstrapping cross-cluster replication. In this example, we’ll have two separate clusters: srcCluster and dstCluster.

In this multi-cluster case, CopyTable is a push process: your source is the HBase instance your current hbase-site.xml refers to, and the added arguments point to the destination cluster and table. This also assumes that all of the MR TaskTrackers can access all the HBase and ZK nodes in the destination cluster. This configuration mechanism also means that you could run the job on a remote cluster by overriding the hbase/MR configs to use settings from any accessible remote cluster and specifying the ZK nodes in the destination cluster. This could be useful if you wanted to copy data from an HBase cluster with lower SLAs and didn’t want to run MR jobs on it directly.
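As a rough sketch of that remote-submission idea (the configuration directory and jobCluster prompt below are hypothetical; the bin/hbase script’s --config flag points it at an alternate HBASE_CONF_DIR, and the --peer.adr destination syntax is explained just below), you could launch the job from a cluster you control while the supplied configs point the HBase client at the lower-SLA source cluster:

# /etc/hbase/conf.lowsla is assumed to contain an hbase-site.xml whose
# hbase.zookeeper.quorum points at the lower-SLA source cluster's ZK ensemble
jobCluster$ hbase --config /etc/hbase/conf.lowsla org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase tableOrig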

You will use the --peer.adr setting to specify the destination cluster’s ZK ensemble (i.e., the cluster you are copying to). For this we need the ZK quorum’s IP and port as well as the root ZK znode for our HBase instance. Let’s say one of these machines is dstClusterZK (listed in hbase.zookeeper.quorum) and that we are using the default ZK client port 2181 (hbase.zookeeper.property.clientPort) and the default ZK znode parent /hbase (zookeeper.znode.parent). (Note: If you had two HBase instances using the same ZK ensemble, you’d need a different zookeeper.znode.parent for each cluster.)

# create new tableOrig on destination cluster
dstCluster$ echo "create ‘tableOrig‘, ‘cf1‘, ‘cf2‘" | hbase shell
# on source cluster run copy table with destination ZK quorum specified using --peer.adr
# WARNING: In older versions, you are not alerted about any typo in these arguments!
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase tableOrig

Note that you can use the --new.name argument together with --peer.adr to copy to a differently named table on the dstCluster.

# create new tableCopy on destination cluster
dstCluster$ echo "create ‘tableCopy‘, ‘cf1‘, ‘cf2‘" | hbase shell
# on source cluster run copy table with destination --peer.adr and --new.name arguments.
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --peer.adr=dstClusterZK:2181:/hbase --new.name=tableCopy tableOrig

This will copy data from tableOrig on the srcCluster to the dstCluster’s tableCopy table.

Incremental HBase table copies

Once you have a copy of a table on a destination cluster, how do you copy new data that is later written to the source cluster? Naively, you could run the CopyTable job again and copy over the entire table. However, CopyTable provides a more efficient incremental copy mechanism that copies only the rows updated within a specified window of time from the srcCluster to the backup dstCluster. Thus, after the initial copy, you could have a periodic cron job that copies only the previous hour’s data from srcCluster to the dstCluster.

This is done by specifying the --starttime and --endtime arguments. Times are specified as decimal milliseconds since the Unix epoch.

# WARNING: In older versions, you are not alerted about any typo in these arguments!
# Copy from the beginning of time until timeEnd.
# NOTE: You must include a start time for the end time to be respected; the start time cannot be 0.
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=1 --endtime=timeEnd ...
# Copy starting from and including timeStart until the end of time.
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=timeStart ...
# Copy rows starting from and including timeStart up to, but excluding, timeEnd.
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable ... --starttime=timeStart --endtime=timeEnd
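Tying this back to the hourly cron idea, a wrapper script could compute the previous hour’s window in milliseconds and hand it to CopyTable (a minimal sketch; it assumes GNU date on the host and reuses the table and ZK names from the earlier examples):

#!/bin/bash
# copy_last_hour.sh: incrementally copy the previous hour's edits (sketch)
END=$(( $(date +%s) * 1000 ))      # now, in milliseconds since the Unix epoch
START=$(( END - 3600 * 1000 ))     # one hour earlier
hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=${START} --endtime=${END} --peer.adr=dstClusterZK:2181:/hbase tableOrig

In practice you would want successive windows to line up exactly (or overlap slightly) so that no edits fall between runs.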

Partial HBase table copies and HBase table schema changes

By default, CopyTable will copy all column families from matching rows. CopyTable provides options for copying data from only specific column families. This could be useful for copying original source data while excluding derived-data column families that are added by follow-on processing.

By adding these arguments, we copy data only from the specified column families:

  • --families=srcCf1
  • --families=srcCf1,srcCf2

Starting with 0.92.0, you can copy while changing the column family name:

  • --families=srcCf1:dstCf1
    • copy from srcCf1 to dstCf1
  • --families=srcCf1:dstCf1,dstCf2,srcCf3:dstCf3
    • copy from srcCf1 to dstCf1, copy dstCf2 to dstCf2 (no rename), and srcCf3 to dstCf3

Please note that dstCf* must be present in the dstCluster table!
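For example, a partial remote copy that keeps only cf1 and renames it along the way (a sketch reusing the earlier names; cf1renamed is a hypothetical target family that must exist on the destination table, and the rename syntax needs 0.92.0 or later):

# create the destination table with only the renamed family
dstCluster$ echo "create 'tableCopy', 'cf1renamed'" | hbase shell
# copy just cf1 from tableOrig, renaming it to cf1renamed on the destination
srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --families=cf1:cf1renamed --peer.adr=dstClusterZK:2181:/hbase --new.name=tableCopy tableOrig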

Starting with 0.94.0, new options are offered to copy delete markers and to include a limited number of overwritten versions. Previously, if a row was deleted in the source cluster, the delete would not be copied; instead, a stale version of that row would remain in the destination cluster. This takes advantage of some of the 0.94.0 release’s advanced features.

  • --versions=vers
    • where vers is the number of cell versions to copy (default is 1, i.e. the latest only)
  • --all.cells
    • also copy delete markers and deleted cells
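For example, a remote backup run that also carries delete markers and up to five versions per cell (a sketch reusing the earlier names and assuming a 0.94.0 or later release):

srcCluster$ hbase org.apache.hadoop.hbase.mapreduce.CopyTable --versions=5 --all.cells --peer.adr=dstClusterZK:2181:/hbase tableOrig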

Common Pitfalls

The HBase client in the 0.90.x, 0.92.x, and 0.94.x versions always uses zoo.cfg if it is on the classpath, even if an hbase-site.xml file specifies other ZooKeeper quorum configuration settings. This “feature” causes a problem that is common in CDH3 HBase because its packages default to including a directory where zoo.cfg lives on HBase’s classpath. This can and has led to frustration when trying to use CopyTable (HBASE-4614). The workaround is to exclude the zoo.cfg file from HBase’s classpath and to specify the ZooKeeper configuration properties in your hbase-site.xml file. See http://hbase.apache.org/book.html#zookeeper
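For reference, the ZooKeeper settings to carry in hbase-site.xml are the same three properties used to build --peer.adr above (a minimal sketch; the hostnames are placeholders):

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
</property>
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase</value>
</property>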

Conclusion

CopyTable provides simple but effective disaster recovery insurance for HBase 0.90.x (CDH3) deployments. In conjunction with the replication feature found and supported in CDH4’s 0.92.x-based HBase, CopyTable’s incremental features become less valuable, but its core functionality is important for bootstrapping a replicated table. While more advanced features such as HBase snapshots (HBASE-50) may aid with disaster recovery once implemented, CopyTable will still be a useful tool for the HBase administrator.
