Introducing shard translator

by Krutika Dhananjay on December 23, 2015

GlusterFS-3.7.0 saw the release of the sharding feature, among several others. The feature was tagged as “experimental” as it was still in the early stages of development back then. Here is an introduction to the feature:

Why shard translator?

GlusterFS’ answer to very large files (those which can grow beyond a single brick) had never been clear. There is a stripe translator which allows you to do that, but it comes at the cost of flexibility – you can add servers only in multiples of stripe-count x replica-count, and mixing striped and unstriped files is not possible in an “elegant” way. This also happens to be a big limiting factor for the big data/Hadoop use case, where super large files are the norm (and where you want to split a file even if it could fit within a single server). The proposed solution is to replace the current stripe translator with a new “shard” translator.

What?

Unlike stripe, shard is not a cluster translator. It is placed on top of DHT. Initially all files are created as normal files, up to a certain configurable size. The first block (default 4MB) is stored like a normal file under its parent directory. Further blocks, however, are stored as separate files named by the GFID and block index in a separate namespace (like /.shard/GFID1.1, /.shard/GFID1.2 … /.shard/GFID1.N). File IO at a particular offset writes to the appropriate “piece file”, creating it if necessary. The aggregated file size and block count are stored in the xattr of the original file.
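
For instance, with the default 4MB block size, a write at a given offset lands in block number offset / block-size (integer division), where block 0 is the base file and every higher-numbered block N lives under /.shard as GFID.N. A quick back-of-the-envelope check for a write at the 10MB offset:

# echo "block index: $(( (10 * 1024 * 1024) / (4 * 1024 * 1024) ))"
block index: 2

So that write would land in the piece file /.shard/&lt;GFID&gt;.2, on whichever replica set DHT hashes that name to.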

Usage:

Here I have a 2×2 distributed-replicated volume.

# gluster volume info
Volume Name: dis-rep
Type: Distributed-Replicate
Volume ID: 96001645-a020-467b-8153-2589e3a0dee3
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: server1:/bricks/1
Brick2: server2:/bricks/2
Brick3: server3:/bricks/3
Brick4: server4:/bricks/4
Options Reconfigured:
performance.readdir-ahead: on

To enable sharding on it, this is what I do:

# gluster volume set dis-rep features.shard on
volume set: success

Now, to configure the shard block size to 16MB, this is what I do:

# gluster volume set dis-rep features.shard-block-size 16MB
volume set: success
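
If the settings took effect, both options should now show up under “Options Reconfigured” in the volume info output, roughly like this (trimmed to the relevant lines):

# gluster volume info dis-rep | grep shard
features.shard-block-size: 16MB
features.shard: on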

How files are sharded:

Now I write 84MB of data into a file named ‘testfile’.

# dd if=/dev/urandom of=/mnt/glusterfs/testfile bs=1M count=84
84+0 records in
84+0 records out
88080384 bytes (88 MB) copied, 13.2243 s, 6.7 MB/s

Let’s check the backend to see how the file was sharded to pieces and how these pieces got distributed across the bricks:

# ls /bricks/* -lh
/bricks/1:
total 0

/bricks/2:
total 0

/bricks/3:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile

/bricks/4:
total 17M
-rw-r--r--. 2 root root 16M Dec 24 12:36 testfile

So the base file hashed to the second replica set (brick3 and brick4, which form a replica pair) and is 16M in size. Where did the remaining 68MB worth of data go? To find out, let’s check the contents of the hidden directory .shard on all the bricks:

# ls /bricks/*/.shard -lh
/bricks/1/.shard:
total 37M
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

/bricks/2/.shard:
total 37M
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.1
-rw-r--r--. 2 root root  16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.3
-rw-r--r--. 2 root root 4.0M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.5

/bricks/3/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

/bricks/4/.shard:
total 33M
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.2
-rw-r--r--. 2 root root 16M Dec 24 12:36 bc19873d-7772-4803-898c-bf14ee1ff2bd.4

So, the file was split into 6 pieces in all: 5 of them reside in the hidden directory “/.shard”, distributed across the replica sets by DHT based on the file-name hash (and disk space availability), while the first block resides in its native parent directory. Notice how blocks 1 through 4 are all 16M in size and the last block (block 5) is 4M.

Now let’s do some math to see how ‘testfile’ was “sharded”:

The total size of the write was 84MB, and the configured block size in this case is 16MB. So 84MB divided by 16MB = 5, with a remainder of 4MB.

So the file was broken into 6 pieces in all: 5 blocks of 16MB each, and a last piece holding the remaining 4MB of data.
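
The same arithmetic in shell, just as a sanity check:

# echo "$((84 / 16)) full 16MB blocks + $((84 % 16))MB in the last block"
5 full 16MB blocks + 4MB in the last block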

Now, when we view the file from the mount point, it appears as one single file:

# ls -lh /mnt/glusterfs/
total 85M
-rw-r--r--. 1 root root 84M Dec 24 12:36 testfile

Notice how the file is shown to be 84MB in size on the mount point. Similarly, when the file is read by an application, the different pieces or ‘shards’ are stitched together and presented to the application as if no chunking had been done at all.
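
Under the hood, the mount does not have to stat every piece to arrive at that 84MB figure: as mentioned earlier, shard keeps the aggregated file size and block count in xattrs on the original file. One way to peek at them on a brick – assuming the trusted.glusterfs.shard.* xattr names used by this build (output omitted here) – is:

# getfattr -d -m trusted.glusterfs.shard -e hex /bricks/3/testfile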

Advantages of sharding:

The advantages of sharding a file over striping it across a finite set of bricks are:

  • Data blocks are distributed by DHT in a “normal way”.
  • Adding servers can happen in any number (even one at a time) and DHT’s rebalance will spread out the “piece files” evenly.
  • Sharding provides better utilization of disk space. It is no longer necessary to have at least one brick of size X in order to accommodate a file of size X, where X is really large. Consider this example: a distribute volume is made up of 3 bricks of size 10GB, 20GB and 30GB. With this configuration, it is impossible to store a file greater than 30GB in size on this volume. Sharding eliminates this limitation: a file of up to 60GB in size can be stored on this volume with sharding.
  • Self-heal of a large file is now distributed across smaller shards on more servers, leading to better heal performance and lower CPU usage – traditionally a pain point for large-file workloads.
  • The piece-file naming scheme is immune to renames and hard links, since it is based on the GFID, which does not change with either operation.
  • When geo-replicating a large file to a remote volume, only the shards that changed can be synced to the slave, considerably reducing the sync time.
  • When sharding is used in conjunction with tiering, only the shards that change are promoted/demoted. This reduces the amount of data that needs to be migrated between the hot and cold tiers.
  • When sharding is used in conjunction with bit-rot detection feature of GlusterFS, the checksum is computed on smaller shards as opposed to one large file.

Sharding in its current form is, however, not compatible with directory quota. This is something we are going to focus on in the coming days – making it compatible with other Gluster features, including directory quota and the user/group quota feature which is currently in the design phase.

Thanks,
Krutika
