Some thoughts on Ceph tier

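Our in-house Ceph test environment has been running for a while. We mainly use the block devices provided by RBD to create virtual machines and to allocate volumes to them, and it has been quite stable. Most of the configuration is still at Ceph's defaults, though; the only change is that the journal has been moved out to a separate partition. Next I plan to use Ceph tier and SSDs for some optimization:

1. Write the journal to a dedicated SSD.

2. Build an SSD pool from the SSDs and use that pool as a cache for the other pools, which requires Ceph tier.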
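
The second item can be sketched as the sequence of `ceph osd tier` commands that place a cache pool in front of a backing pool. The Python helper below only builds the command lines and does not run them, so it can be tried without a cluster. The pool names `rbd` and `ssd` and the `writeback` cache mode are assumptions for illustration, not settings taken from this environment, and the commands need a Ceph release that supports cache tiering.

```python
def cache_tier_commands(storage_pool, cache_pool, mode="writeback"):
    """Return the ceph CLI invocations that put cache_pool in front of storage_pool."""
    return [
        # Attach the cache pool to the backing pool as a tier.
        ["ceph", "osd", "tier", "add", storage_pool, cache_pool],
        # Choose how the cache behaves (writeback is an assumed example).
        ["ceph", "osd", "tier", "cache-mode", cache_pool, mode],
        # Redirect client traffic for the backing pool through the cache.
        ["ceph", "osd", "tier", "set-overlay", storage_pool, cache_pool],
    ]


if __name__ == "__main__":
    # Hypothetical pool names: "rbd" as backing pool, "ssd" as cache pool.
    for cmd in cache_tier_commands("rbd", "ssd"):
        print(" ".join(cmd))
```

Printing the commands first makes it easy to review them before pasting them into a shell on an admin node.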

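I searched online and have not yet found any write-ups of this kind of setup, or of how much performance it actually gains. So once this plan is implemented I will benchmark the following configurations:

1. A default Ceph installation.

2. Journal moved to a separate partition on an ordinary hard disk.

3. Journal moved to a dedicated SSD.

4. With the SSD pool added.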

For the CRUSH configuration, see this article: http://www.sebastien-han.fr/blog/2012/12/07/ceph-2-speed-storage-with-crush/

I. Use case

Roughly speaking, your infrastructure could be based on several types of servers:

storage nodes full of SSD disks
storage nodes full of SAS disks
storage nodes full of SATA disks

Placing each pool on one type of disk is possible with the help of the CRUSH map.

II. A bit about CRUSH

CRUSH stands for Controlled Replication Under Scalable Hashing:

Pseudo-random placement algorithm
Fast calculation, no lookup
Repeatable, deterministic
Ensures even distribution
Stable mapping
Limited data migration
Rule-based configuration, rule determines data placement
Infrastructure topology aware, the map knows the structure of your infra (nodes, racks, rows, datacenters)
Allows weighting, every OSD has a weight
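
The properties listed above (deterministic, no lookup table, weighted, limited data migration) can be illustrated with a small weighted rendezvous-hashing sketch. This is not the CRUSH algorithm itself, only a simplified stand-in with a similar flavor; the function names `unit_hash` and `place` and the use of SHA-256 are assumptions made for this example.

```python
import hashlib
import math


def unit_hash(obj, osd):
    """Deterministically map (object, osd) to a float strictly between 0 and 1."""
    digest = hashlib.sha256(f"{obj}:{osd}".encode()).digest()
    top52 = int.from_bytes(digest[:8], "big") >> 12  # keep 52 bits so the float is exact
    return (top52 + 0.5) / 2**52


def place(obj, weights, replicas=2):
    """Return `replicas` distinct OSD names for obj, favouring heavier OSDs."""
    def score(osd):
        # Weighted rendezvous hashing: every OSD draws a score, the largest win.
        return -weights[osd] / math.log(unit_hash(obj, osd))
    return sorted(weights, key=score, reverse=True)[:replicas]


if __name__ == "__main__":
    weights = {"osd.0": 1.0, "osd.1": 1.0, "osd.2": 1.0, "osd.3": 1.0}
    print(place("ssd.pool.object", weights))
```

Because each OSD's score depends only on the object name and that OSD, any client computes the same placement without a lookup table, and removing one OSD only moves the objects that were stored on it.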

For more details check the Ceph Official documentation.

III. Setup

What are we going to do?

Retrieve the current CRUSH Map
Decompile the CRUSH Map
Edit it. We will add 2 buckets and 2 rulesets
Recompile the new CRUSH Map.
Re-inject the new CRUSH Map.

III.1. Begin

Grab your current CRUSH map:

$ ceph osd getcrushmap -o ma-crush-map
$ crushtool -d ma-crush-map -o ma-crush-map.txt

For the sake of simplicity, let’s assume that you have 4 OSDs:

2 of them are SAS disks
2 of them are enterprise SSDs

And here is the OSD tree:

$ ceph osd tree
dumped osdmap tree epoch 621
# id    weight  type name   up/down reweight
-1  12  pool default
-3  12      rack le-rack
-2  3           host ceph-01
0   1               osd.0   up  1
1   1               osd.1   up  1
-4  3           host ceph-02
2   1               osd.2   up  1
3   1               osd.3   up  1

III.2. Default crush map

Edit your CRUSH map:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host ceph-01 {
    id -2       # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0  # rjenkins1
    item osd.0 weight 1.000
    item osd.1 weight 1.000
}
host ceph-02 {
    id -4       # do not change unnecessarily
    # weight 3.000
    alg straw
    hash 0  # rjenkins1
    item osd.2 weight 1.000
    item osd.3 weight 1.000
}
rack le-rack {
    id -3       # do not change unnecessarily
    # weight 12.000
    alg straw
    hash 0  # rjenkins1
    item ceph-01 weight 2.000
    item ceph-02 weight 2.000
}
pool default {
    id -1       # do not change unnecessarily
    # weight 12.000
    alg straw
    hash 0  # rjenkins1
    item le-rack weight 4.000
}

# rules
rule data {
    ruleset 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule metadata {
    ruleset 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule rbd {
    ruleset 2
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
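
In a decompiled map, the `# weight` comment on a bucket is the sum of the weights of its items, so it is worth re-checking the sums after editing the file by hand. The sketch below is a minimal parser for that purpose; it assumes the simple layout shown above (one `item NAME weight W` per line) and the helper name `bucket_weights` is made up for this example.

```python
import re


def bucket_weights(crush_text):
    """Sum the `item ... weight W` lines of every bucket in a decompiled CRUSH map."""
    totals = {}
    current = None
    for raw in crush_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        opened = re.match(r"^(\S+)\s+(\S+)\s*\{$", line)
        if opened:
            # Rules also use braces but contain steps, not items, so skip them.
            current = None if opened.group(1) == "rule" else opened.group(2)
            if current is not None:
                totals[current] = 0.0
        elif line == "}":
            current = None
        elif current is not None:
            item = re.match(r"^item\s+\S+\s+weight\s+([0-9.]+)$", line)
            if item:
                totals[current] += float(item.group(1))
    return totals


if __name__ == "__main__":
    # ma-crush-map.txt is the decompiled file produced earlier with crushtool -d.
    with open("ma-crush-map.txt") as handle:
        for name, total in bucket_weights(handle.read()).items():
            print(f"{name}: {total:.3f}")
```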

III.3. Add buckets and rules

Now we have to add 2 new specific rules:

one for the SSD pool
one for the SAS pool

III.3.1. SSD Pool

Add a bucket for the pool SSD:

pool ssd {
    id -5       # do not change unnecessarily
    alg straw
    hash 0  # rjenkins1
    item osd.0 weight 1.000
    item osd.1 weight 1.000
}

Add a rule for the newly created bucket:

rule ssd {
    ruleset 3
    type replicated
    min_size 1
    max_size 10
    step take ssd
    step choose firstn 0 type osd   # the ssd bucket holds OSDs directly, so choose OSDs rather than hosts
    step emit
}

III.3.2. SAS Pool

Add a bucket for the pool SAS:

pool sas {
    id -6       # do not change unnecessarily
    alg straw
    hash 0  # rjenkins1
    item osd.2 weight 1.000
    item osd.3 weight 1.000
}

Add a rule for the newly created bucket:

rule sas {
    ruleset 4
    type replicated
    min_size 1
    max_size 10
    step take sas
    step choose firstn 0 type osd   # the sas bucket holds OSDs directly, so choose OSDs rather than hosts
    step emit
}

Finally, recompile and inject the new CRUSH map:

$ crushtool -c ma-crush-map.txt -o ma-nouvelle-crush-map
$ ceph osd setcrushmap -i ma-nouvelle-crush-map

III.4. Create and configure the pools

Create your 2 new pools:

$ rados mkpool ssd
successfully created pool ssd
$ rados mkpool sas
successfully created pool sas

Assign the matching ruleset to each pool:

$ ceph osd pool set ssd crush_ruleset 3
$ ceph osd pool set sas crush_ruleset 4

Check that the changes have been applied successfully:

$ ceph osd dump | grep -E 'ssd|sas'
pool 3 'ssd' rep size 2 crush_ruleset 3 object_hash rjenkins pg_num 128 pgp_num 128 last_change 21 owner 0
pool 4 'sas' rep size 2 crush_ruleset 4 object_hash rjenkins pg_num 128 pgp_num 128 last_change 23 owner 0

Just create some test files (zero-filled here) and put them into your object store:

$ dd if=/dev/zero of=ssd.pool bs=1M count=512 conv=fsync
$ dd if=/dev/zero of=sas.pool bs=1M count=512 conv=fsync
$ rados -p ssd put ssd.pool ssd.pool.object
$ rados -p sas put sas.pool sas.pool.object

Where are the PGs active?

$ ceph osd map ssd ssd.pool.object
osdmap e260 pool 'ssd' (3) object 'ssd.pool.object' -> pg 3.c5034eb8 (3.0) -> up [1,0] acting [1,0]

$ ceph osd map sas sas.pool.object
osdmap e260 pool 'sas' (4) object 'sas.pool.object' -> pg 4.9202e7ee (4.0) -> up [3,2] acting [3,2]
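
To check placements like these automatically, the output line can be parsed and the acting set compared with the OSDs expected for the pool. This is a rough sketch that assumes the exact output format shown above, with plain ASCII quotes; other Ceph releases may print the line differently, and `parse_osd_map` is a name invented for this example.

```python
import re

OSD_MAP_LINE = re.compile(
    r"pool '([^']+)' \((\d+)\) object '([^']+)' -> pg \S+ \(([^)]+)\)"
    r" -> up \[([\d,]*)\] acting \[([\d,]*)\]"
)


def parse_osd_map(line):
    """Extract pool, object, PG id, and the up/acting OSD ids from one output line."""
    match = OSD_MAP_LINE.search(line)
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")

    def ids(text):
        return [int(part) for part in text.split(",") if part]

    return {
        "pool": match.group(1),
        "object": match.group(3),
        "pg": match.group(4),
        "up": ids(match.group(5)),
        "acting": ids(match.group(6)),
    }


if __name__ == "__main__":
    line = ("osdmap e260 pool 'ssd' (3) object 'ssd.pool.object' "
            "-> pg 3.c5034eb8 (3.0) -> up [1,0] acting [1,0]")
    placement = parse_osd_map(line)
    # In this walkthrough the SSD bucket contains osd.0 and osd.1.
    print(placement, set(placement["acting"]) <= {0, 1})
```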

CRUSH Rules! As you can see from this article CRUSH allows you to perform amazing things. The CRUSH Map could be very complex, but it brings a lot of flexibility! Happy CRUSH Mapping ;-)

Time: 2024-10-29 19:06:20
