The Ceph storage cluster
A Ceph storage cluster consists of two types of daemons (usually one daemon per host):
Ceph Monitor & Ceph OSD Daemon
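A Ceph Monitor maintains the entire cluster map. Running a cluster of several monitors avoids the single point of failure that one crashed monitor would create. Storage cluster clients can fetch a copy of the cluster map from a monitor and cache it.
A Ceph OSD Daemon checks its own state and the state of other OSDs, and reports back to the monitors.
Storage cluster clients and every Ceph OSD Daemon use the CRUSH algorithm to compute data locations efficiently, instead of consulting one huge centralized lookup table.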
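1. Storing data
Ceph receives data from Ceph clients (whether it arrives through a Ceph Block Device, Ceph Object Storage, the Ceph Filesystem, or a custom implementation you build with librados) and stores it as objects. Each object corresponds to a file in a filesystem, and these files are stored on an Object Storage Device (OSD). Ceph OSD Daemons handle the read and write operations on the storage disks.
Ceph OSD Daemons store all data as objects in a flat namespace (that is, with no directory-style hierarchy). An object consists of an identifier, binary data, and metadata made up of name/value pairs. The semantics of the data are entirely up to the Ceph client. For example, the Ceph Filesystem uses metadata to store file attributes such as the file owner, creation time, and last modification time.
Note that an object ID is unique across the entire cluster, not just within the local filesystem.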
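To make the object model concrete, here is a minimal sketch of an object (identifier, binary data, name/value metadata) living in a flat namespace. The class names `StoredObject` and `FlatObjectStore` are hypothetical and chosen only for illustration; they are not part of Ceph or librados.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StoredObject:
    """Hypothetical model of one object: ID, binary data, name/value metadata."""
    object_id: str
    data: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


class FlatObjectStore:
    """Toy stand-in for a flat namespace: a single dict, no directories."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def put(self, obj: StoredObject) -> None:
        # The ID is the only key; a "/" inside it carries no hierarchy.
        self._objects[obj.object_id] = obj

    def get(self, object_id: str) -> StoredObject:
        return self._objects[object_id]


store = FlatObjectStore()
store.put(StoredObject("john/notes.txt", b"hello", {"owner": "john"}))
fetched = store.get("john/notes.txt")
```

The store itself never interprets `metadata`; as described above, what the name/value pairs mean is left to the client.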
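2. Scalability and high availability
In traditional architectures, clients reach a complex subsystem through a single centralized component. That component easily becomes a single point of failure as well as a performance and scalability bottleneck. Ceph takes a decentralized approach instead and lets clients interact with OSD Daemons directly. Ceph keeps copies of the same data on multiple nodes to ensure availability. The monitors likewise form a cluster of multiple nodes to ensure availability.
To achieve this decentralization, Ceph uses the CRUSH algorithm.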
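Introduction to CRUSH
Ceph Clients and Ceph OSD Daemons both use the CRUSH algorithm to compute object locations efficiently, instead of depending on a centralized lookup table. For a detailed description of CRUSH, see the paper "CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data".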
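The idea of computing a location rather than looking it up can be sketched in a few lines. The code below is NOT CRUSH: it uses rendezvous (highest-random-weight) hashing as a much simpler stand-in, ignores failure domains and weights, and the function name `place_object` is hypothetical. It only shows that any party holding the same OSD list arrives at the same placement without asking a central table.

```python
import hashlib
from typing import List


def place_object(object_name: str, osd_ids: List[int], replicas: int = 3) -> List[int]:
    """Rank OSDs by a hash of (object, OSD) and keep the top `replicas`.

    Rendezvous hashing used as a stand-in for CRUSH (an assumption made
    for illustration; the real algorithm walks a failure-domain hierarchy).
    """
    def score(osd_id: int) -> int:
        digest = hashlib.sha256(f"{object_name}:{osd_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    return sorted(osd_ids, key=score, reverse=True)[:replicas]


osds = list(range(10))
placement = place_object("object-123", osds)
```

Because each score depends only on the object name and the OSD ID, a client and an OSD daemon that share the same list of OSDs compute identical placements independently.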
Cluster Map
Ceph clients and Ceph OSD Daemons both hold information about the cluster topology, and Ceph depends on it. This topology information comprises five maps:
- The Monitor Map: Contains the cluster fsid, the position, name, address and port of each monitor. It also indicates the current epoch, when the map was created, and the last time it changed. To view a monitor map, execute ceph mon dump.
- The OSD Map: Contains the cluster fsid, when the map was created and last modified, a list of pools, replica sizes, PG numbers, a list of OSDs and their status (e.g., up, in). To view an OSD map, execute ceph osd dump.
- The PG Map: Contains the PG version, its time stamp, the last OSD map epoch, the full ratios, and details on each placement group such as the PG ID, the Up Set, the Acting Set, the state of the PG (e.g., active + clean), and data usage statistics for each pool.
- The CRUSH Map: Contains a list of storage devices, the failure domain hierarchy (e.g., device, host, rack, row, room, etc.), and rules for traversing the hierarchy when storing data. To view a CRUSH map, execute ceph osd getcrushmap -o {filename}; then, decompile it by executing crushtool -d {comp-crushmap-filename} -o {decomp-crushmap-filename}. You can view the decompiled map in a text editor or with cat.
- The MDS Map: Contains the current MDS map epoch, when the map was created, and the last time it changed. It also contains the pool for storing metadata, a list of metadata servers, and which metadata servers are up and in. To view an MDS map, execute ceph mds dump.
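These five maps are collectively referred to as the cluster map.
Each map maintains a history of its own operating state changes. The Ceph Monitors maintain the master copy of the cluster map.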
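The notion of five maps that each carry an epoch and a change history can be sketched as follows. This is a toy model under assumptions: the class `VersionedMap`, its fields, and the `osd_0` state strings are hypothetical and do not reflect how Ceph encodes its maps.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class VersionedMap:
    """Hypothetical map that bumps its epoch and records history on each change."""
    name: str
    epoch: int = 0
    state: Dict[str, Any] = field(default_factory=dict)
    history: List[Tuple[int, Dict[str, Any]]] = field(default_factory=list)

    def update(self, **changes: Any) -> int:
        # Save the outgoing (epoch, state) pair before applying the change.
        self.history.append((self.epoch, copy.deepcopy(self.state)))
        self.state.update(changes)
        self.epoch += 1
        return self.epoch


# The cluster map is the collection of the five maps listed above.
cluster_map = {n: VersionedMap(n) for n in ("monitor", "osd", "pg", "crush", "mds")}
cluster_map["osd"].update(osd_0="up,in")
cluster_map["osd"].update(osd_0="down,in")
```

A client holding a cached copy could compare its epoch with the monitor's to decide whether the cache is stale; that comparison is left out of the sketch.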
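3. High-availability monitors
Before a Ceph client can read or write data, it must contact a monitor and obtain a copy of the most recent cluster map. To avoid a single point of failure (if the only monitor goes down, clients can no longer read or write), Ceph supports a cluster of monitors. When one or more monitors go down, Ceph as a whole does not fail.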
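From the client's point of view, a monitor cluster means there is more than one place to fetch the cluster map from. The sketch below tries each monitor in turn; `fetch_cluster_map`, `MonitorUnavailable`, `fake_fetch`, and the `mon-a`/`mon-b`/`mon-c` names are all hypothetical, and the agreement protocol that real monitors run among themselves is not modeled.

```python
from typing import Callable, Dict, Iterable, List


class MonitorUnavailable(Exception):
    """Raised by a fetch function when a monitor cannot be reached."""


def fetch_cluster_map(monitors: Iterable[str],
                      fetch: Callable[[str], Dict]) -> Dict:
    """Return the cluster map from the first monitor that answers."""
    errors: List[str] = []
    for address in monitors:
        try:
            return fetch(address)
        except MonitorUnavailable as exc:
            errors.append(f"{address}: {exc}")
    raise MonitorUnavailable("no monitor reachable: " + "; ".join(errors))


def fake_fetch(address: str) -> Dict:
    # Simulated network call: pretend that mon-a has crashed.
    if address == "mon-a":
        raise MonitorUnavailable("connection refused")
    return {"epoch": 7, "served_by": address}


result = fetch_cluster_map(["mon-a", "mon-b", "mon-c"], fake_fetch)
```

With a single monitor in the list, the same failure would leave the client with no cluster map at all, which is the single point of failure described above.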
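4. High-availability authentication
To identify users and protect the system against man-in-the-middle attacks, Ceph provides the cephx authentication system, which authenticates users and daemons. Note that cephx does not handle encryption of data in transit or at any other stage.
Cephx uses shared secret keys for authentication, which means that both the client and the monitor cluster hold a copy of the client's secret key. The authentication protocol lets both parties prove their identity to each other without revealing the key.
Because Ceph is built to scale, it avoids by design any centralized interface to the Ceph object store, which means that Ceph clients can access OSDs directly. The cephx protocol operates in a manner similar to Kerberos.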
A user/actor invokes a Ceph client to contact a monitor. Unlike Kerberos, each monitor can authenticate users and distribute keys, so there is no single point of failure or bottleneck when using cephx. The monitor returns an authentication data structure similar to a Kerberos ticket that contains a session key for use in obtaining Ceph services. This session key is itself encrypted with the user’s permanent secret key, so that only the user can request services from the Ceph monitor(s). The client then uses the session key to request its desired services from the monitor, and the monitor provides the client with a ticket that will authenticate the client to the OSDs that actually handle data. Ceph monitors and OSDs share a secret, so the client can use the ticket provided by the monitor with any OSD or metadata server in the cluster. Like Kerberos, cephx tickets expire, so an attacker cannot use an expired ticket or session key obtained surreptitiously. This form of authentication will prevent attackers with access to the communications medium from either creating bogus messages under another user’s identity or altering another user’s legitimate messages, as long as the user’s secret key is not divulged before it expires.
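To use cephx, an administrator must first set up users. In the following diagram, the client.admin user (presumably located on the client side) issues the ceph auth get-or-create-key command to generate a user name and secret key. After Ceph's auth subsystem generates the user name and key, it stores a copy on the monitor(s) and transmits the key back to the client.admin user. This means that the client and the monitor share a secret key.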
To authenticate with the monitor, the client passes in the user name to the monitor, and the monitor generates a session key and encrypts it with the secret key associated to the user name. Then, the monitor transmits the encrypted ticket back to the client. The client then decrypts the payload with the shared secret key to retrieve the session key. The session key identifies the user for the current session. The client then requests a ticket on behalf of the user signed by the session key. The monitor generates a ticket, encrypts it with the user’s secret key and transmits it back to the client. The client decrypts the ticket and uses it to sign requests to OSDs and metadata servers throughout the cluster.
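The exchange described above can be traced end to end with a toy model. Everything here is an assumption made for illustration: `ToyMonitor`, `toy_cipher`, the `client.demo` user, the message formats, and the use of HMAC-SHA256 for the signed ticket request are hypothetical, and the XOR "cipher" is NOT secure. The sketch only follows the order of steps: shared secret, encrypted session key, signed ticket request, encrypted ticket.

```python
import hashlib
import hmac
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256-derived keystream (applying it twice undoes it).

    NOT secure; it only stands in for a real symmetric cipher.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyMonitor:
    """Hypothetical monitor that holds each user's shared secret."""

    def __init__(self) -> None:
        self.user_secrets = {}   # user name -> shared secret key
        self.session_keys = {}   # user name -> current session key
        self.tickets = {}        # user name -> last issued ticket

    def create_user(self, name: str) -> bytes:
        # Setup step: the monitor keeps one copy, the caller gets the other.
        self.user_secrets[name] = secrets.token_bytes(16)
        return self.user_secrets[name]

    def start_session(self, name: str) -> bytes:
        # The session key travels encrypted under the user's secret key.
        self.session_keys[name] = secrets.token_bytes(16)
        return toy_cipher(self.user_secrets[name], self.session_keys[name])

    def issue_ticket(self, name: str, request: bytes, signature: bytes) -> bytes:
        # The ticket request must be signed with the session key.
        expected = hmac.new(self.session_keys[name], request, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, signature):
            raise PermissionError("ticket request not signed by the session key")
        self.tickets[name] = secrets.token_bytes(16)
        return toy_cipher(self.user_secrets[name], self.tickets[name])


monitor = ToyMonitor()
secret = monitor.create_user("client.demo")

# Client side: only the user name and encrypted payloads cross the wire.
session_key = toy_cipher(secret, monitor.start_session("client.demo"))
request = b"ticket-request:client.demo"
signature = hmac.new(session_key, request, hashlib.sha256).digest()
ticket = toy_cipher(secret, monitor.issue_ticket("client.demo", request, signature))
```

Note that the shared secret itself is never transmitted after setup; the client proves it holds the secret by being able to decrypt the session key and sign with it.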
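The cephx protocol authenticates ongoing communication between the client and the Ceph servers. After the initial authentication, every message sent between the client and a server is signed using the ticket.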
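Per-message signing can be illustrated with a keyed hash. This is a sketch under assumptions, not the cephx wire format: it treats the ticket as raw bytes usable as an HMAC-SHA256 key, and `sign_message`, `verify_message`, and the sample ticket and message values are hypothetical.

```python
import hashlib
import hmac


def sign_message(ticket: bytes, message: bytes) -> bytes:
    """Compute a signature over the message, keyed by the ticket (assumed scheme)."""
    return hmac.new(ticket, message, hashlib.sha256).digest()


def verify_message(ticket: bytes, message: bytes, signature: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_message(ticket, message), signature)


ticket = b"hypothetical-ticket-bytes"
message = b"write object-123"
signature = sign_message(ticket, message)

accepted = verify_message(ticket, message, signature)
tampered = verify_message(ticket, b"write object-999", signature)
```

A receiver that holds the same ticket material accepts the original message and rejects the altered one, which is the property that stops an attacker on the network from modifying a legitimate user's messages.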