6 Cluster membership changes
Up until now we have assumed that the cluster configuration (the set of servers participating in the consensus algorithm) is fixed. In practice, it will occasionally be necessary to change the configuration, for example to replace servers when they fail or to change the degree of replication. Although this can be done by taking the entire cluster off-line, updating configuration files, and then restarting the cluster, this would leave the cluster unavailable during the changeover. In addition, if there are any manual steps, they risk operator error. In order to avoid these issues, we decided to automate configuration changes and incorporate them into the Raft consensus algorithm.
For the configuration change mechanism to be safe, there must be no point during the transition where it is possible for two leaders to be elected for the same term. Unfortunately, any approach where servers switch directly from the old configuration to the new configuration is unsafe. It isn’t possible to atomically switch all of the servers at once, so the cluster can potentially split into two independent majorities during the transition (see Figure 10).
In order to ensure safety, configuration changes must use a two-phase approach. There are a variety of ways to implement the two phases. For example, some systems (e.g., [22]) use the first phase to disable the old configuration so it cannot process client requests; then the second phase enables the new configuration. In Raft the cluster first switches to a transitional configuration we call joint consensus; once the joint consensus has been committed, the system then transitions to the new configuration. The joint consensus combines both the old and new configurations:
- Log entries are replicated to all servers in both configurations.
- Any server from either configuration may serve as leader.
- Agreement (for elections and entry commitment) requires separate majorities from both the old and new configurations.
The joint consensus allows individual servers to transition between configurations at different times without compromising safety. Furthermore, joint consensus allows the cluster to continue servicing client requests throughout the configuration change.
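To make the agreement rule concrete, the following sketch (in Go, with illustrative type and function names that are not taken from any particular Raft implementation) shows how a tally of votes or replication acknowledgements might be checked against the joint C_old,new rules, requiring separate majorities of the old and the new server sets:

```go
package raft

// Configuration identifies the voting members of a single cluster configuration.
type Configuration struct {
	Servers map[string]bool // server ID -> true if a member
}

// majority reports whether the acknowledging servers form a majority of
// this configuration.
func (c Configuration) majority(acks map[string]bool) bool {
	count := 0
	for id := range c.Servers {
		if acks[id] {
			count++
		}
	}
	return count > len(c.Servers)/2
}

// JointConfig is the transitional C_old,new configuration.
type JointConfig struct {
	Old, New Configuration
}

// agreed implements the joint agreement rule: elections and entry
// commitment both require separate majorities of the old and of the new
// configuration.
func (j JointConfig) agreed(acks map[string]bool) bool {
	return j.Old.majority(acks) && j.New.majority(acks)
}
```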
Cluster configurations are stored and communicated using special entries in the replicated log; Figure 11 illustrates the configuration change process. When the leader receives a request to change the configuration from C_old to C_new, it stores the configuration for joint consensus (C_old,new in the figure) as a log entry and replicates that entry using the mechanisms described previously. Once a given server adds the new configuration entry to its log, it uses that configuration for all future decisions (a server always uses the latest configuration in its log, regardless of whether the entry is committed). This means that the leader will use the rules of C_old,new to determine when the log entry for C_old,new is committed. If the leader crashes, a new leader may be chosen under either C_old or C_old,new, depending on whether the winning candidate has received C_old,new. In any case, C_new cannot make unilateral decisions during this period.
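One possible way to realize the rule that a server always acts on the latest configuration in its log, committed or not, is sketched below. The LogEntry and Server shapes are hypothetical and reuse the Configuration/JointConfig types from the previous sketch:

```go
package raft

// EntryType distinguishes ordinary client commands from configuration entries.
type EntryType int

const (
	EntryNormal EntryType = iota
	EntryConfig // carries a configuration (C_old,new or C_new) as its payload
)

// LogEntry is a simplified log entry; Config is non-nil only for EntryConfig.
type LogEntry struct {
	Term   uint64
	Type   EntryType
	Config *JointConfig
}

// Server holds only the per-server state this sketch needs.
type Server struct {
	log    []LogEntry
	config JointConfig // latest configuration found anywhere in the log
}

// appendToLog adds entries to the server's log and returns the index of the
// last one. A configuration entry takes effect as soon as it is appended,
// whether or not it is committed: the server always uses the latest
// configuration in its log for elections and commitment decisions.
func (s *Server) appendToLog(entries ...LogEntry) uint64 {
	for _, e := range entries {
		s.log = append(s.log, e)
		if e.Type == EntryConfig && e.Config != nil {
			s.config = *e.Config // adopt immediately, before commitment
		}
	}
	return uint64(len(s.log)) // 1-based index of the last appended entry
}
```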
Once C_old,new has been committed, neither C_old nor C_new can make decisions without approval of the other, and the Leader Completeness Property ensures that only servers with the C_old,new log entry can be elected as leader. It is now safe for the leader to create a log entry describing C_new and replicate it to the cluster. Again, this configuration will take effect on each server as soon as it is seen. When the new configuration has been committed under the rules of C_new, the old configuration is irrelevant and servers not in the new configuration can be shut down. As shown in Figure 11, there is no time when C_old and C_new can both make unilateral decisions; this guarantees safety.
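The full leader-side sequence can then be sketched as two appended configuration entries, each committed under the rules it introduces. The leaderLog interface and the representation of C_new as a degenerate joint configuration (both halves identical) are assumptions made for illustration, not part of the paper's description:

```go
package raft

// leaderLog abstracts the two operations the reconfiguration flow needs;
// a real leader would provide these on top of its replication machinery.
type leaderLog interface {
	Append(e LogEntry) uint64 // append locally and start replicating; returns index
	WaitCommitted(index uint64)
}

// changeConfiguration drives the two-phase change from oldCfg to newCfg.
func changeConfiguration(l leaderLog, term uint64, oldCfg, newCfg Configuration) {
	// Phase 1: C_old,new. From the moment this entry is appended, decisions
	// are made under the joint rules, so committing it already requires
	// separate majorities of oldCfg and newCfg.
	joint := LogEntry{Term: term, Type: EntryConfig,
		Config: &JointConfig{Old: oldCfg, New: newCfg}}
	l.WaitCommitted(l.Append(joint))

	// Phase 2: C_new, modeled here as a degenerate joint configuration whose
	// halves are identical. Once it is committed under the rules of C_new,
	// servers outside newCfg are no longer needed and can be shut down.
	final := LogEntry{Term: term, Type: EntryConfig,
		Config: &JointConfig{Old: newCfg, New: newCfg}}
	l.WaitCommitted(l.Append(final))
}
```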
There are three more issues to address for reconfiguration. The first issue is that new servers may not initially store any log entries. If they are added to the cluster in this state, it could take quite a while for them to catch up, during which time it might not be possible to commit new log entries. In order to avoid availability gaps, Raft introduces an additional phase before the configuration change, in which the new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the reconfiguration can proceed as described above.
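A sketch of how the catch-up phase might be tracked, assuming hypothetical Progress bookkeeping on the leader: non-voting servers receive entries and report match indexes, but are filtered out of any majority calculation until they are promoted:

```go
package raft

// Progress is the leader's view of one server: how much of the leader's log
// it is known to store, and whether it currently has a vote.
type Progress struct {
	MatchIndex uint64
	Voting     bool // false while the server is a catching-up, non-voting member
}

// caughtUp reports whether a non-voting server is within `slack` entries of
// the leader's last log index and can therefore be promoted, after which the
// reconfiguration itself can proceed as described above.
func caughtUp(p Progress, leaderLastIndex, slack uint64) bool {
	return p.MatchIndex+slack >= leaderLastIndex
}

// votingAcks filters acknowledgements down to voting members, so that
// non-voting servers receive log entries but are never counted toward the
// majorities used for elections or commitment.
func votingAcks(acks map[string]bool, progress map[string]Progress) map[string]bool {
	out := make(map[string]bool)
	for id, ok := range acks {
		if ok && progress[id].Voting {
			out[id] = true
		}
	}
	return out
}
```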
The second issue is that the cluster leader may not be part of the new configuration. In this case, the leader steps down (returns to follower state) once it has committed the C_new log entry. This means that there will be a period of time (while it is committing C_new) when the leader is managing a cluster that does not include itself; it replicates log entries but does not count itself in majorities. The leader transition occurs when C_new is committed because this is the first point when the new configuration can operate independently (it will always be possible to choose a leader from C_new). Before this point, it may be the case that only a server from C_old can be elected leader.
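The two leader-side rules from this paragraph can be expressed as small predicates; the function names and the cnewCommitted flag are illustrative assumptions:

```go
package raft

// countsTowardMajority: a leader counts its own log as an acknowledgement
// only if it is a member of the configuration it is currently using; while
// committing a C_new that excludes it, it replicates entries but does not
// count itself.
func countsTowardMajority(selfID string, cfg Configuration) bool {
	return cfg.Servers[selfID]
}

// shouldStepDown: once C_new is committed the new configuration can operate
// on its own, so a leader that is not part of C_new returns to follower
// state at that point (and not before).
func shouldStepDown(selfID string, cnew Configuration, cnewCommitted bool) bool {
	return cnewCommitted && !cnew.Servers[selfID]
}
```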
The third issue is that removed servers (those not in C_new) can disrupt the cluster. These servers will not receive heartbeats, so they will time out and start new elections. They will then send RequestVote RPCs with new term numbers, and this will cause the current leader to revert to follower state. A new leader will eventually be elected, but the removed servers will time out again and the process will repeat, resulting in poor availability.
To prevent this problem, servers disregard RequestVote RPCs when they believe a current leader exists. Specifically, if a server receives a RequestVote RPC within the minimum election timeout of hearing from a current leader, it does not update its term or grant its vote. This does not affect normal elections, where each server waits at least a minimum election timeout before starting an election. However, it helps avoid disruptions from removed servers: if a leader is able to get heartbeats to its cluster, then it will not be deposed by larger term numbers.
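A minimal sketch of this guard, assuming the receiving server records the time of its last contact with a current leader:

```go
package raft

import "time"

// ignoreRequestVote implements the disruption guard: a RequestVote that
// arrives within the minimum election timeout of the last contact from a
// current leader is ignored outright; the receiver neither updates its term
// nor grants its vote.
func ignoreRequestVote(now, lastLeaderContact time.Time, minElectionTimeout time.Duration) bool {
	return now.Sub(lastLeaderContact) < minElectionTimeout
}
```

A full RequestVote handler would evaluate this check before the usual term comparison and log up-to-date check, returning immediately, without updating its term or granting its vote, whenever the check fires.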