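After a Kudu tablet server (tserver) has been decommissioned or migrated, the old tserver keeps showing up with a dead status, and the tserver logs fill up with connection-retry messages, reaching several GB of error logs per day, for example: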
W0322 22:13:59.202749 16927 tablet_service.cc:290] Invalid argument: UpdateConsensus: Wrong destination UUID requested. Local UUID: e2f80a1fcf0c47f6b7f220a44d69297f. Requested UUID: 45bfb5b3e3ff41d9b1b1d2afab78d65c: from {username='kudu'} at 192.168.0.1:34724: tablet_id: "9933f18e59554ae6b5354e2a948469e9" caller_uuid: "9b164f37d04a484c8634ea86eae1b048" caller_term: 3 preceding_id { term: 2 index: 1873 } ops { id { term: 3 index: 1874 } timestamp: 6359719759241142272 op_type: NO_OP noop_request { } } dest_uuid: "45bfb5b3e3ff41d9b1b1d2afab78d65c" committed_index: 1874 all_replicated_index: 0 safe_timestamp: 6359719761707556864 last_idx_appended_to_leader: 1874
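To see which stale UUIDs are behind the log growth, a warning log can be scanned for the message shown above. Below is a minimal Python sketch; the regular expression is written against this one sample line only, and the log path passed to `summarize_log` (for example a `kudu-tserver.WARNING` file) is an assumption about your deployment.

```python
import re
from collections import Counter
from typing import Iterable

# Written against the sample warning above (tablet_service.cc); other Kudu
# versions may word the message differently.
WRONG_DEST_RE = re.compile(
    r"Wrong destination UUID requested\. "
    r"Local UUID: (?P<local>[0-9a-f]{32})\. "
    r"Requested UUID: (?P<requested>[0-9a-f]{32})"
)


def count_stale_uuids(lines: Iterable[str]) -> Counter:
    """Count 'Wrong destination UUID' warnings per requested (stale) UUID."""
    counts: Counter = Counter()
    for line in lines:
        match = WRONG_DEST_RE.search(line)
        if match:
            counts[match.group("requested")] += 1
    return counts


def summarize_log(path: str) -> None:
    """Print each stale UUID and its count, most frequent first."""
    # The path is deployment-specific (hypothetical example:
    # /var/log/kudu/kudu-tserver.WARNING).
    with open(path, errors="replace") as f:
        for uuid, n in count_stale_uuids(f).most_common():
            print(f"{uuid}\t{n}")
```

UUIDs with large counts are candidates for the dead servers and can be cross-checked against the tablet server UUIDs that ksck reports.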
There is no direct command for removing these dead tservers. The official procedure is as follows:
Kudu does not currently have an automated way to remove a tablet server from a cluster permanently. Instead, use the following steps:
- 1 Ensure the cluster is in good health using ksck. See Checking Cluster Health with ksck.
  - That is, first confirm that the cluster is healthy (with the ksck command).
- 2 If the tablet server contains any replicas of tables with replication factor 1, these replicas must be manually moved off the tablet server prior to shutting it down. The kudu tablet change_config move_replica tool can be used for this.
  - Migrate the replicas off the server being removed; data with a replication factor of 1 must be moved manually before the server is taken down.
- 3 Shut down the tablet server. After -follower_unavailable_considered_failed_sec, which defaults to 5 minutes, Kudu will begin to re-replicate the tablet server’s replicas to other servers. Wait until the process is finished. Progress can be monitored using ksck.
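  - Once the tserver has been down for longer than --follower_unavailable_considered_failed_sec (5 minutes by default), its replicas are re-replicated automatically.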
- 4 Once all the copies are complete, ksck will continue to report the tablet server as unavailable. The cluster will otherwise operate fine without the tablet server. To completely remove it from the cluster so ksck shows the cluster as completely healthy, restart the masters. In the case of a single master, this will cause cluster downtime. With multimaster, restart the masters in sequence to avoid cluster downtime.
  - After all replicas have been re-replicated, ksck still reports the tserver as unavailable; to remove the dead server completely, restart the masters.
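Steps 1, 2 and 4 map onto the `kudu` command-line tool. The Python sketch below only builds the commands and polls a check. It assumes the `kudu` binary is on `PATH`, that master addresses are passed as one comma-separated argument, and that `kudu cluster ksck` exits with status 0 only when it finds no problems. Because ksck keeps flagging the removed tserver until the masters are restarted, its exit code alone cannot tell you when the re-replication in step 3 has finished, so that check is left to the caller.

```python
import subprocess
import time
from typing import Callable, List, Sequence


def ksck_cmd(masters: Sequence[str]) -> List[str]:
    # Steps 1 and 4: `kudu cluster ksck <master_addresses>`.
    return ["kudu", "cluster", "ksck", ",".join(masters)]


def move_replica_cmd(masters: Sequence[str], tablet_id: str,
                     from_ts_uuid: str, to_ts_uuid: str) -> List[str]:
    # Step 2: move one replica of an RF=1 tablet off the tserver being removed.
    return ["kudu", "tablet", "change_config", "move_replica",
            ",".join(masters), tablet_id, from_ts_uuid, to_ts_uuid]


def ksck_passes(masters: Sequence[str]) -> bool:
    # Assumption: ksck returns exit status 0 only when it reports no problems.
    proc = subprocess.run(ksck_cmd(masters), capture_output=True, text=True)
    return proc.returncode == 0


def wait_until(check: Callable[[], bool], timeout_s: float,
               poll_s: float = 60.0) -> bool:
    # Poll `check` until it returns True or the timeout expires.
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, `wait_until(lambda: ksck_passes(masters), timeout_s=1800)` can gate the end of step 4 after the masters are restarted. For step 2, each command from `move_replica_cmd(...)` can be run with `subprocess.run(cmd, check=True)`; choosing the RF=1 tablet IDs and the destination tserver UUID is up to you.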
Do not shut down multiple tablet servers at once. To remove multiple tablet servers from the cluster, follow the above instructions for each tablet server, ensuring that the previous tablet server is removed from the cluster and ksck is healthy before shutting down the next.
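The one-server-at-a-time rule is essentially a loop that refuses to move on while the cluster is unhealthy. A minimal Python sketch follows; `shut_down` and `removed_and_healthy` are hypothetical hooks you would implement yourself (for example, with your service manager and with the ksck checks described in steps 3 and 4).

```python
from typing import Callable, Iterable, List


def remove_one_at_a_time(tservers: Iterable[str],
                         shut_down: Callable[[str], None],
                         removed_and_healthy: Callable[[str], bool]) -> List[str]:
    """Decommission tservers strictly one after another.

    `shut_down` and `removed_and_healthy` are hypothetical, operator-supplied
    hooks. Stops at the first server that does not reach a healthy state and
    returns the servers that were fully removed.
    """
    removed: List[str] = []
    for ts in tservers:
        shut_down(ts)
        if not removed_and_healthy(ts):
            # Never touch the next server while the cluster is unhealthy.
            break
        removed.append(ts)
    return removed
```

The same pattern fits the sequential master restart in step 4, with a restart hook in place of `shut_down`.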
Finally, after the masters have been restarted, restart the tservers one at a time, making sure the cluster is healthy throughout.
Reference: https://kudu.apache.org/docs/administration.html#tablet_server_decommissioning
Original post: https://www.cnblogs.com/barneywill/p/10581678.html