k8s practice 9: A Record of a Failed Kubernetes Cluster Crash Recovery

1.
The whole cluster crashed seemingly out of nowhere: no kubectl command worked, and every component failed to start (only controller-manager and scheduler stayed healthy). The various symptoms and errors are recorded below:

[root@k8s-master2 ~]# kubectl get cs
error: the server doesn't have a resource type "cs"
[root@k8s-master2 ~]# systemctl status flanneld
● flanneld.service - Flanneld overlay address etcd agent
   Loaded: loaded (/etc/systemd/system/flanneld.service; enabled; vendor preset: disabled)
   Active: activating (start) since Mon 2019-04-08 11:26:35 CST; 37s ago
 Main PID: 6691 (flanneld)
   Memory: 10.6M
   CGroup: /system.slice/flanneld.service
           └─6691 /opt/k8s/bin/flanneld -etcd-cafile=/etc/kubernetes/cert/ca.pem -etcd-certfile=/etc/flanneld/...

Apr 08 11:26:45 k8s-master2 flanneld[6691]: ; error #2: dial tcp 192.168.32.129:2379: getsockopt: connecti...used
Apr 08 11:26:46 k8s-master2 flanneld[6691]: timed out
Apr 08 11:26:56 k8s-master2 flanneld[6691]: E0408 11:26:56.506816    6691 main.go:349] Couldn't fetch netw...used
Apr 08 11:26:56 k8s-master2 flanneld[6691]: ; error #1: net/http: TLS handshake timeout
Apr 08 11:26:56 k8s-master2 flanneld[6691]: ; error #2: dial tcp 192.168.32.129:2379: getsockopt: connecti...used
Apr 08 11:26:57 k8s-master2 flanneld[6691]: timed out
Apr 08 11:27:07 k8s-master2 flanneld[6691]: E0408 11:27:07.511956    6691 main.go:349] Couldn't fetch netw...used
Apr 08 11:27:07 k8s-master2 flanneld[6691]: ; error #1: net/http: TLS handshake timeout
Apr 08 11:27:07 k8s-master2 flanneld[6691]: ; error #2: dial tcp 192.168.32.129:2379: getsockopt: connecti...used
Apr 08 11:27:08 k8s-master2 flanneld[6691]: timed out
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master2 ~]# systemctl status kube-apiserver
● kube-apiserver.service - Kubernetes API Server
   Loaded: loaded (/etc/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Mon 2019-04-08 11:30:01 CST; 4s ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
  Process: 7348 ExecStart=/opt/k8s/bin/kube-apiserver --enable-admission-plugins=Initializers,NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota --anonymous-auth=false --experimental-encryption-provider-config=/etc/kubernetes/encryption-config.yaml --advertise-address=192.168.32.129 --bind-address=192.168.32.129 --insecure-port=0 --authorization-mode=Node,RBAC --runtime-config=api/all --enable-bootstrap-token-auth --service-cluster-ip-range=10.254.0.0/16 --service-node-port-range=8400-9000 --tls-cert-file=/etc/kubernetes/cert/kubernetes.pem --tls-private-key-file=/etc/kubernetes/cert/kubernetes-key.pem --client-ca-file=/etc/kubernetes/cert/ca.pem --kubelet-client-certificate=/etc/kubernetes/cert/kubernetes.pem --kubelet-client-key=/etc/kubernetes/cert/kubernetes-key.pem --service-account-key-file=/etc/kubernetes/cert/ca-key.pem --etcd-cafile=/etc/kubernetes/cert/ca.pem --etcd-certfile=/etc/kubernetes/cert/kubernetes.pem --etcd-keyfile=/etc/kubernetes/cert/kubernetes-key.pem --etcd-servers=https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379 --enable-swagger-ui=true --allow-privileged=true --apiserver-count=3 --audit-log-maxage=30 --audit-log-maxbackup=3 --audit-log-maxsize=100 --audit-log-path=/var/log/kube-apiserver-audit.log --event-ttl=1h --alsologtostderr=true --logtostderr=false --log-dir=/var/log/kubernetes --v=2 (code=exited, status=255)
 Main PID: 7348 (code=exited, status=255)
   Memory: 0B
   CGroup: /system.slice/kube-apiserver.service

Apr 08 11:30:01 k8s-master2 systemd[1]: kube-apiserver.service: main process exited, code=exited, status=255/n/a
Apr 08 11:30:01 k8s-master2 systemd[1]: Failed to start Kubernetes API Server.
Apr 08 11:30:01 k8s-master2 systemd[1]: Unit kube-apiserver.service entered failed state.
Apr 08 11:30:01 k8s-master2 systemd[1]: kube-apiserver.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master2 ~]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Mon 2019-04-08 11:28:02 CST; 358ms ago
     Docs: https://github.com/coreos
  Process: 7001 ExecStart=/opt/k8s/bin/etcd --data-dir=/var/lib/etcd --name=k8s-master2 --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-cert-file=/etc/etcd/cert/etcd.pem --peer-key-file=/etc/etcd/cert/etcd-key.pem --peer-trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-client-cert-auth --client-cert-auth --listen-peer-urls=https://192.168.32.129:2380 --initial-advertise-peer-urls=https://192.168.32.129:2380 --listen-client-urls=https://192.168.32.129:2379,http://127.0.0.1:2379 --advertise-client-urls=https://192.168.32.129:2379 --initial-cluster-token=etcd-cluster-0 --initial-cluster=k8s-master1=https://192.168.32.128:2380,k8s-master2=https://192.168.32.129:2380,k8s-master3=https://192.168.32.130:2380 --initial-cluster-state=new (code=exited, status=2)
 Main PID: 7001 (code=exited, status=2)

Apr 08 11:28:02 k8s-master2 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 08 11:28:02 k8s-master2 systemd[1]: Failed to start Etcd Server.
Apr 08 11:28:02 k8s-master2 systemd[1]: Unit etcd.service entered failed state.
Apr 08 11:28:02 k8s-master2 systemd[1]: etcd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

2.
Troubleshooting approach
The most critical data in the cluster lives in etcd: the flannel network configuration is stored there, and so is all other cluster state.
The other components read that state through kube-apiserver.
So start with etcd, and get it running first.
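That triage order can be written down as a quick sweep over the components in dependency order. This is a rough sketch, not a full diagnostic; the service names are the ones this cluster's systemd units use:

```shell
# Check the core services bottom-up: etcd -> kube-apiserver -> flanneld -> docker -> kubelet.
# The first non-active service in this list is usually where to start digging.
for svc in etcd kube-apiserver flanneld docker kubelet; do
  state=$(systemctl is-active "${svc}" 2>/dev/null)
  echo "== ${svc}: ${state:-unknown} =="
done
```

Whatever fails first in this chain, fix that before touching anything above it.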

Restarting etcd fails, see below:

[root@k8s-master3 ~]# systemctl daemon-reload && systemctl restart etcd
Job for etcd.service failed because the control process exited with error code. See "systemctl status etcd.service" and "journalctl -xe" for details.
[root@k8s-master3 ~]# systemctl status etcd.service
● etcd.service - Etcd Server
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) (Result: exit-code) since Mon 2019-04-08 13:47:20 CST; 1s ago
     Docs: https://github.com/coreos
  Process: 25019 ExecStart=/opt/k8s/bin/etcd --data-dir=/var/lib/etcd --name=k8s-master3 --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-cert-file=/etc/etcd/cert/etcd.pem --peer-key-file=/etc/etcd/cert/etcd-key.pem --peer-trusted-ca-file=/etc/kubernetes/cert/ca.pem --peer-client-cert-auth --client-cert-auth --listen-peer-urls=https://192.168.32.130:2380 --initial-advertise-peer-urls=https://192.168.32.130:2380 --listen-client-urls=https://192.168.32.130:2379,http://127.0.0.1:2379 --advertise-client-urls=https://192.168.32.130:2379 --initial-cluster-token=etcd-cluster-0 --initial-cluster=k8s-master1=https://192.168.32.128:2380,k8s-master2=https://192.168.32.129:2380,k8s-master3=https://192.168.32.130:2380 --initial-cluster-state=new (code=exited, status=2)
 Main PID: 25019 (code=exited, status=2)

Apr 08 13:47:20 k8s-master3 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 08 13:47:20 k8s-master3 systemd[1]: Failed to start Etcd Server.
Apr 08 13:47:20 k8s-master3 systemd[1]: Unit etcd.service entered failed state.
Apr 08 13:47:20 k8s-master3 systemd[1]: etcd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master3 ~]# journalctl -f etcd
Failed to add match 'etcd': Invalid argument
Failed to add filters: Invalid argument
[root@k8s-master3 ~]# journalctl -u etcd
-- Logs begin at Mon 2019-04-08 12:10:25 CST, end at Mon 2019-04-08 13:47:42 CST. --
Apr 08 12:10:25 k8s-master3 systemd[1]: Starting Etcd Server...
Apr 08 12:10:25 k8s-master3 etcd[4390]: etcd Version: 3.3.7
Apr 08 12:10:25 k8s-master3 etcd[4390]: Git SHA: 56536de55
Apr 08 12:10:25 k8s-master3 etcd[4390]: Go Version: go1.9.6
Apr 08 12:10:25 k8s-master3 etcd[4390]: Go OS/Arch: linux/amd64
Apr 08 12:10:25 k8s-master3 etcd[4390]: setting maximum number of CPUs to 1, total number of available CPUs is 1
Apr 08 12:10:25 k8s-master3 etcd[4390]: the server is already initialized as member before, starting as etcd memb
Apr 08 12:10:25 k8s-master3 etcd[4390]: peerTLS: cert = /etc/etcd/cert/etcd.pem, key = /etc/etcd/cert/etcd-key.pe
Apr 08 12:10:25 k8s-master3 etcd[4390]: listening for peers on https://192.168.32.130:2380
Apr 08 12:10:25 k8s-master3 etcd[4390]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cer
Apr 08 12:10:25 k8s-master3 etcd[4390]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert
Apr 08 12:10:25 k8s-master3 etcd[4390]: listening for client requests on 127.0.0.1:2379
Apr 08 12:10:25 k8s-master3 etcd[4390]: listening for client requests on 192.168.32.130:2379
Apr 08 12:10:25 k8s-master3 etcd[4390]: recovered store from snapshot at index 3200034
Apr 08 12:10:25 k8s-master3 etcd[4390]: recovering backend from snapshot error: database snapshot file path error
Apr 08 12:10:25 k8s-master3 etcd[4390]: panic: recovering backend from snapshot error: database snapshot file pat
Apr 08 12:10:25 k8s-master3 etcd[4390]: panic: runtime error: invalid memory address or nil pointer dereference
Apr 08 12:10:25 k8s-master3 etcd[4390]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xbccc10]
Apr 08 12:10:25 k8s-master3 etcd[4390]: goroutine 1 [running]:
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewSe
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: panic(0xe1d2e0, 0xc4203eb9c0)
Apr 08 12:10:25 k8s-master3 etcd[4390]: /usr/local/go/src/runtime/panic.go:491 +0x283
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*Packag
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewSe
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEt
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEt
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 etcd[4390]: main.main()
Apr 08 12:10:25 k8s-master3 etcd[4390]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/et
Apr 08 12:10:25 k8s-master3 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 08 12:10:25 k8s-master3 systemd[1]: Failed to start Etcd Server.
Apr 08 12:10:25 k8s-master3 systemd[1]: Unit etcd.service entered failed state.
Apr 08 12:10:25 k8s-master3 systemd[1]: etcd.service failed.

The key errors:

Apr 08 12:10:25 k8s-master3 etcd[4390]: recovering backend from snapshot error: database snapshot file path error
Apr 08 12:10:25 k8s-master3 etcd[4390]: panic: recovering backend from snapshot error: database snapshot file pat
Apr 08 12:10:25 k8s-master3 etcd[4390]: panic: runtime error: invalid memory address or nil pointer dereference
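What this panic means is that the write-ahead log records a snapshot at some index, but the matching snapshot database file is missing from the data directory, so the backend cannot be rebuilt. Before deciding how to recover, it is worth looking at what is actually on disk. A sketch, run on the failing member (the path comes from this cluster's --data-dir=/var/lib/etcd):

```shell
# etcd v3 on-disk layout under --data-dir:
#   member/snap/ holds the *.snap metadata and the db file the panic complains about
#   member/wal/  holds the write-ahead log segments
ls -l /var/lib/etcd/member/snap/ 2>/dev/null || echo "no snap dir"
ls -l /var/lib/etcd/member/wal/  2>/dev/null || echo "no wal dir"
```

If snap/ has *.snap files but no db, this exact panic is what you get on restart.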

3.
Solution and reasoning:
Delete all the etcd data and re-initialize the cluster.

[root@k8s-master3 ~]# rm -rf /var/lib/etcd/*
[root@k8s-master3 ~]# systemctl daemon-reload && systemctl restart etcd
[root@k8s-master3 ~]# systemctl status etcd.service
● etcd.service - Etcd Server
   Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-04-08 13:57:39 CST; 9s ago
     Docs: https://github.com/coreos
 Main PID: 27342 (etcd)
   Memory: 34.7M
   CGroup: /system.slice/etcd.service
           └─27342 /opt/k8s/bin/etcd --data-dir=/var/lib/etcd --name=k8s-master3 --cert-file=/etc/etcd/cert/et...

Apr 08 13:57:46 k8s-master3 etcd[27342]: bd8793282fb7e56f [term 4] received MsgTimeoutNow from 5bac98ba27...ship.
Apr 08 13:57:46 k8s-master3 etcd[27342]: bd8793282fb7e56f became candidate at term 5
Apr 08 13:57:46 k8s-master3 etcd[27342]: bd8793282fb7e56f received MsgVoteResp from bd8793282fb7e56f at term 5
Apr 08 13:57:46 k8s-master3 etcd[27342]: bd8793282fb7e56f [logterm: 4, index: 43] sent MsgVote request to...erm 5
Apr 08 13:57:46 k8s-master3 etcd[27342]: bd8793282fb7e56f [logterm: 4, index: 43] sent MsgVote request to...erm 5
Apr 08 13:57:46 k8s-master3 etcd[27342]: raft.node: bd8793282fb7e56f lost leader 5bac98ba2781a51e at term 5
Apr 08 13:57:48 k8s-master3 etcd[27342]: bd8793282fb7e56f received MsgVoteResp from 5bac98ba2781a51e at term 5
Apr 08 13:57:48 k8s-master3 etcd[27342]: bd8793282fb7e56f [quorum:2] has received 2 MsgVoteResp votes and...tions
Apr 08 13:57:48 k8s-master3 etcd[27342]: bd8793282fb7e56f became leader at term 5
Apr 08 13:57:48 k8s-master3 etcd[27342]: raft.node: bd8793282fb7e56f elected leader bd8793282fb7e56f at term 5
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master3 ~]# etcdctl --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --endpoints=https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379 member list
5bac98ba2781a51e: name=k8s-master2 peerURLs=https://192.168.32.129:2380 clientURLs=https://192.168.32.129:2379 isLeader=true
bd8793282fb7e56f: name=k8s-master3 peerURLs=https://192.168.32.130:2380 clientURLs=https://192.168.32.130:2379 isLeader=false
bee1cc9618cefbee: name=k8s-master1 peerURLs=https://192.168.32.128:2380 clientURLs=https://192.168.32.128:2379 isLeader=false
[root@k8s-master3 ~]#
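In hindsight, wiping every member's data directory was heavier-handed than necessary. When only one member is broken, the usual recovery is to remove just that member from the cluster and re-add it, so it syncs a fresh copy from the leader and no data is lost. A hedged sketch using this cluster's cert paths and etcdctl v2 syntax; the member ID is an example taken from the listing above:

```shell
# Run against the surviving members only (the broken member's endpoint is excluded).
ETCDCTL="etcdctl --ca-file=/etc/kubernetes/cert/ca.pem \
  --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem \
  --endpoints=https://192.168.32.128:2379,https://192.168.32.130:2379"

$ETCDCTL member remove 5bac98ba2781a51e                       # drop the broken member by ID
$ETCDCTL member add k8s-master2 https://192.168.32.129:2380   # re-register it by name and peer URL

# Then on the broken node only: clear ITS data dir and restart etcd with
# --initial-cluster-state=existing so it rejoins and replicates from the leader.
rm -rf /var/lib/etcd/*
systemctl restart etcd
```

This keeps the other two members' data intact, so the cluster state survives.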

4.
With the etcd cluster healthy again, kubectl commands work.
But all the data is gone:

[root@k8s-master1 etcd]# kubectl get all
NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.254.0.1   <none>        443/TCP   45m
[root@k8s-master1 etcd]# kubectl get all -n kube-system
No resources found.
[root@k8s-master1 etcd]# kubectl get cs
NAME                 STATUS    MESSAGE             ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   {"health":"true"}
etcd-1               Healthy   {"health":"true"}
etcd-2               Healthy   {"health":"true"}
[root@k8s-master1 etcd]#
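All of that state was unrecoverable because there was no backup. Taking periodic snapshots makes even a full wipe survivable. A sketch under this cluster's setup (etcd 3.3.7, so ETCDCTL_API=3 is available; the backup directory is an assumption). Note the two stores are separate: kube-apiserver writes through the v3 API, while flanneld here uses v2 keys, so both need backing up:

```shell
# v3 keyspace (what kube-apiserver stores):
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/cert/ca.pem \
  --cert=/etc/etcd/cert/etcd.pem \
  --key=/etc/etcd/cert/etcd-key.pem \
  --endpoints=https://192.168.32.128:2379 \
  snapshot save "/var/backups/etcd-$(date +%F).db"

# v2 keyspace (what flanneld stores, given its -etcd-prefix usage):
etcdctl backup --data-dir /var/lib/etcd --backup-dir "/var/backups/etcd-v2-$(date +%F)"
```

Dropped into a daily cron job, either snapshot would have turned this incident into a restore instead of a rebuild.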

The apiserver is back to normal:

[root@k8s-master1 etcd]# systemctl status kube-apiserver
● kube-apiserver.service - Kubernetes API Server
   Loaded: loaded (/etc/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-04-08 13:57:56 CST; 46min ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 18534 (kube-apiserver)
   Memory: 223.9M
   CGroup: /system.slice/kube-apiserver.service
           └─18534 /opt/k8s/bin/kube-apiserver --enable-admission-plugins=Initializers,NamespaceLifecycle,...

Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.174401   18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.254198   18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.270269   18534 storage_rbac.go:215] c...tor
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.292810   18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.305491   18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.319763   18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.339471   18534 storage_rbac.go:215] c...ler
Apr 08 13:58:18 k8s-master1 kube-apiserver[18534]: I0408 13:58:18.642901   18534 controller.go:608] quo...es}
Apr 08 13:58:19 k8s-master1 kube-apiserver[18534]: I0408 13:58:19.011673   18534 controller.go:608] quo...gs}
Apr 08 13:58:19 k8s-master1 kube-apiserver[18534]: I0408 13:58:19.112267   18534 controller.go:608] quo...ts}
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master1 etcd]#

5.
Other components
Because all the data is gone, flanneld cannot read its network configuration from etcd, so flanneld cannot start.
And because flanneld cannot start, docker and kubelet cannot start either.
See below:

[root@k8s-master1 ~]# systemctl status flanneld -l
● flanneld.service - Flanneld overlay address etcd agent
   Loaded: loaded (/etc/systemd/system/flanneld.service; enabled; vendor preset: disabled)
   Active: activating (start) since Mon 2019-04-08 14:49:35 CST; 1min 1s ago
 Main PID: 2194 (flanneld)
   Memory: 40.5M
   CGroup: /system.slice/flanneld.service
           └─2194 /opt/k8s/bin/flanneld -etcd-cafile=/etc/kubernetes/cert/ca.pem -etcd-certfile=/etc/flanneld/cert/flanneld.pem -etcd-keyfile=/etc/flanneld/cert/flanneld-key.pem -etcd-endpoints=https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379 -etcd-prefix=/kubernetes/network -iface=ens33

Apr 08 14:50:32 k8s-master1 flanneld[2194]: timed out
Apr 08 14:50:32 k8s-master1 flanneld[2194]: E0408 14:50:32.209982    2194 main.go:349] Couldn't fetch network config: 100: Key not found (/kubernetes) [14]
Apr 08 14:50:33 k8s-master1 flanneld[2194]: timed out
Apr 08 14:50:33 k8s-master1 flanneld[2194]: E0408 14:50:33.215958    2194 main.go:349] Couldn't fetch network config: 100: Key not found (/kubernetes) [14]
Apr 08 14:50:34 k8s-master1 flanneld[2194]: timed out
Apr 08 14:50:34 k8s-master1 flanneld[2194]: E0408 14:50:34.220404    2194 main.go:349] Couldn't fetch network config: 100: Key not found (/kubernetes) [14]
Apr 08 14:50:35 k8s-master1 flanneld[2194]: timed out
Apr 08 14:50:35 k8s-master1 flanneld[2194]: E0408 14:50:35.224053    2194 main.go:349] Couldn't fetch network config: 100: Key not found (/kubernetes) [14]
Apr 08 14:50:36 k8s-master1 flanneld[2194]: timed out
Apr 08 14:50:36 k8s-master1 flanneld[2194]: E0408 14:50:36.237390    2194 main.go:349] Couldn't fetch network config: 100: Key not found (/kubernetes) [14]
[root@k8s-master1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://docs.docker.com
[root@k8s-master1 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: https://github.com/GoogleCloudPlatform/kubernetes

Apr 08 14:51:05 k8s-master1 systemd[1]: Dependency failed for Kubernetes Kubelet.
Apr 08 14:51:05 k8s-master1 systemd[1]: Job kubelet.service/start failed with result 'dependency'.
[root@k8s-master1 ~]#
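The failure chain (flanneld down, therefore docker down, therefore kubelet's 'dependency' failure) comes from the systemd unit wiring. In this kind of deployment the relevant excerpts typically look like the following; these are assumed, not copied verbatim from this cluster's unit files:

```
# docker.service
[Unit]
Requires=flanneld.service
After=flanneld.service
[Service]
EnvironmentFile=/run/flannel/docker   ; written by mk-docker-opts.sh in flanneld's ExecStartPost

# kubelet.service
[Unit]
Requires=docker.service
After=docker.service
```

With Requires=, a failed dependency aborts the whole chain from the bottom up, which is exactly the "Job kubelet.service/start failed with result 'dependency'" message in the log above.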

Write the flannel network configuration back into etcd, see below:

[root@k8s-master1 ~]# source /opt/k8s/bin/environment.sh
[root@k8s-master1 ~]# echo ${ETCD_ENDPOINTS}
https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379
[root@k8s-master1 ~]#
[root@k8s-master1 ~]# etcdctl --endpoints=${ETCD_ENDPOINTS} --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/flanneld/cert/flanneld.pem --key-file=/etc/flanneld/cert/flanneld-key.pem set ${FLANNEL_ETCD_PREFIX}/config '{"Network":"'${CLUSTER_CIDR}'",
"SubnetLen": 24, "Backend": {"Type": "vxlan"}}'
{"Network":"172.30.0.0/16",
"SubnetLen": 24, "Backend": {"Type": "vxlan"}}
[root@k8s-master1 ~]# etcdctl --endpoints=${ETCD_ENDPOINTS} --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/flanneld/cert/flanneld.pem --key-file=/etc/flanneld/cert/flanneld-key.pem ls
/kubernetes
[root@k8s-master1 ~]# etcdctl --endpoints=${ETCD_ENDPOINTS} --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/flanneld/cert/flanneld.pem --key-file=/etc/flanneld/cert/flanneld-key.pem ls /kubernetes
/kubernetes/network
[root@k8s-master1 ~]# etcdctl --endpoints=${ETCD_ENDPOINTS} --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/flanneld/cert/flanneld.pem --key-file=/etc/flanneld/cert/flanneld-key.pem ls /kubernetes/network
/kubernetes/network/config
/kubernetes/network/subnets
[root@k8s-master1 ~]# etcdctl --endpoints=${ETCD_ENDPOINTS} --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/flanneld/cert/flanneld.pem --key-file=/etc/flanneld/cert/flanneld-key.pem ls /kubernetes/network/subnets
/kubernetes/network/subnets/172.30.45.0-24
/kubernetes/network/subnets/172.30.79.0-24
/kubernetes/network/subnets/172.30.96.0-24
/kubernetes/network/subnets/172.30.27.0-24
[root@k8s-master1 ~]#

Once the config is written, all the services come up:

[root@k8s-master1 ~]# systemctl status flanneld
● flanneld.service - Flanneld overlay address etcd agent
   Loaded: loaded (/etc/systemd/system/flanneld.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-04-08 14:56:11 CST; 1min 41s ago
  Process: 3616 ExecStartPost=/opt/k8s/bin/mk-docker-opts.sh -k DOCKER_NETWORK_OPTIONS -d /run/flannel/docker (code=exited, status=0/SUCCESS)
 Main PID: 3496 (flanneld)
   Memory: 6.6M
   CGroup: /system.slice/flanneld.service
           └─3496 /opt/k8s/bin/flanneld -etcd-cafile=/etc/kubernetes/cert/ca.pem -etcd-certfile=/etc/flann...

Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:10.998843    3496 main.go:300] Wrote subnet fi....env
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:10.998850    3496 main.go:304] Running backend.
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.008631    3496 iptables.go:115] Some iptabl...ules
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.008643    3496 iptables.go:137] Deleting ip...CEPT
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.017130    3496 vxlan_network.go:60] watchin...ases
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.017744    3496 main.go:396] Waiting for 22h...ease
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.020469    3496 iptables.go:137] Deleting ip...CEPT
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.022207    3496 iptables.go:125] Adding ipta...CEPT
Apr 08 14:56:11 k8s-master1 flanneld[3496]: I0408 14:56:11.031517    3496 iptables.go:125] Adding ipta...CEPT
Apr 08 14:56:11 k8s-master1 systemd[1]: Started Flanneld overlay address etcd agent.
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-04-08 14:56:12 CST; 1min 59s ago
     Docs: https://docs.docker.com
 Main PID: 3641 (dockerd)
   Memory: 68.8M
   CGroup: /system.slice/docker.service
           ├─3641 /usr/bin/dockerd --log-level=error --bip=172.30.45.1/24 --ip-masq=true --mtu=1450
           └─3647 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock --metri...

Apr 08 14:56:11 k8s-master1 systemd[1]: Starting Docker Application Container Engine...
Apr 08 14:56:12 k8s-master1 dockerd[3641]: time="2019-04-08T14:56:12.329436766+08:00" level=error msg=...ist"
Apr 08 14:56:12 k8s-master1 systemd[1]: Started Docker Application Container Engine.
Apr 08 14:56:13 k8s-master1 dockerd[3641]: time="2019-04-08T14:56:13.902502953+08:00" level=error msg=...ped"
Apr 08 14:56:14 k8s-master1 dockerd[3641]: time="2019-04-08T14:56:14.163449193+08:00" level=error msg=...071"
Apr 08 14:56:14 k8s-master1 dockerd[3641]: time="2019-04-08T14:56:14.164426984+08:00" level=error msg=...071"
Apr 08 14:57:13 k8s-master1 dockerd[3641]: time="2019-04-08T14:57:13.945446893+08:00" level=error msg=...ped"
Apr 08 14:57:13 k8s-master1 dockerd[3641]: time="2019-04-08T14:57:13.953866536+08:00" level=error msg=...ped"
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master1 ~]# systemctl status kubelet
● kubelet.service - Kubernetes Kubelet
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2019-04-08 14:56:12 CST; 2min 7s ago
     Docs: https://github.com/GoogleCloudPlatform/kubernetes
 Main PID: 3768 (kubelet)
   Memory: 126.8M
   CGroup: /system.slice/kubelet.service
           └─3768 /opt/k8s/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/kubelet-bootstrap.kubeconfig...

Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.299679    3768 reconciler.go:412] Reconciler syn...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406547    3768 reconciler.go:181] operationExecu...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406580    3768 reconciler.go:181] operationE...89")
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406595    3768 reconciler.go:181] operationExecu...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406715    3768 operation_generator.go:698] Unmou...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406782    3768 operation_generator.go:698] Unmou...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.406978    3768 operation_generator.go:698] Unmou...
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.507212    3768 reconciler.go:301] Volume det...h ""
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.507243    3768 reconciler.go:301] Volume det...h ""
Apr 08 14:56:14 k8s-master1 kubelet[3768]: I0408 14:56:14.507261    3768 reconciler.go:301] Volume det...h ""
Hint: Some lines were ellipsized, use -l to show in full.
[root@k8s-master1 ~]#

6.
Thoughts on the root cause:

Why did the cluster suddenly crash? Calling it "inexplicable" would be fooling myself.
A few days earlier there had been an etcd failure, and I performed a manual fix at that time.
The record of that earlier incident follows:

The etcd cluster reported errors:

[root@k8s-master1 ~]# kubectl get cs
NAME                 STATUS      MESSAGE                                                                                              ERROR
etcd-1               Unhealthy   Get https://192.168.32.129:2379/health: dial tcp 192.168.32.129:2379: connect: connection refused
scheduler            Healthy     ok
controller-manager   Healthy     ok
etcd-0               Healthy     {"health":"true"}
etcd-2               Healthy     {"health":"true"}
[root@k8s-master1 ~]#
[root@k8s-master1 cert]# etcdctl --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --endpoints=https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379 cluster-health
failed to check the health of member 5bac98ba2781a51e on https://192.168.32.129:2379: Get https://192.168.32.129:2379/health: dial tcp 192.168.32.129:2379: getsockopt: connection refused
member 5bac98ba2781a51e is unreachable: [https://192.168.32.129:2379] are all unreachable
member bd8793282fb7e56f is healthy: got healthy result from https://192.168.32.130:2379
member bee1cc9618cefbee is healthy: got healthy result from https://192.168.32.128:2379
cluster is degraded
[root@k8s-master1 cert]#
[root@k8s-master1 cert]# etcdctl --ca-file=/etc/kubernetes/cert/ca.pem --cert-file=/etc/etcd/cert/etcd.pem --key-file=/etc/etcd/cert/etcd-key.pem --endpoints=https://192.168.32.128:2379,https://192.168.32.129:2379,https://192.168.32.130:2379 member list
5bac98ba2781a51e: name=k8s-master2 peerURLs=https://192.168.32.129:2380 clientURLs=https://192.168.32.129:2379 isLeader=false
bd8793282fb7e56f: name=k8s-master3 peerURLs=https://192.168.32.130:2380 clientURLs=https://192.168.32.130:2379 isLeader=false
bee1cc9618cefbee: name=k8s-master1 peerURLs=https://192.168.32.128:2380 clientURLs=https://192.168.32.128:2379 isLeader=true
[root@k8s-master1 cert]#

The etcd logs:

journalctl -u etcd

Mar 28 09:55:11 k8s-master2 systemd[1]: Starting Etcd Server...
Mar 28 09:55:11 k8s-master2 etcd[2415]: etcd Version: 3.3.7
Mar 28 09:55:11 k8s-master2 etcd[2415]: Git SHA: 56536de55
Mar 28 09:55:11 k8s-master2 etcd[2415]: Go Version: go1.9.6
Mar 28 09:55:11 k8s-master2 etcd[2415]: Go OS/Arch: linux/amd64
Mar 28 09:55:11 k8s-master2 etcd[2415]: setting maximum number of CPUs to 1, total number of available CPUs is 1
Mar 28 09:55:11 k8s-master2 etcd[2415]: the server is already initialized as member before, starting as etcd member...
Mar 28 09:55:11 k8s-master2 etcd[2415]: peerTLS: cert = /etc/etcd/cert/etcd.pem, key = /etc/etcd/cert/etcd-key.pem, ca = , trusted-ca = /etc/kube
Mar 28 09:55:11 k8s-master2 etcd[2415]: listening for peers on https://192.168.32.129:2380
Mar 28 09:55:11 k8s-master2 etcd[2415]: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored k
Mar 28 09:55:11 k8s-master2 etcd[2415]: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is ena
Mar 28 09:55:11 k8s-master2 etcd[2415]: listening for client requests on 127.0.0.1:2379
Mar 28 09:55:11 k8s-master2 etcd[2415]: listening for client requests on 192.168.32.129:2379
Mar 28 09:55:12 k8s-master2 etcd[2415]: recovered store from snapshot at index 2000020
Mar 28 09:55:12 k8s-master2 etcd[2415]: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't ex
Mar 28 09:55:12 k8s-master2 etcd[2415]: panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doe
Mar 28 09:55:12 k8s-master2 etcd[2415]: panic: runtime error: invalid memory address or nil pointer dereference
Mar 28 09:55:12 k8s-master2 etcd[2415]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xbccc10]
Mar 28 09:55:12 k8s-master2 etcd[2415]: goroutine 1 [running]:
Mar 28 09:55:12 k8s-master2 etcd[2415]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc4201a3c88, 0xc4201
Mar 28 09:55:12 k8s-master2 etcd[2415]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/
Mar 28 09:55:12 k8s-master2 etcd[2415]: panic(0xe1d2e0, 0xc42034af00)
Mar 28 09:55:12 k8s-master2 etcd[2415]: /usr/local/go/src/runtime/panic.go:491 +0x283
Mar 28 09:55:12 k8s-master2 etcd[2415]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420171ea0, 0x
Mar 28 09:55:12 k8s-master2 etcd[2415]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/
Mar 28 09:55:12 k8s-master2 etcd[2415]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0x7ffdfe599c5f, 0xb, 0x0, 0
Mar 28 09:55:12 k8s-master2 etcd[2415]: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/
Mar 28 09:55:12 k8s-master2 etcd[2415]: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc4201d8000, 0xc4201d8480, 0x0,
Mar 28 09:55:12 k8s-master2 systemd[1]: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 28 09:55:12 k8s-master2 systemd[1]: Failed to start Etcd Server.
Mar 28 09:55:12 k8s-master2 systemd[1]: Unit etcd.service entered failed state.
Mar 28 09:55:12 k8s-master2 systemd[1]: etcd.service failed.
Mar 28 09:55:17 k8s-master2 systemd[1]: etcd.service holdoff time over, scheduling restart.
And the same panic in /var/log/messages:

Mar 28 10:56:53 k8s-master2 systemd: Starting Etcd Server...
Mar 28 10:56:53 k8s-master2 etcd: etcd Version: 3.3.7
Mar 28 10:56:53 k8s-master2 etcd: Git SHA: 56536de55
Mar 28 10:56:53 k8s-master2 etcd: Go Version: go1.9.6
Mar 28 10:56:53 k8s-master2 etcd: Go OS/Arch: linux/amd64
Mar 28 10:56:53 k8s-master2 etcd: setting maximum number of CPUs to 1, total number of available CPUs is 1
Mar 28 10:56:53 k8s-master2 etcd: the server is already initialized as member before, starting as etcd member...
Mar 28 10:56:53 k8s-master2 etcd: peerTLS: cert = /etc/etcd/cert/etcd.pem, key = /etc/etcd/cert/etcd-key.pem, ca = , trusted-ca = /etc/kubernetes/cert/ca.pem, client-cert-auth = true, crl-file =
Mar 28 10:56:53 k8s-master2 etcd: listening for peers on https://192.168.32.129:2380
Mar 28 10:56:53 k8s-master2 etcd: The scheme of client url http://127.0.0.1:2379 is HTTP while peer key/cert files are presented. Ignored key/cert files.
Mar 28 10:56:53 k8s-master2 etcd: The scheme of client url http://127.0.0.1:2379 is HTTP while client cert auth (--client-cert-auth) is enabled. Ignored client cert auth for this url.
Mar 28 10:56:53 k8s-master2 etcd: listening for client requests on 127.0.0.1:2379
Mar 28 10:56:53 k8s-master2 etcd: listening for client requests on 192.168.32.129:2379
Mar 28 10:56:53 k8s-master2 etcd: recovered store from snapshot at index 2000020
Mar 28 10:56:53 k8s-master2 etcd: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
Mar 28 10:56:53 k8s-master2 etcd: panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
Mar 28 10:56:53 k8s-master2 etcd: panic: runtime error: invalid memory address or nil pointer dereference
Mar 28 10:56:53 k8s-master2 etcd: [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xbccc10]
Mar 28 10:56:53 k8s-master2 etcd: goroutine 1 [running]:
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc4201a3c88, 0xc4201a3868)
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:290 +0x40
Mar 28 10:56:53 k8s-master2 etcd: panic(0xe1d2e0, 0xc4202a1150)
Mar 28 10:56:53 k8s-master2 etcd: /usr/local/go/src/runtime/panic.go:491 +0x283
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420171ea0, 0x1011c8c, 0x2a, 0xc4201a38f8, 0x1, 0x1)
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x16d
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0x7ffc7f287c5f, 0xb, 0x0, 0x0, 0x0, 0x0, 0xc4200db600, 0x1, 0x1, 0xc4200db400, ...)
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:385 +0x2b18
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc4201e4000, 0xc4201e4480, 0x0, 0x0)
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:179 +0x870
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc4201e4000, 0x6, 0xff05c7, 0x6, 0x1)
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:181 +0x40
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:102 +0x151e
Mar 28 10:56:53 k8s-master2 etcd: github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
Mar 28 10:56:53 k8s-master2 systemd: etcd.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/main.go:46 +0x3f
Mar 28 10:56:53 k8s-master2 etcd: main.main()
Mar 28 10:56:53 k8s-master2 etcd: /tmp/etcd-release-3.3.7/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20
Mar 28 10:56:53 k8s-master2 systemd: Failed to start Etcd Server.
Mar 28 10:56:53 k8s-master2 systemd: Unit etcd.service entered failed state.
Mar 28 10:56:53 k8s-master2 systemd: etcd.service failed.

From these error messages:

Mar 28 10:56:53 k8s-master2 etcd: recovered store from snapshot at index 2000020
Mar 28 10:56:53 k8s-master2 etcd: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
Mar 28 10:56:53 k8s-master2 etcd: panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist

we can see that etcd cannot find the database snapshot file.

Where is this data stored?

[root@k8s-master2 snap]# ll
total 3768
-rw-r--r-- 1 k8s k8s   86575 Mar 21 12:21 000000000000077f-0000000000186a10.snap
-rw-r--r-- 1 k8s k8s   90505 Mar 25 10:42 000000000000078a-000000000019f0b1.snap
-rw-r--r-- 1 k8s k8s   90505 Mar 25 17:08 000000000000078a-00000000001b7752.snap
-rw-r--r-- 1 k8s k8s   93947 Mar 26 14:31 00000000000007a2-00000000001cfdf3.snap
-rw-r--r-- 1 k8s k8s   97869 Mar 27 12:52 00000000000007a7-00000000001e8494.snap
-rw------- 1 k8s k8s 3387392 Mar 28 11:17 db
[root@k8s-master2 snap]# pwd
/var/lib/etcd/member/snap
[root@k8s-master2 snap]#
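The panic means etcd found `.snap` index files in this directory but could not open the matching bolt database file `db`. A quick sanity check for that inconsistent state might look like this (a sketch; the data-dir path is an assumption matching this cluster's layout):

```shell
#!/usr/bin/env bash
# Sanity-check an etcd member's snapshot directory (sketch; the
# default DATA_DIR below is an assumption for this cluster).
DATA_DIR="${DATA_DIR:-/var/lib/etcd}"
SNAP_DIR="$DATA_DIR/member/snap"

check_snap_dir() {
    local dir="$1"
    # etcd 3.x panics at startup when *.snap index files exist in the
    # snap dir but the database file "db" they point at is missing.
    if ls "$dir"/*.snap >/dev/null 2>&1 && [ ! -f "$dir/db" ]; then
        echo "BROKEN: .snap files present but db is missing"
        return 1
    fi
    echo "OK"
}

check_snap_dir "$SNAP_DIR"
```

Running this on each member before a restart would have flagged the broken node immediately.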

The approach back then was the same as today's: I deleted all of etcd1's data and re-initialized it.

[root@k8s-master2 ~]# rm -rf /var/lib/etcd/*

I suspect it was exactly that operation that left the etcd cluster's data incomplete and eventually triggered today's full cluster crash.
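Wiping a member's data directory while it is still registered in the cluster is what leaves the raft state inconsistent. A safer replacement sequence deregisters the member first, then wipes and re-adds it. A sketch (the endpoints, member name, and peer URL are assumptions based on this cluster):

```shell
#!/usr/bin/env bash
# Safer etcd member replacement (sketch; endpoints/names are assumed).

member_id_for() {
    # Parse a member's ID from `etcdctl member list` output, whose lines
    # look like: "8e9e05c52164694d, started, k8s-master2, <peer>, <client>"
    awk -F', ' -v name="$2" '$3 == name {print $1}' <<<"$1"
}

replace_member() {
    local endpoints="$1" name="$2" peer_url="$3" id
    id=$(member_id_for "$(etcdctl --endpoints "$endpoints" member list)" "$name")
    # 1. Deregister the broken member BEFORE touching its data dir.
    etcdctl --endpoints "$endpoints" member remove "$id"
    # 2. Only now, on the broken node:  rm -rf /var/lib/etcd/*
    # 3. Re-add it; the restarted etcd process must be started with
    #    --initial-cluster-state=existing, not "new".
    etcdctl --endpoints "$endpoints" member add "$name" --peer-urls="$peer_url"
}

# Example (run against a healthy endpoint):
#   ETCDCTL_API=3 replace_member https://192.168.32.128:2379 \
#       k8s-master2 https://192.168.32.129:2380
```

The key difference from the `rm -rf` shortcut is that the surviving members agree the old member is gone before the empty replacement joins.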

7.
Because there was no backup of the etcd data, everything that had been running, all the pods, services and so on, was lost for good.
The recovery attempt was a complete failure.
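The hard lesson here is to take etcd snapshots on a schedule, so a node whose data directory is destroyed can be restored instead of costing the whole cluster. A minimal backup sketch using the standard `etcdctl snapshot save` mechanism (the endpoint, cert paths, and backup directory are assumptions taken from this cluster's configuration):

```shell
#!/usr/bin/env bash
# Minimal etcd v3 snapshot-backup sketch. Endpoint, cert paths and
# BACKUP_DIR are assumptions; adjust to your environment.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"

backup_filename() {
    # Pure helper: timestamped snapshot name, e.g. etcd-20190408-113000.db
    echo "etcd-$1.db"
}

take_backup() {
    local file
    file="$BACKUP_DIR/$(backup_filename "$(date +%Y%m%d-%H%M%S)")"
    mkdir -p "$BACKUP_DIR"
    ETCDCTL_API=3 etcdctl \
        --endpoints https://192.168.32.129:2379 \
        --cacert /etc/kubernetes/cert/ca.pem \
        --cert /etc/etcd/cert/etcd.pem \
        --key /etc/etcd/cert/etcd-key.pem \
        snapshot save "$file"
}

# Run take_backup from cron, e.g.:  0 */6 * * * /opt/k8s/bin/etcd-backup.sh
# Restore later with:  etcdctl snapshot restore <file> --data-dir <new-dir>
```

With periodic snapshots in place, losing one member's data directory becomes a routine restore rather than a total loss of all pods and services.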

Original article: https://blog.51cto.com/goome/2375348
