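From the k8s architecture we know that the metadata for every API operation in k8s is ultimately stored in etcd. So as long as the data in etcd survives, the whole k8s cluster can in theory be restored.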
Accessing etcd in k8s:
If k8s was installed with kubeadm, etcd runs in a container. Log in to the master and run the following commands:
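| # Find the etcd container (grep -v POD drops the pause container)
ETCD=$(docker ps | grep etcd | grep -v POD | awk '{print $1}')
# List all members (ETCDCTL_API=3 is set because the TLS flags used here are v3-only)
docker exec -it -e ETCDCTL_API=3 ${ETCD} etcdctl --endpoints https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member list
# Check the health of every member. cluster-health exists only in the v2 API; the v3 equivalent is
# endpoint health (--cluster needs etcd 3.3+). Match the unhealthy endpoint against the member list output to get its ID.
docker exec -it -e ETCDCTL_API=3 ${ETCD} etcdctl --endpoints https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key endpoint health --cluster
# Remove the unhealthy member (replace the example ID with the one found above)
docker exec -it -e ETCDCTL_API=3 ${ETCD} etcdctl --endpoints https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key member remove fe28bd7d83b05920
|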
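Picking the member ID out of the member list output by eye is error-prone, so here is a minimal sketch of doing it with awk. It assumes the default comma-separated output of etcdctl member list (ID, status, name, peer URLs, client URLs). The function name member_id_by_name and the sample data are made up for illustration.

```shell
#!/usr/bin/env bash
# Print the ID of the member named $2, given `etcdctl member list` text in $1.
# Assumes the default "simple" output: ID, status, name, peer URLs, client URLs.
member_id_by_name() {
  printf '%s\n' "$1" | awk -F', ' -v name="$2" '$3 == name { print $1 }'
}

# Illustrative sample only: the IDs, names and addresses are invented.
sample='8e9e05c52164694d, started, master-1, https://10.0.0.1:2380, https://10.0.0.1:2379
fe28bd7d83b05920, started, master-2, https://10.0.0.2:2380, https://10.0.0.2:2379'

member_id_by_name "$sample" master-2   # prints fe28bd7d83b05920
```

In real use the first argument would be the captured output of the member list command, and the result would be passed to member remove.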
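Data recovery scenario 1:
The etcd cluster is still reachable.
In this case, use etcdctl to take a snapshot, start a new etcd cluster, and restore that cluster from the snapshot file.
Commands: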
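| # Take a snapshot (the TLS flags are only needed when $ENDPOINT is an https endpoint):
ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT \
--cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key \
snapshot save snapshot.db
# Restore the data for the new cluster. Restore only reads the local file, so it takes no endpoint or TLS flags.
# Run it on every member with that member's own --name and --initial-advertise-peer-urls.
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
--name m1 \
--initial-cluster m1=http://host1:2380,m2=http://host2:2380,m3=http://host3:2380 \
--initial-cluster-token etcd-cluster-1 \
--initial-advertise-peer-urls http://host1:2380
|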
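The restore has to be repeated for each member of the new cluster, changing only the member name and its peer URL. The sketch below prints those commands instead of running them, so they can be reviewed first. The helper name print_restore_commands is hypothetical; the member names and URLs are the placeholders from the example above.

```shell
#!/usr/bin/env bash
# Print one `snapshot restore` command per member of the new cluster.
# Arguments: snapshot file, cluster token, then name=peer-url pairs.
print_restore_commands() {
  local snapshot=$1 token=$2
  shift 2
  local cluster pair
  cluster=$(IFS=,; printf '%s' "$*")   # join all pairs with commas
  for pair in "$@"; do
    # ${pair%%=*} is the member name, ${pair#*=} is its peer URL
    printf 'ETCDCTL_API=3 etcdctl snapshot restore %s --name %s --initial-cluster %s --initial-cluster-token %s --initial-advertise-peer-urls %s\n' \
      "$snapshot" "${pair%%=*}" "$cluster" "$token" "${pair#*=}"
  done
}

print_restore_commands snapshot.db etcd-cluster-1 \
  m1=http://host1:2380 m2=http://host2:2380 m3=http://host3:2380
```

Each printed line is meant to be run on the host of the member it names, with the same snapshot file copied there.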
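Data recovery scenario 2:
The etcd cluster is no longer reachable. The only option is then to restore from the db file that etcd maintains while running. It is the member/snap/db file under the etcd data directory; for a kubeadm install, that is /var/lib/etcd/member/snap/db.
When restoring from the db file, add the --skip-hash-check flag. The official explanation: If the snapshot is copied from the data directory, there is no integrity hash and it will only restore by using --skip-hash-check.
Commands: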
| # Restore from the raw db file; member_bak is presumably the old member directory, moved aside beforehand.
# Restore only reads the local file, so no TLS flags are needed; --skip-hash-check is required
# because a db file copied from the data directory carries no integrity hash.
docker exec -it -e ETCDCTL_API=3 ${ETCD} etcdctl snapshot restore /var/lib/etcd/member_bak/snap/db \
--name=ecs-db98-0003 \
--initial-advertise-peer-urls=https://host:2380 \
--initial-cluster=ecs-db98-0003=https://host:2380 \
--data-dir=/var/lib/etcd \
--skip-hash-check
|
Before restoring, /var/lib/etcd must be an empty directory.
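A guard like the following can catch a non-empty target before the restore is attempted. This is only a sketch: dir_is_empty and check_restore_target are hypothetical helper names, not part of etcdctl.

```shell
#!/usr/bin/env bash
# Succeed only if $1 is an existing directory with no entries (hidden ones included).
dir_is_empty() {
  [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Report whether the restore target is usable; return non-zero when it is not.
check_restore_target() {
  if dir_is_empty "$1"; then
    echo "ok: $1 is empty"
  else
    echo "refusing: $1 is missing or not empty" >&2
    return 1
  fi
}
```

For example, running check_restore_target /var/lib/etcd before the restore command lets a script stop early instead of failing halfway.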
After the restore completes, start the other k8s components and use commands such as kubectl get nodes to check how well the cluster has recovered.
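On a cluster with many nodes, it can help to filter the node list down to the ones that have not come back. The sketch below assumes the text comes from kubectl get nodes --no-headers, where the second column is the node status. The function name not_ready_nodes and the sample rows are invented for illustration.

```shell
#!/usr/bin/env bash
# Print the names of nodes whose STATUS column does not start with "Ready".
# Input: the text of `kubectl get nodes --no-headers` in $1.
not_ready_nodes() {
  printf '%s\n' "$1" | awk 'NF && $2 !~ /^Ready/ { print $1 }'
}

# Illustrative sample only: node names, ages and versions are made up.
sample='master-1 Ready control-plane,master 30d v1.20.4
node-1 NotReady <none> 30d v1.20.4
node-2 Ready,SchedulingDisabled <none> 30d v1.20.4'

not_ready_nodes "$sample"   # prints node-1
```

A cordoned node (Ready,SchedulingDisabled) is treated as ready here, since its kubelet is reporting in; adjust the pattern if that should count as a problem.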
etcd snapshot backup:
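| # Timestamp without spaces, so the snapshot file name stays a single argument
DATE=$(date "+%Y-%m-%d_%H%M%S")
# With kubeadm, /var/lib/etcd in the container is a hostPath mount, so the file also appears on the host
docker exec -it -e ETCDCTL_API=3 ${ETCD} etcdctl --endpoints https://10.0.107.85:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key snapshot save /var/lib/etcd/snapshot_${DATE}.db
|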
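If the backup runs on a schedule, old snapshots pile up. The following sketch keeps only the newest few. It assumes files are named snapshot_<timestamp>.db with a sortable timestamp and no spaces; snapshot_name and prune_snapshots are hypothetical helpers, and the retention count is up to you.

```shell
#!/usr/bin/env bash
# Build a snapshot file name with a sortable timestamp and no spaces.
snapshot_name() {
  printf 'snapshot_%s.db\n' "$(date '+%Y-%m-%d_%H%M%S')"
}

# Delete all but the $2 newest snapshot_*.db files in directory $1.
# Relies on the timestamped names sorting chronologically.
prune_snapshots() {
  local dir=$1 keep=$2
  ls "$dir" | grep '^snapshot_.*\.db$' | sort -r | tail -n "+$((keep + 1))" |
    while IFS= read -r f; do
      rm -f -- "$dir/$f"
    done
}
```

For example, prune_snapshots /var/lib/etcd 7 after each backup would keep the seven most recent snapshot files in that directory.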