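From the k8s architecture we know that the metadata for every API operation in k8s is ultimately stored in etcd. So as long as the data in etcd survives, the whole k8s cluster can in theory be restored.

To access etcd in k8s: if k8s was installed with kubeadm, etcd runs in a container. Log in to the master and run the following commands: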

ETCD=`docker ps|grep etcd|grep -v POD|awk '{print $1}'`

# List all members
docker exec   -it ${ETCD}   etcdctl   --endpoints https://127.0.0.1:2379   --cacert=/etc/kubernetes/pki/etcd/ca.crt   --cert=/etc/kubernetes/pki/etcd/server.crt   --key=/etc/kubernetes/pki/etcd/server.key   member list

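# Find the unhealthy endpoint, then look up its member ID in the "member list" output
# (cluster-health only exists in the v2 API; with the v3 flags used here the equivalent is "endpoint health")
docker exec   -it ${ETCD}   etcdctl   --endpoints https://127.0.0.1:2379   --cacert=/etc/kubernetes/pki/etcd/ca.crt   --cert=/etc/kubernetes/pki/etcd/server.crt   --key=/etc/kubernetes/pki/etcd/server.key   endpoint health --cluster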

# Remove the unhealthy member
docker exec   -it ${ETCD}   etcdctl   --endpoints https://127.0.0.1:2379   --cacert=/etc/kubernetes/pki/etcd/ca.crt   --cert=/etc/kubernetes/pki/etcd/server.crt   --key=/etc/kubernetes/pki/etcd/server.key   member remove fe28bd7d83b05920
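
The three commands above repeat the same endpoint and certificate flags. As a minimal sketch, they could be wrapped in a small shell function. The function name `etcdctl_k8s` and the `DOCKER`, `PKI_DIR` and `ENDPOINT` variables are hypothetical names introduced here for illustration; the certificate paths are the ones used above, and `ETCD` is assumed to hold the container ID obtained earlier.

```shell
# Hypothetical helper: run etcdctl inside the etcd container with the
# certificate paths used above. DOCKER can be overridden (e.g. DOCKER=echo)
# to print the command instead of running it.
DOCKER="${DOCKER:-docker}"
PKI_DIR="${PKI_DIR:-/etc/kubernetes/pki/etcd}"
ENDPOINT="${ENDPOINT:-https://127.0.0.1:2379}"

etcdctl_k8s() {
  # ETCD is assumed to be set to the etcd container ID beforehand.
  "$DOCKER" exec "${ETCD}" etcdctl \
    --endpoints "${ENDPOINT}" \
    --cacert="${PKI_DIR}/ca.crt" \
    --cert="${PKI_DIR}/server.crt" \
    --key="${PKI_DIR}/server.key" \
    "$@"
}
```

With this in place, `etcdctl_k8s member list` or `etcdctl_k8s member remove <ID>` would stand in for the long command lines.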

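Data recovery scenario 1:

The etcd cluster is still reachable.

In this case, use etcdctl to take a snapshot, start a new etcd cluster, and restore that cluster from the snapshot file. Commands: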

# Take a snapshot:
ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save snapshot.db

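# Restore the data on the new cluster (the subcommand is "snapshot restore"; it reads the local file, so no endpoint is passed):
ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \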
  --name m1 \
  --initial-cluster m1=http://host1:2380,m2=http://host2:2380,m3=http://host3:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls http://host1:2380 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt   --cert=/etc/kubernetes/pki/etcd/server.crt   --key=/etc/kubernetes/pki/etcd/server.key
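
Before restoring, it can be worth checking that the snapshot file is usable. The following is a rough sketch, not part of the original procedure: `check_snapshot` and the overridable `ETCDCTL` variable are hypothetical names, while `snapshot status` is the etcdctl v3 subcommand that prints the hash, revision, key count and size of a snapshot file.

```shell
# ETCDCTL can be overridden, e.g. to point at a specific etcdctl binary.
ETCDCTL="${ETCDCTL:-etcdctl}"

# Hypothetical helper: fail early if the snapshot is missing or empty,
# otherwise print its hash, revision, total keys and size.
check_snapshot() {
  snap="$1"
  if [ ! -s "$snap" ]; then
    echo "snapshot $snap is missing or empty" >&2
    return 1
  fi
  ETCDCTL_API=3 "$ETCDCTL" snapshot status "$snap" --write-out=table
}
```

For example, `check_snapshot snapshot.db && echo "ok to restore"`.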

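Data recovery scenario 2:

The etcd cluster is no longer reachable. The only option left is to restore from the database file that etcd writes while running. It is the member/snap/db file under the etcd data directory; for a kubeadm install, that is /var/lib/etcd/member/snap/db. When restoring from this db file, the --skip-hash-check flag has to be added. The official explanation: If the snapshot is copied from the data directory, there is no integrity hash and it will only restore by using --skip-hash-check.

Command: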

docker exec   -it ${ETCD}   etcdctl snapshot restore /var/lib/etcd/member_bak/snap/db \
--name=ecs-db98-0003 \
--initial-advertise-peer-urls=https://host:2380 \
--initial-cluster=ecs-db98-0003=https://host:2380 \
--data-dir=/var/lib/etcd \
--cacert=/etc/kubernetes/pki/etcd/ca.crt   --cert=/etc/kubernetes/pki/etcd/server.crt   --key=/etc/kubernetes/pki/etcd/server.key  --skip-hash-check

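Before restoring, /var/lib/etcd must be an empty directory.

Once the restore has finished, start the other k8s components and check the state of the cluster with commands such as kubectl get nodes.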
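
The API server may take a while to come back after the restore, so the check can be retried instead of run once. A minimal sketch, assuming a hypothetical `retry` helper (not part of kubectl or etcd) that reruns a command until it succeeds or the attempts run out:

```shell
# Hypothetical helper: retry <attempts> <delay-seconds> <command...>
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while ! "$@"; do
    if [ "$i" -ge "$attempts" ]; then
      return 1
    fi
    i=$((i + 1))
    sleep "$delay"
  done
}
```

For example, `retry 30 10 kubectl get nodes` would poll every 10 seconds, up to 30 times.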

etcd snapshot backup:

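# No space in the timestamp: an unquoted ${DATE} containing a space would split the file name into two arguments
DATE=$(date "+%Y-%m-%d_%H%M%S")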
docker exec   -it ${ETCD}   etcdctl   --endpoints https://10.0.107.85:2379   --cacert=/etc/kubernetes/pki/etcd/ca.crt   --cert=/etc/kubernetes/pki/etcd/server.crt   --key=/etc/kubernetes/pki/etcd/server.key snapshot save /var/lib/etcd/snapshot_${DATE}.db
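
If this backup runs on a schedule, old snapshots pile up in the data directory. One possible way to cap them, sketched under the assumption that files are named `snapshot_<timestamp>.db` with a sortable timestamp such as `%Y-%m-%d_%H%M%S`; `prune_snapshots` is a hypothetical helper, not an etcd command:

```shell
# Hypothetical helper: prune_snapshots <dir> <keep>
# Keeps the newest <keep> snapshot_*.db files in <dir> and deletes the rest.
prune_snapshots() {
  dir="$1"
  keep="$2"
  # Glob results are sorted by name, so with sortable timestamps
  # the oldest snapshot comes first.
  set -- "$dir"/snapshot_*.db
  # If nothing matched, the pattern is left unexpanded: nothing to do.
  [ -e "$1" ] || return 0
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"
    shift
  done
}
```

For example, `prune_snapshots /var/lib/etcd 7` on the master would keep the seven most recent snapshots.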