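Rebuilding the Calico component in Kubernetes
Background: when I came in on Monday, the calico-node pods had restarted more than 100 times. The logs showed this warning: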
Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
I. Problem logs
- Frequent restarts
[root@master ~]# kubectl get pods -A -o wide
NAMESPACE          NAME                                       READY   STATUS    RESTARTS          AGE     IP               NODE     NOMINATED NODE   READINESS GATES
aliang-cka         web-5dc86dfc-t7nrb                         1/1     Running   0                 2d16h   10.244.140.72    node02   <none>           <none>
calico-apiserver   calico-apiserver-bb689689-b5v88            1/1     Running   0                 2d19h   10.244.196.131   node01   <none>           <none>
calico-apiserver   calico-apiserver-bb689689-dwlf4            1/1     Running   0                 2d19h   10.244.140.66    node02   <none>           <none>
calico-system      calico-kube-controllers-58d9bdcc64-tfqgx   1/1     Running   0                 2d23h   10.244.219.65    master   <none>           <none>
calico-system      calico-node-dr6ch                          1/1     Running   128 (64m ago)     2d23h   192.168.0.12     node01   <none>           <none>
calico-system      calico-node-lj89c                          1/1     Running   140 (2m44s ago)   2d23h   192.168.0.13     node02   <none>           <none>
calico-system      calico-node-vrz58                          1/1     Running   138 (45s ago)     2d23h   192.168.0.11     master   <none>           <none>
calico-system      calico-typha-578cfdc69-95f9b               1/1     Running   167 (2s ago)      2d23h   192.168.0.13     node02   <none>           <none>
calico-system      calico-typha-578cfdc69-zhffj               1/1     Running   121 (108m ago)    2d23h   192.168.0.12     node01   <none>           <none>
calico-system      csi-node-driver-5ntdf                      2/2     Running   0                 2d23h   10.244.219.68    master   <none>           <none>
calico-system      csi-node-driver-9psnp                      2/2     Running   0                 2d23h   10.244.140.65    node02   <none>           <none>
calico-system      csi-node-driver-fz67c                      2/2     Running   0                 2d23h   10.244.196.129   node01   <none>           <none>
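Restart-looping pods can also be picked out by filtering the RESTARTS column instead of reading the table by eye. The following is a minimal sketch, not part of the original procedure: `high_restarts` is a hypothetical helper name, the threshold of 100 is an arbitrary default, and the function assumes it is fed the output of `kubectl get pods -A`, where RESTARTS is the fifth whitespace-separated field.

```shell
# Hypothetical helper: print "namespace/pod restarts" for every pod whose
# restart count is above a threshold (default 100).
# Reads `kubectl get pods -A` output on stdin; assumes RESTARTS is field 5.
high_restarts() {
  awk -v min="${1:-100}" 'NR > 1 && $5 + 0 > min { print $1 "/" $2, $5 }'
}
```

It could be used as `kubectl get pods -A | high_restarts 100`. The header row is skipped, and the "(64m ago)" suffix is ignored because it lands in later fields.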
- calico-node Events
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Warning  Unhealthy  23m                     kubelet  Readiness probe failed: 2024-07-15 01:27:04.839 [INFO][3310] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  23m                     kubelet  Readiness probe failed: 2024-07-15 01:27:14.839 [INFO][3320] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  20m                     kubelet  Readiness probe failed: 2024-07-15 01:30:24.839 [INFO][3553] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  16m                     kubelet  Readiness probe failed: 2024-07-15 01:34:44.839 [INFO][3867] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  9m (x1666 over 2d18h)   kubelet  Liveness probe failed: Get "http://localhost:9099/liveness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  110s (x3936 over 2d18h) kubelet  (combined from similar events): Readiness probe failed: 2024-07-15 01:49:04.836 [INFO][4911] confd/health.go 180: Number of node(s) with BGP peering established = 2
II. Solution
1. Completely remove the calico-node pods and the related Calico resources.
# On the master node, delete the Calico pods, services, deployments and namespaces
kubectl delete -f tigera-operator.yaml
kubectl delete -f custom-resources.yaml

# If the commands above return errors, look for leftover Calico pods, services,
# deployments and namespaces and delete them by hand, i.e. remove everything in
# the calico-system namespace (pod names will differ in your cluster)
kubectl delete pod -n calico-system csi-node-driver-jhdvh csi-node-driver-9nmrb csi-node-driver-2w8p8 calico-node-x7spm calico-node-8z8rm calico-node-78ffv
kubectl delete deployment -n calico-system calico-typha calico-kube-controllers
kubectl delete deployment -n calico-apiserver calico-apiserver
kubectl delete svc -n calico-system calico-typha calico-kube-controllers
kubectl delete svc -n calico-apiserver calico-apiserver
kubectl delete ns calico-apiserver
kubectl delete ns calico-system

# Most likely the calico-system namespace cannot be deleted and gets stuck in Terminating
[root@master ]# kubectl get ns -A
NAME                   STATUS        AGE
calico-system          Terminating   3d1h
default                Active        3d1h
kube-node-lease        Active        3d1h
kube-public            Active        3d1h
kube-system            Active        3d1h
kubernetes-dashboard   Active        2d19h

# How to remove a namespace that is stuck:
# 1. Export the namespace definition
kubectl get ns calico-system -o json > tmp.json

# 2. Edit the exported file: delete the finalizers entry, leave everything else unchanged, and save.
....
        "resourceVersion": "624892",
        "uid": "fa96ef83-497e-4bc7-a98a-39660e90fd32"
    },
    "spec": {
        "finalizers": [        # delete this finalizers array
            "kubernetes"
        ]
    },
    "status": {
        "phase": "Active"
    }
}
....

# 3. Start a proxy in the current terminal with kubectl proxy
[root@master ]# kubectl proxy
Starting to serve on 127.0.0.1:8001

# 4. Open a second terminal and call the finalize API with curl (it produced no output here)
curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://127.0.0.1:8001/api/v1/namespaces/calico-system/finalize

# 5. List the namespaces again; calico-system is gone.
[root@master ~]# kubectl get ns -A
NAME                   STATUS   AGE
default                Active   3d1h
kube-node-lease        Active   3d1h
kube-public            Active   3d1h
kube-system            Active   3d1h
kubernetes-dashboard   Active   2d19h

# 6. On every node, empty /etc/cni/net.d/ and then restart kubelet
rm -rf /etc/cni/net.d/*
systemctl restart kubelet

# 7. The coredns pods restart and go to Pending. Calico has now been removed.
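The hand edit of tmp.json and the proxy-plus-curl call in steps 1 to 4 can be collapsed into one pipeline. The sketch below is an alternative to those steps, not the method used in this article: it assumes `jq` is installed, and `strip_finalizers` and `force_finalize_ns` are hypothetical helper names. It relies on `kubectl replace --raw`, which sends a PUT to the given API path, so no `kubectl proxy` is needed.

```shell
# Sketch: clear a stuck namespace's finalizers without hand-editing JSON.
# Assumes jq is installed; both function names are hypothetical.

# Remove the spec.finalizers entry from a namespace JSON document on stdin.
strip_finalizers() {
  jq 'del(.spec.finalizers)'
}

# Export the namespace, strip its finalizers, and PUT the result to the
# namespace's finalize subresource.
force_finalize_ns() {
  ns="$1"
  kubectl get ns "$ns" -o json \
    | strip_finalizers \
    | kubectl replace --raw "/api/v1/namespaces/${ns}/finalize" -f -
}
```

Calling `force_finalize_ns calico-system` would stand in for steps 1 to 4. Because this skips the cluster's normal cleanup, it is best reserved for a namespace that is already stuck in Terminating.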
2. Rebuild the Calico components
# 2.1 Before rebuilding, check that the clocks on all nodes are in sync; any node that is not must be synchronized first
ntpdate ntp.aliyun.com

# 2.2 Rebuild the Calico services
# Download the manifests
wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/tigera-operator.yaml
wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/custom-resources.yaml

# Edit the CIDR in custom-resources.yaml. The default is 192.168.0.0/16; change it to the pod CIDR used when the cluster was created.
# This cluster was created with 10.244.0.0/16. If your cluster's CIDR matches the one in the official manifest, no change is needed.
....
  calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16        # change this value
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
....

# Apply the Calico manifests
kubectl create -f tigera-operator.yaml
kubectl create -f custom-resources.yaml

# Wait for the pods to start. If the old images are still on the nodes the rebuild is quick; otherwise they are pulled again, which takes a while.
# Rebuild complete
[root@master calico-operator]# kubectl get pods -n calico-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-58d9bdcc64-vzm9r   1/1     Running   0          5m15s
calico-node-5p7qf                          1/1     Running   0          5m16s
calico-node-9lnmn                          1/1     Running   0          5m16s
calico-node-hpxdr                          1/1     Running   0          5m16s
calico-typha-65b4547c94-46fll              1/1     Running   0          5m8s
calico-typha-65b4547c94-qb2tx              1/1     Running   0          5m16s
csi-node-driver-jrx88                      2/2     Running   0          5m16s
csi-node-driver-kw6d6                      2/2     Running   0          5m16s
csi-node-driver-wdhk7                      2/2     Running   0          5m16s
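The CIDR edit in custom-resources.yaml can be scripted so that it is repeatable across rebuilds. This is a minimal sketch under stated assumptions: `set_pool_cidr` is a hypothetical helper name, and it assumes the manifest contains a single `cidr:` line, as the v3.25.1 file downloaded above appears to. Check the result before applying it.

```shell
# Hypothetical helper: rewrite the value of the "cidr:" line in a manifest.
# Assumes the file has exactly one "cidr:" line; indentation and any
# trailing comment on that line are preserved.
set_pool_cidr() {
  file="$1"
  cidr="$2"
  tmp="$(mktemp)" || return 1
  # Write to a temp file first, then copy back so the original file's
  # permissions are kept.
  sed "s#\(cidr:[[:space:]]*\)[0-9./]*#\1${cidr}#" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

For this cluster the call would be `set_pool_cidr custom-resources.yaml 10.244.0.0/16`, followed by `grep cidr custom-resources.yaml` to confirm the value before running `kubectl create`.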
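Copyright notice: unless otherwise stated, this is an original article by 主机测评. When reposting or copying, please credit the source with a hyperlink.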