Rebuilding the Calico Component in a Kubernetes Cluster

07-16, 1074 views

Background: on Monday morning I found that the calico-node pods had restarted more than 100 times. The logs showed this warning:


Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused

I. Problem logs

  • Frequent restarts

    [root@master ~]# kubectl get pods -A -o wide
    NAMESPACE              NAME                                         READY   STATUS    RESTARTS          AGE     IP               NODE     NOMINATED NODE   READINESS GATES
    aliang-cka             web-5dc86dfc-t7nrb                           1/1     Running   0                 2d16h   10.244.140.72    node02              
    calico-apiserver       calico-apiserver-bb689689-b5v88              1/1     Running   0                 2d19h   10.244.196.131   node01              
    calico-apiserver       calico-apiserver-bb689689-dwlf4              1/1     Running   0                 2d19h   10.244.140.66    node02              
    calico-system          calico-kube-controllers-58d9bdcc64-tfqgx     1/1     Running   0                 2d23h   10.244.219.65    master              
    calico-system          calico-node-dr6ch                            1/1     Running   128 (64m ago)     2d23h   192.168.0.12     node01              
    calico-system          calico-node-lj89c                            1/1     Running   140 (2m44s ago)   2d23h   192.168.0.13     node02              
    calico-system          calico-node-vrz58                            1/1     Running   138 (45s ago)     2d23h   192.168.0.11     master              
    calico-system          calico-typha-578cfdc69-95f9b                 1/1     Running   167 (2s ago)      2d23h   192.168.0.13     node02              
    calico-system          calico-typha-578cfdc69-zhffj                 1/1     Running   121 (108m ago)    2d23h   192.168.0.12     node01              
    calico-system          csi-node-driver-5ntdf                        2/2     Running   0                 2d23h   10.244.219.68    master              
    calico-system          csi-node-driver-9psnp                        2/2     Running   0                 2d23h   10.244.140.65    node02              
    calico-system          csi-node-driver-fz67c                        2/2     Running   0                 2d23h   10.244.196.129   node01              
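
To pick out the crash-looping pods without reading the whole table, the RESTARTS column can be filtered. This is a minimal sketch that assumes the column layout shown above (NAMESPACE first, RESTARTS fifth). The function name `high_restarts` and the threshold of 100 are illustrative and not part of kubectl.

```shell
# high_restarts THRESHOLD
# Reads `kubectl get pods -A` output on stdin and prints
# "namespace/pod restarts" for every pod restarted more than THRESHOLD times.
# Assumes the column order NAMESPACE NAME READY STATUS RESTARTS ...
high_restarts() {
  awk -v min="$1" '
    # The header row is skipped because "RESTARTS" is not a number.
    $5 ~ /^[0-9]+$/ && ($5 + 0) > (min + 0) { print $1 "/" $2, $5 }
  '
}

# Usage against a live cluster (not run here):
#   kubectl get pods -A -o wide | high_restarts 100
```

On the listing above this would report the three calico-node pods and the two calico-typha pods, which are the only ones with restart counts over 100.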
    
  • calico-node Events log

      Events:
        Type     Reason     Age   From     Message
        ----     ------     ----  ----     -------
        Warning  Unhealthy  23m   kubelet  Readiness probe failed: 2024-07-15 01:27:04.839 [INFO][3310] confd/health.go 180: Number of node(s) with BGP peering established = 2
      calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
        Warning  Unhealthy  23m  kubelet  Readiness probe failed: 2024-07-15 01:27:14.839 [INFO][3320] confd/health.go 180: Number of node(s) with BGP peering established = 2
      calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
        Warning  Unhealthy  20m  kubelet  Readiness probe failed: 2024-07-15 01:30:24.839 [INFO][3553] confd/health.go 180: Number of node(s) with BGP peering established = 2
      calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
        Warning  Unhealthy  16m  kubelet  Readiness probe failed: 2024-07-15 01:34:44.839 [INFO][3867] confd/health.go 180: Number of node(s) with BGP peering established = 2
      calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
        Warning  Unhealthy  9m (x1666 over 2d18h)    kubelet  Liveness probe failed: Get "http://localhost:9099/liveness": dial tcp [::1]:9099: connect: connection refused
        Warning  Unhealthy  110s (x3936 over 2d18h)  kubelet  (combined from similar events): Readiness probe failed: 2024-07-15 01:49:04.836 [INFO][4911] confd/health.go 180: Number of node(s) with BGP peering established = 2
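
The failures above all end the same way: the probe dialed `[::1]:9099` (the IPv6 loopback address) and the connection was refused. The sketch below tallies the dialed addresses in saved `kubectl describe pod` output, which makes it easy to confirm that every failure hits the same endpoint. The function name `dialed_addresses` is made up for this example, and the parsing assumes the exact message format shown above. The log alone does not establish why the connection was refused.

```shell
# dialed_addresses
# Reads event text on stdin and prints "address count" for every address that
# appears in a "dial tcp <address>: connect: connection refused" message.
dialed_addresses() {
  sed -nE 's/.*dial tcp ([^ ]+): connect: connection refused.*/\1/p' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage against a live cluster (not run here):
#   kubectl describe pod -n calico-system calico-node-dr6ch | dialed_addresses
```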
      

II. Solution:

  • 1. Completely remove the existing Calico installation.

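        # On the master node, delete the Calico pods, services, deployments, and namespaces.
        # Delete in the reverse of the install order: custom resources first, then the operator.
        kubectl delete -f custom-resources.yaml
        kubectl delete -f tigera-operator.yaml
        # If the commands above return errors, check for leftover Calico pods, services, deployments, and namespaces and delete them manually, that is, delete everything left in the calico-system namespace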
        kubectl delete pod -n calico-system csi-node-driver-jhdvh csi-node-driver-9nmrb csi-node-driver-2w8p8 calico-node-x7spm calico-node-8z8rm calico-node-78ffv
        kubectl delete deployment -n calico-system calico-typha calico-kube-controllers
        kubectl delete deployment -n calico-apiserver calico-apiserver

        kubectl delete svc -n calico-system calico-typha calico-kube-controllers
        kubectl delete svc -n calico-apiserver calico-apiserver

        kubectl delete ns calico-apiserver
        kubectl delete ns calico-system
         
        # Deleting the calico-system namespace will most likely hang, leaving it stuck in the Terminating state
        [root@master ]# kubectl get ns -A
        NAME                   STATUS        AGE
        calico-system          Terminating   3d1h
        default                Active        3d1h
        kube-node-lease        Active        3d1h
        kube-public            Active        3d1h
        kube-system            Active        3d1h
        kubernetes-dashboard   Active        2d19h
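        # Fix for a namespace that will not delete:
        # 1. Export the namespace definition
        kubectl get ns calico-system -o json > tmp.json
        # 2. Edit the exported file: remove the finalizers entry, leave everything else unchanged, and save.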
        ....
                "resourceVersion": "624892",
                "uid": "fa96ef83-497e-4bc7-a98a-39660e90fd32"
            },
            "spec": {
                "finalizers": [   # remove this finalizers array
                    "kubernetes"  
                ]
            },
            "status": {
                "phase": "Active"
            }
        }
        ....
        # 3. Start kubectl proxy in the current terminal
        [root@master ]# kubectl proxy
        Starting to serve on 127.0.0.1:8001
        # 4. Open a second terminal and call the namespace finalize API with curl (I got no output)
        curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://127.0.0.1:8001/api/v1/namespaces/calico-system/finalize
        # 5. List the namespaces again: calico-system is gone.
        [root@master ~]# kubectl get ns -A
        NAME                   STATUS   AGE
        default                Active   3d1h
        kube-node-lease        Active   3d1h
        kube-public            Active   3d1h
        kube-system            Active   3d1h
        kubernetes-dashboard   Active   2d19h
        # 6. On every node, empty /etc/cni/net.d/ and then restart kubelet
        rm -rf /etc/cni/net.d/*
        systemctl restart kubelet
        # 7. The coredns pods restart and go Pending (no CNI plugin is installed at this point). Calico has been removed.
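
Steps 1 and 2 above (export the namespace, then hand-edit out the finalizers) can be done in one pipeline. This is a minimal sketch, assuming kubectl's pretty-printed JSON where `"finalizers": [` and the closing `]` each sit on their own line, as in the excerpt above. The function name `strip_finalizers` is hypothetical.

```shell
# strip_finalizers
# Reads `kubectl get ns <name> -o json` on stdin and writes the same JSON with
# every finalizers array emptied. Relies on the pretty-printed layout:
# opening `"finalizers": [`, one entry per line, closing `]` on its own line.
strip_finalizers() {
  awk '
    /"finalizers": \[/ { print; skipping = 1; next }  # keep the opening line
    skipping && /\]/   { skipping = 0 }               # closing bracket ends the skip
    !skipping          { print }                      # array entries are dropped
  '
}

# Usage against a live cluster (not run here), with kubectl proxy running as in step 3:
#   kubectl get ns calico-system -o json | strip_finalizers > tmp.json
#   curl -H "Content-Type: application/json" -X PUT --data-binary @tmp.json \
#     http://127.0.0.1:8001/api/v1/namespaces/calico-system/finalize
```

Where `jq` is installed, `jq '.spec.finalizers = []'` makes the same edit without depending on the line layout.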
        
  • 2. Rebuild the Calico components

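          # 2.1 Before rebuilding, check that the clocks on all nodes are in sync; sync any node that is not
          ntpdate ntp.aliyun.com
          # 2.2 Rebuild the Calico services
          # Download the manifests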
          wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/tigera-operator.yaml
          wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/custom-resources.yaml
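          # Edit the CIDR in custom-resources.yaml. The default is 192.168.0.0/16; change it to the IP range chosen when the cluster was created.
          # This cluster was created with 10.244.0.0/16. If your cluster's range matches the default in the official manifest, no change is needed.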
          ....
          calicoNetwork:
              # Note: The ipPools section cannot be modified post-install.
              ipPools:
              - blockSize: 26
                cidr: 10.244.0.0/16  # change this value
                encapsulation: VXLANCrossSubnet
                natOutgoing: Enabled
                nodeSelector: all()
          ....
          # Apply the Calico manifests
          kubectl create -f tigera-operator.yaml
          kubectl create -f custom-resources.yaml
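          # Wait for the pods to start. If the images are still cached on the nodes this is quick; otherwise they are pulled again, which takes a while.
          # Rebuild complete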
          [root@master calico-operator]# kubectl get pods -n calico-system
          NAME                                       READY   STATUS    RESTARTS   AGE
          calico-kube-controllers-58d9bdcc64-vzm9r   1/1     Running   0          5m15s
          calico-node-5p7qf                          1/1     Running   0          5m16s
          calico-node-9lnmn                          1/1     Running   0          5m16s
          calico-node-hpxdr                          1/1     Running   0          5m16s
          calico-typha-65b4547c94-46fll              1/1     Running   0          5m8s
          calico-typha-65b4547c94-qb2tx              1/1     Running   0          5m16s
          csi-node-driver-jrx88                      2/2     Running   0          5m16s
          csi-node-driver-kw6d6                      2/2     Running   0          5m16s
          csi-node-driver-wdhk7                      2/2     Running   0          5m16s
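
The CIDR edit in step 2.2 can also be scripted instead of done by hand. This is a minimal sketch, assuming the stock custom-resources.yaml layout in which `cidr:` appears only under `ipPools`. The function name `set_pool_cidr` and the output file name in the usage comment are illustrative.

```shell
# set_pool_cidr CIDR
# Reads a manifest on stdin and writes it to stdout with the value of every
# `cidr:` key replaced by CIDR. Indentation and any trailing comment are kept.
set_pool_cidr() {
  sed -E "s|^([[:space:]]*cidr:[[:space:]]*)[^[:space:]#]+|\1$1|"
}

# Usage (not run here):
#   set_pool_cidr 10.244.0.0/16 < custom-resources.yaml > custom-resources.local.yaml
#   kubectl create -f custom-resources.local.yaml
```

Writing to a second file keeps the downloaded manifest untouched, so the change is easy to diff before applying it.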
          