From 690823fa0da282ee90b2bc138acf03d8dd61053b Mon Sep 17 00:00:00 2001
From: Komh
Date: Wed, 22 Apr 2026 15:09:40 +0800
Subject: [PATCH 1/2] [security] Expired Node Certificates Cause CSR Backlog
 and CNI Pod Crashes

---
 ...s_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md | 113 ++++++++++++++++++
 1 file changed, 113 insertions(+)
 create mode 100644 docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md

diff --git a/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md b/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md
new file mode 100644
index 00000000..8f1505ed
--- /dev/null
+++ b/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md
@@ -0,0 +1,113 @@
+---
+kind:
+  - Troubleshooting
+products:
+  - Alauda Container Platform
+ProductsVersion:
+  - 4.1.0,4.2.x
+---

## Issue

Node certificates expire and the automatic renewal process stalls, causing a cascade of failures:

- The API server rejects kubelet requests with x509 certificate errors:

```
Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid:
current time xxxx-xx-xxxx is after xxxx-xx-xxxx
```

- CNI agent pods on affected nodes enter `CrashLoopBackOff`:

```
certificate signing request csr-xxxxx is approved, waiting to be issued
failed to start the node certificate manager: certificate was not signed: context deadline exceeded
```

- Multiple Certificate Signing Requests (CSRs) remain in `Approved` state but are never issued.

## Root Cause

The kube-controller-manager's certificate signing controller stops issuing certificates even though CSRs have been approved. This creates a backlog of pending CSRs. Without valid certificates, the kubelet on affected nodes cannot authenticate to the API server, and CNI components that depend on node certificates fail to initialize.
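The stuck state is easy to spot in `kubectl get csr` output. As a sketch (the `stuck_csrs` helper name is ours, not a kubectl built-in), filter for requests whose CONDITION column reads `Approved` alone, meaning approved but never issued:

```bash
# stuck_csrs: print the names of CSRs that are Approved but not Issued.
# `kubectl get csr` renders the CONDITION column as "Pending", "Approved",
# or "Approved,Issued"; a bare "Approved" means the signer never acted.
stuck_csrs() {
  awk 'NR > 1 && $NF == "Approved" { print $1 }'
}

# Usage: kubectl get csr | stuck_csrs
```

Any name this prints is a request the approver accepted but the signing controller never fulfilled.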
## Resolution

Recover each affected node by removing the stale kubelet PKI and forcing certificate reissuance.

### Step 1: Delete Stale Kubelet PKI

SSH into the affected node and remove the PKI directory:

```bash
ssh <node-address>
sudo rm -rf /var/lib/kubelet/pki
```

### Step 2: Restart the Kubelet

```bash
sudo systemctl restart kubelet
```

The kubelet generates a new private key and submits a fresh CSR to the API server upon startup.

### Step 3: Approve Pending CSRs

From a control-plane node or workstation with cluster admin access, approve the new CSR:

```bash
kubectl get csr
kubectl certificate approve <csr-name>
```

If multiple nodes are affected, approve all pending CSRs at once:

```bash
kubectl get csr -o name | xargs -I{} kubectl certificate approve {}
```

### Step 4: Verify Recovery

Confirm the node returns to `Ready` status:

```bash
kubectl get nodes
```

Check that CNI pods on the recovered node are running:

```bash
kubectl get pods -n kube-system --field-selector spec.nodeName=<node-name>
```

## Diagnostic Steps

### Inspect API Server Certificate Errors

```bash
kubectl logs -n kube-system kube-apiserver-<node-name> --tail=200 | \
  grep "certificate has expired"
```

### List CSRs and Their Status

```bash
kubectl get csr
```

Look for CSRs stuck in `Approved` without progressing to `Issued`.

### Verify kube-controller-manager Health

```bash
kubectl get pods -n kube-system | grep kube-controller-manager
kubectl logs -n kube-system kube-controller-manager-<node-name> --tail=100
```

If the kube-controller-manager is unhealthy or showing certificate-related errors itself, troubleshoot the control-plane certificates first.
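It is also worth confirming, on the node itself, that the kubelet client certificate really is expired rather than merely mistrusted. A sketch (assumes a kubeadm-style layout and the `openssl` CLI; `cert_window` is our helper name):

```bash
# cert_window: print a certificate's validity dates and report whether it
# has already expired (`-checkend 0` fails once notAfter is in the past).
cert_window() {
  openssl x509 -noout -dates -in "$1"
  if openssl x509 -checkend 0 -noout -in "$1"; then
    echo "still valid"
  else
    echo "EXPIRED"
  fi
}

# Run as root on the affected node:
#   cert_window /var/lib/kubelet/pki/kubelet-client-current.pem
```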
### Check CNI Pod Logs

```bash
kubectl logs -n kube-system <cni-pod-name> --tail=50 | grep -i "certificate\|csr\|x509"
```

From 18be374c0d91be85b932ab3b4f654cc8f90cd360 Mon Sep 17 00:00:00 2001
From: Komh
Date: Wed, 22 Apr 2026 19:03:19 +0800
Subject: [PATCH 2/2] [security] Expired Node Certificates Cause CSR Backlog
 and CNI Pod Crashes

---
 ...s_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md | 89 +++++++++++++++----
 1 file changed, 73 insertions(+), 16 deletions(-)

diff --git a/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md b/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md
index 8f1505ed..4c628c1e 100644
--- a/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md
+++ b/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md
@@ -32,54 +32,111 @@ The kube-controller-manager's certificate signing controller stops issuing certi

## Resolution

-Recover each affected node by removing the stale kubelet PKI and forcing certificate reissuance.
+Recover each affected node by restoring a bootstrap path so kubelet can request a fresh client certificate, then approving the CSR it emits.

-### Step 1: Delete Stale Kubelet PKI
+> **Important on modern kubeadm clusters:** `/etc/kubernetes/kubelet.conf` references `/var/lib/kubelet/pki/kubelet-client-current.pem` as a **file** — not an inline-embedded cert. When that file is missing and `/etc/kubernetes/bootstrap-kubelet.conf` has already been removed (the standard post-join state), kubelet has no valid identity, cannot submit a CSR, and will crash-loop with `failed to run Kubelet: unable to load bootstrap kubeconfig`. Simply `rm`-ing the PKI dir and restarting kubelet is **not enough** — you must also hand kubelet a fresh bootstrap kubeconfig first.
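A quick way to confirm which case a given node is in (a sketch; `kubelet_identity_mode` is our name, and the function simply inspects whichever kubeconfig path it is passed):

```bash
# kubelet_identity_mode: report whether a kubeconfig's client credential is
# a file reference (rotation-managed, needs the bootstrap route when the
# file is gone) or base64 data embedded inline.
kubelet_identity_mode() {
  if grep -q 'client-certificate-data:' "$1"; then
    echo "inline-embedded"
  elif grep -q 'client-certificate:' "$1"; then
    echo "file-backed"
  else
    echo "no client certificate entry"
  fi
}

# Run as root on the affected node:
#   kubelet_identity_mode /etc/kubernetes/kubelet.conf
```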
-SSH into the affected node and remove the PKI directory:
+### Step 1: Create a Bootstrap Token
+
+On a control-plane node (or anywhere with cluster-admin kubectl), mint a bootstrap token the affected node can use to authenticate:
+
+```bash
+kubeadm token create --print-join-command
+```
+
+If `kubeadm` is not on the workstation, create the token Secret directly:
+
+```bash
+TOKEN_ID=$(head -c 3 /dev/urandom | xxd -p)
+TOKEN_SECRET=$(head -c 8 /dev/urandom | xxd -p)
+EXP=$(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%SZ)
+cat <<EOF | kubectl apply -f -
+apiVersion: v1
+kind: Secret
+metadata:
+  name: bootstrap-token-${TOKEN_ID}
+  namespace: kube-system
+type: bootstrap.kubernetes.io/token
+stringData:
+  token-id: ${TOKEN_ID}
+  token-secret: ${TOKEN_SECRET}
+  expiration: ${EXP}
+  usage-bootstrap-authentication: "true"
+  usage-bootstrap-signing: "true"
+  auth-extra-groups: system:bootstrappers:kubeadm:default-node-token
+EOF
+```
+
+### Step 2: Write a Bootstrap Kubeconfig on the Affected Node
+
+SSH into the node and build `/etc/kubernetes/bootstrap-kubelet.conf` from values already present in `kubelet.conf`:
+
 ```bash
-ssh <node-address>
-sudo rm -rf /var/lib/kubelet/pki
+CA_DATA=$(sudo grep certificate-authority-data /etc/kubernetes/kubelet.conf | awk '{print $2}')
+API_SERVER=$(sudo grep "server:" /etc/kubernetes/kubelet.conf | awk '{print $2}')
+TOKEN="<token-id>.<token-secret>"   # the token minted in Step 1
+sudo tee /etc/kubernetes/bootstrap-kubelet.conf >/dev/null <<EOF
+apiVersion: v1
+kind: Config
+clusters:
+- cluster:
+    certificate-authority-data: ${CA_DATA}
+    server: ${API_SERVER}
+  name: kubernetes
+contexts:
+- context:
+    cluster: kubernetes
+    user: kubelet-bootstrap
+  name: kubelet-bootstrap
+current-context: kubelet-bootstrap
+users:
+- name: kubelet-bootstrap
+  user:
+    token: ${TOKEN}
+EOF
+sudo chmod 600 /etc/kubernetes/bootstrap-kubelet.conf
 ```

-### Step 2: Restart the Kubelet
+### Step 3: Remove Stale PKI and Restart Kubelet
+
+With the bootstrap kubeconfig in place, kubelet has a way to authenticate when the file-backed client cert is missing:

 ```bash
+sudo rm -rf /var/lib/kubelet/pki
 sudo systemctl restart kubelet
 ```

-The kubelet generates a new private key and submits a fresh CSR to the API server upon startup.
+Kubelet now boots in bootstrap mode, generates a fresh private key, and submits a CSR signed with the bootstrap token.
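Before approving anything, you can confirm kubelet actually entered the bootstrap flow. A sketch (`csr_activity` is our helper; it just filters whatever log stream it is fed):

```bash
# csr_activity: pull bootstrap/CSR-related lines out of a kubelet log
# stream, or say so when the window contains none.
csr_activity() {
  grep -iE 'bootstrap|certificate signing request|csr' \
    || echo "no bootstrap/CSR activity in this window"
}

# On the affected node (systemd host):
#   sudo journalctl -u kubelet --since "-10m" --no-pager | csr_activity
```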
-### Step 3: Approve Pending CSRs
+### Step 4: Approve the Pending CSR

-From a control-plane node or workstation with cluster admin access, approve the new CSR:
+From the workstation with cluster-admin access:

 ```bash
 kubectl get csr
 kubectl certificate approve <csr-name>
 ```

-If multiple nodes are affected, approve all pending CSRs at once:
+For multiple affected nodes at once:

 ```bash
 kubectl get csr -o name | xargs -I{} kubectl certificate approve {}
 ```

-### Step 4: Verify Recovery
+The built-in CSR approver may also approve it automatically (via group `system:bootstrappers:kubeadm:default-node-token`), so this step is sometimes a no-op.

-Confirm the node returns to `Ready` status:
+### Step 5: Verify Recovery

-```bash
-kubectl get nodes
-```
-
-Check that CNI pods on the recovered node are running:
+Confirm the node returns to `Ready` and the CNI pods there are running:

 ```bash
+kubectl get nodes
 kubectl get pods -n kube-system --field-selector spec.nodeName=<node-name>
 ```

+After the node stabilizes, `/etc/kubernetes/bootstrap-kubelet.conf` is no longer needed — you can delete it to prevent a stale token from lingering on disk.
+
 ## Diagnostic Steps

 ### Inspect API Server Certificate Errors