From 690823fa0da282ee90b2bc138acf03d8dd61053b Mon Sep 17 00:00:00 2001
From: Komh
Date: Wed, 22 Apr 2026 15:09:40 +0800
Subject: [PATCH 1/2] [security] Expired Node Certificates Cause CSR Backlog
 and CNI Pod Crashes

---
 ...s_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md | 113 ++++++++++++++++++
 1 file changed, 113 insertions(+)
 create mode 100644 docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md

diff --git a/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md b/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md
new file mode 100644
index 00000000..8f1505ed
--- /dev/null
+++ b/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md
@@ -0,0 +1,113 @@
+---
+kind:
+  - Troubleshooting
+products:
+  - Alauda Container Platform
+ProductsVersion:
+  - 4.1.0,4.2.x
+---

## Issue

Node certificates expire and the automatic renewal process stalls, causing a cascade of failures:

- The API server rejects kubelet requests with x509 certificate errors:

```
Unable to authenticate the request" err="[x509: certificate has expired or is not yet valid:
current time xxxx-xx-xxxx is after xxxx-xx-xxxx
```

- CNI agent pods on affected nodes enter `CrashLoopBackOff`:

```
certificate signing request csr-xxxxx is approved, waiting to be issued
failed to start the node certificate manager: certificate was not signed: context deadline exceeded
```

- Multiple Certificate Signing Requests (CSRs) remain in `Approved` state but are never issued.

## Root Cause

The kube-controller-manager's certificate signing controller stops issuing certificates even though CSRs have been approved. This creates a backlog of pending CSRs. Without valid certificates, the kubelet on affected nodes cannot authenticate to the API server, and CNI components that depend on node certificates fail to initialize.
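The stuck state is easy to spot in `kubectl get csr` output. As a sketch (the `stuck_csrs` helper name is ours, not a kubectl built-in), filter for requests whose CONDITION column reads `Approved` alone, meaning approved but never issued:

```bash
# stuck_csrs: print the names of CSRs that are Approved but not Issued.
# `kubectl get csr` renders the CONDITION column as "Pending", "Approved",
# or "Approved,Issued"; a bare "Approved" means the signer never acted.
stuck_csrs() {
  awk 'NR > 1 && $NF == "Approved" { print $1 }'
}

# Usage: kubectl get csr | stuck_csrs
```

Any name this prints is a request the approver accepted but the signing controller never fulfilled.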
## Resolution

Recover each affected node by removing the stale kubelet PKI and forcing certificate reissuance.

### Step 1: Delete Stale Kubelet PKI

SSH into the affected node and remove the PKI directory:

```bash
ssh <node-address>
sudo rm -rf /var/lib/kubelet/pki
```

### Step 2: Restart the Kubelet

```bash
sudo systemctl restart kubelet
```

The kubelet generates a new private key and submits a fresh CSR to the API server upon startup.

### Step 3: Approve Pending CSRs

From a control-plane node or workstation with cluster admin access, approve the new CSR:

```bash
kubectl get csr
kubectl certificate approve <csr-name>
```

If multiple nodes are affected, approve all pending CSRs at once:

```bash
kubectl get csr -o name | xargs -I{} kubectl certificate approve {}
```

### Step 4: Verify Recovery

Confirm the node returns to `Ready` status:

```bash
kubectl get nodes
```

Check that CNI pods on the recovered node are running:

```bash
kubectl get pods -n kube-system --field-selector spec.nodeName=<node-name>
```

## Diagnostic Steps

### Inspect API Server Certificate Errors

```bash
kubectl logs -n kube-system kube-apiserver-<node-name> --tail=200 | \
  grep "certificate has expired"
```

### List CSRs and Their Status

```bash
kubectl get csr
```

Look for CSRs stuck in `Approved` without progressing to `Issued`.

### Verify kube-controller-manager Health

```bash
kubectl get pods -n kube-system | grep kube-controller-manager
kubectl logs -n kube-system kube-controller-manager-<node-name> --tail=100
```

If the kube-controller-manager is unhealthy or showing certificate-related errors itself, troubleshoot the control-plane certificates first.
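It is also worth confirming, on the node itself, that the kubelet client certificate really is expired rather than merely mistrusted. A sketch (assumes a kubeadm-style layout and the `openssl` CLI; `cert_window` is our helper name):

```bash
# cert_window: print a certificate's validity dates and report whether it
# has already expired (`-checkend 0` fails once notAfter is in the past).
cert_window() {
  openssl x509 -noout -dates -in "$1"
  if openssl x509 -checkend 0 -noout -in "$1"; then
    echo "still valid"
  else
    echo "EXPIRED"
  fi
}

# Run as root on the affected node:
#   cert_window /var/lib/kubelet/pki/kubelet-client-current.pem
```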
### Check CNI Pod Logs

```bash
kubectl logs -n kube-system <cni-pod-name> --tail=50 | grep -i "certificate\|csr\|x509"
```

From 18be374c0d91be85b932ab3b4f654cc8f90cd360 Mon Sep 17 00:00:00 2001
From: Komh
Date: Wed, 22 Apr 2026 19:03:19 +0800
Subject: [PATCH 2/2] [security] Expired Node Certificates Cause CSR Backlog
 and CNI Pod Crashes

---
 ...s_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md | 89 +++++++++++++++----
 1 file changed, 73 insertions(+), 16 deletions(-)

diff --git a/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md b/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md
index 8f1505ed..4c628c1e 100644
--- a/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md
+++ b/docs/en/solutions/Expired_Node_Certificates_Cause_CSR_Backlog_and_CNI_Pod_Crashes.md
@@ -32,54 +32,111 @@ The kube-controller-manager's certificate signing controller stops issuing certi

## Resolution

-Recover each affected node by removing the stale kubelet PKI and forcing certificate reissuance.
+Recover each affected node by restoring a bootstrap path so kubelet can request a fresh client certificate, then approving the CSR it emits.

-### Step 1: Delete Stale Kubelet PKI
+> **Important on modern kubeadm clusters:** `/etc/kubernetes/kubelet.conf` references `/var/lib/kubelet/pki/kubelet-client-current.pem` as a **file** — not an inline-embedded cert. When that file is missing and `/etc/kubernetes/bootstrap-kubelet.conf` has already been removed (the standard post-join state), kubelet has no valid identity, cannot submit a CSR, and will crash-loop with `failed to run Kubelet: unable to load bootstrap kubeconfig`. Simply `rm`-ing the PKI dir and restarting kubelet is **not enough** — you must also hand kubelet a fresh bootstrap kubeconfig first.
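A quick way to confirm which case a given node is in (a sketch; `kubelet_identity_mode` is our name, and the function simply inspects whichever kubeconfig path it is passed):

```bash
# kubelet_identity_mode: report whether a kubeconfig's client credential is
# a file reference (rotation-managed, needs the bootstrap route when the
# file is gone) or base64 data embedded inline.
kubelet_identity_mode() {
  if grep -q 'client-certificate-data:' "$1"; then
    echo "inline-embedded"
  elif grep -q 'client-certificate:' "$1"; then
    echo "file-backed"
  else
    echo "no client certificate entry"
  fi
}

# Run as root on the affected node:
#   kubelet_identity_mode /etc/kubernetes/kubelet.conf
```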
-SSH into the affected node and remove the PKI directory:
+### Step 1: Create a Bootstrap Token
+
+On a control-plane node (or anywhere with cluster-admin kubectl), mint a bootstrap token the affected node can use to authenticate:
+
+```bash
+kubeadm token create --print-join-command
+```
+
+If `kubeadm` is not on the workstation, create the token Secret directly:
+
+```bash
+TOKEN_ID=$(head -c 3 /dev/urandom | xxd -p)
+TOKEN_SECRET=$(head -c 8 /dev/urandom | xxd -p)
+EXP=$(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%SZ)
+cat <<EOF | kubectl apply -f -
+apiVersion: v1
+kind: Secret
+metadata:
+  name: bootstrap-token-${TOKEN_ID}
+  namespace: kube-system
+type: bootstrap.kubernetes.io/token
+stringData:
+  token-id: ${TOKEN_ID}
+  token-secret: ${TOKEN_SECRET}
+  expiration: ${EXP}
+  usage-bootstrap-authentication: "true"
+  usage-bootstrap-signing: "true"
+  auth-extra-groups: system:bootstrappers:kubeadm:default-node-token
+EOF
+```
+
+### Step 2: Write a Bootstrap Kubeconfig on the Affected Node
+
+SSH into the node and build `/etc/kubernetes/bootstrap-kubelet.conf` from values already present in `kubelet.conf`:
+
 ```bash
-ssh <node-address>
-sudo rm -rf /var/lib/kubelet/pki
+CA_DATA=$(sudo grep certificate-authority-data /etc/kubernetes/kubelet.conf | awk '{print $2}')
+API_SERVER=$(sudo grep "server:" /etc/kubernetes/kubelet.conf | awk '{print $2}')
+TOKEN="<token-id>.<token-secret>"   # the token minted in Step 1
+sudo tee /etc/kubernetes/bootstrap-kubelet.conf >/dev/null <<EOF
+apiVersion: v1
+kind: Config
+clusters:
+- cluster:
+    certificate-authority-data: ${CA_DATA}
+    server: ${API_SERVER}
+  name: kubernetes
+contexts:
+- context:
+    cluster: kubernetes
+    user: kubelet-bootstrap
+  name: kubelet-bootstrap
+current-context: kubelet-bootstrap
+users:
+- name: kubelet-bootstrap
+  user:
+    token: ${TOKEN}
+EOF
+sudo chmod 600 /etc/kubernetes/bootstrap-kubelet.conf
 ```

-### Step 2: Restart the Kubelet
+### Step 3: Remove Stale PKI and Restart Kubelet
+
+With the bootstrap kubeconfig in place, kubelet has a way to authenticate when the file-backed client cert is missing:

 ```bash
+sudo rm -rf /var/lib/kubelet/pki
 sudo systemctl restart kubelet
 ```

-The kubelet generates a new private key and submits a fresh CSR to the API server upon startup.
+Kubelet now boots in bootstrap mode, generates a fresh private key, and submits a CSR signed with the bootstrap token.
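Before approving anything, you can confirm kubelet actually entered the bootstrap flow. A sketch (`csr_activity` is our helper; it just filters whatever log stream it is fed):

```bash
# csr_activity: pull bootstrap/CSR-related lines out of a kubelet log
# stream, or say so when the window contains none.
csr_activity() {
  grep -iE 'bootstrap|certificate signing request|csr' \
    || echo "no bootstrap/CSR activity in this window"
}

# On the affected node (systemd host):
#   sudo journalctl -u kubelet --since "-10m" --no-pager | csr_activity
```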
-### Step 3: Approve Pending CSRs
+### Step 4: Approve the Pending CSR

-From a control-plane node or workstation with cluster admin access, approve the new CSR:
+From the workstation with cluster-admin access:

 ```bash
 kubectl get csr
 kubectl certificate approve <csr-name>
 ```

-If multiple nodes are affected, approve all pending CSRs at once:
+For multiple affected nodes at once:

 ```bash
 kubectl get csr -o name | xargs -I{} kubectl certificate approve {}
 ```

-### Step 4: Verify Recovery
+The built-in CSR approver may also approve it automatically (via group `system:bootstrappers:kubeadm:default-node-token`), so this step is sometimes a no-op.

-Confirm the node returns to `Ready` status:
+### Step 5: Verify Recovery

-```bash
-kubectl get nodes
-```
-
-Check that CNI pods on the recovered node are running:
+Confirm the node returns to `Ready` and the CNI pods there are running:

 ```bash
+kubectl get nodes
 kubectl get pods -n kube-system --field-selector spec.nodeName=<node-name>
 ```

+After the node stabilizes, `/etc/kubernetes/bootstrap-kubelet.conf` is no longer needed — you can delete it to prevent a stale token from lingering on disk.
+
 ## Diagnostic Steps

 ### Inspect API Server Certificate Errors