From 8148962157a730cde13338eaaa59a6384e947adc Mon Sep 17 00:00:00 2001
From: Komh
Date: Wed, 22 Apr 2026 23:31:46 +0800
Subject: [PATCH] [configure] Moving etcd to a Dedicated Disk on Control Plane Nodes

---
 ...a_Dedicated_Disk_on_Control_Plane_Nodes.md | 117 ++++++++++++++++++
 1 file changed, 117 insertions(+)
 create mode 100644 docs/en/solutions/Moving_etcd_to_a_Dedicated_Disk_on_Control_Plane_Nodes.md

diff --git a/docs/en/solutions/Moving_etcd_to_a_Dedicated_Disk_on_Control_Plane_Nodes.md b/docs/en/solutions/Moving_etcd_to_a_Dedicated_Disk_on_Control_Plane_Nodes.md
new file mode 100644
index 00000000..db32abd3
--- /dev/null
+++ b/docs/en/solutions/Moving_etcd_to_a_Dedicated_Disk_on_Control_Plane_Nodes.md
@@ -0,0 +1,117 @@
+---
+kind:
+  - How To
+products:
+  - Alauda Container Platform
+ProductsVersion:
+  - 4.1.0,4.2.x
+---
+## Issue
+
+etcd write latency on the control plane climbs as cluster activity grows. Symptoms include `etcdserver: request timed out` warnings, frequent leader elections, and slow `kubectl apply` round-trips. Profiling points at `fdatasync` on the WAL, and the control plane nodes use the same filesystem for etcd data, the container runtime, and OS logs. A dedicated, fast disk for `/var/lib/etcd` is the recommended fix, but the cluster is already deployed.
+
+## Root Cause
+
+etcd is very sensitive to storage latency. Its consensus protocol (Raft) relies on synchronous writes to the write-ahead log (WAL), so every WAL `fsync` is on the critical path of every write the cluster performs. When `/var/lib/etcd` shares a disk with:
+
+- the container runtime image store and pod logs (noisy-neighbour writes),
+- journald / node OS activity,
+- any workload using `emptyDir` on the same volume,
+
+the resulting I/O contention shows up as a high p99 for `etcd_disk_backend_commit_duration_seconds` and as leader changes during load spikes. Moving `/var/lib/etcd` to a dedicated, low-latency disk (ideally NVMe) isolates these writes and restores predictable performance.
+
+On immutable-OS nodes, provisioning this additional disk is a platform-level configuration change, not an ad-hoc `mount`: the node reconciler must know about the mount so that it survives reboots and image upgrades.
+
+## Resolution
+
+Plan the rollout as a **control-plane-only, one-node-at-a-time** operation. etcd must keep quorum while one member is migrated; taking two members offline at once in a typical three-member cluster loses quorum and stalls all writes.
+
+1. **Pre-flight hardware checks.** Confirm the target disk meets etcd's latency budget before touching production. A commonly used benchmark is `fio` with 8 KiB sequential writes and `fdatasync`; the p99 sync latency should be comfortably below 10 ms, ideally under 2 ms. Because the command runs through `chroot /host`, `fio` must be installed on the node itself:
+
+   ```bash
+   kubectl debug node/"${NODE:?set NODE to the node under test}" -it \
+     --image=registry.k8s.io/e2e-test-images/busybox:1.36 \
+     -- chroot /host fio --name=etcd-writelat \
+     --rw=write --ioengine=sync --fdatasync=1 --bs=8k \
+     --size=512m --numjobs=1 --runtime=60 --filename=/mnt/target-disk/test \
+     --group_reporting
+   ```
+
+2. **Prepare the disk via the platform's node-configuration surface.** Under `configure/clusters/nodes`, add a disk declaration for the control-plane pool that:
+
+   - partitions and formats the new device with `xfs` (recommended) or `ext4`,
+   - creates a systemd mount unit for `/var/lib/etcd` with `x-systemd.requires=`,
+   - applies the SELinux label `container_var_lib_t` so the etcd container can read and write.
+
+   Let ACP's node reconciler roll this change onto **one** control-plane node. Do not bypass this step by editing the node manually: on an immutable OS, direct edits are reverted on the next reconcile.
+
+3. **Drain and migrate the etcd member.** On the target node:
+
+   ```bash
+   NODE="${NODE:?set NODE to the target control-plane node}"
+   kubectl cordon "$NODE"
+   # Stop the etcd static pod by moving the manifest out of the kubelet's
+   # static-pod directory. The path must match staticPodPath in the kubelet
+   # configuration (/etc/kubernetes/manifests is the kubeadm default).
+   kubectl debug "node/$NODE" -it --image=registry.k8s.io/e2e-test-images/busybox:1.36 \
+     -- chroot /host sh -c '
+       mkdir -p /etc/kubernetes/manifests.staged
+       mv /etc/kubernetes/manifests/etcd*.yaml /etc/kubernetes/manifests.staged/'
+   ```
+
+   Wait until the etcd container on the node has exited, then copy the existing data onto the new disk **before** the mount hides the old path. The example assumes the new filesystem is temporarily mounted at `/mnt/new-etcd`; the `mount` step requires a privileged session whose mounts are visible in the host's mount namespace:
+
+   ```bash
+   kubectl debug "node/$NODE" -it --image=registry.k8s.io/e2e-test-images/busybox:1.36 \
+     -- chroot /host sh -c '
+       rsync -aHAX /var/lib/etcd/ /mnt/new-etcd/ &&
+       mount --bind /mnt/new-etcd /var/lib/etcd &&
+       ls /var/lib/etcd/'
+   ```
+
+   Return the manifest to its original path so the kubelet restarts etcd on the new mount:
+
+   ```bash
+   kubectl debug "node/$NODE" -it --image=registry.k8s.io/e2e-test-images/busybox:1.36 \
+     -- chroot /host sh -c '
+       mv /etc/kubernetes/manifests.staged/etcd*.yaml /etc/kubernetes/manifests/'
+   kubectl uncordon "$NODE"
+   ```
+
+4. **Verify quorum before the next node.** Wait for the etcd cluster to report all members healthy and for the Kubernetes control plane to settle:
+
+   ```bash
+   kubectl -n kube-system get pod -l component=etcd -o wide
+   kubectl get --raw='/readyz?verbose'  # kubectl get cs (componentstatuses) is deprecated
+   ```
+
+   Only then start the same procedure on the next control-plane node. Proceed in strict serial order.
+
+5. **Back-out plan.** If a member fails to rejoin within ~10 minutes, restore the original manifest (without the new mount) and investigate. Never restore more than one member to its old disk at a time.
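The one-node-at-a-time rule above follows from Raft quorum arithmetic: a cluster of n voting members needs floor(n/2) + 1 of them available to commit writes. A minimal sketch of that check follows; the function names `quorum_size` and `can_take_down` are illustrative helpers, not part of etcd or kubectl, and the member counts are supplied by hand rather than read from the cluster.

```bash
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names), not part of etcd or kubectl.

# quorum_size N: number of members that must stay available
# in an N-member cluster (floor(N/2) + 1).
quorum_size() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# can_take_down N HEALTHY: succeeds if one more healthy member can be
# stopped while the remaining healthy members still form a quorum.
can_take_down() {
  local members=$1 healthy=$2
  (( healthy - 1 >= $(quorum_size "$members") ))
}

# Example: three members, all healthy, so one may be migrated.
if can_take_down 3 3; then
  echo "safe to migrate one member"
fi
```

With three members the quorum is two, so if a second member is down while the first is mid-migration, the cluster can no longer commit writes. This is why step 4 waits for every member to report healthy before the next node is touched.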
+
+## Diagnostic Steps
+
+Confirm etcd is currently running on a shared filesystem (no output from the `mount | grep` filter means `/var/lib/etcd` has no dedicated mount):
+
+```bash
+for n in $(kubectl get node -l node-role.kubernetes.io/control-plane -o name); do
+  kubectl debug "$n" -it \
+    --image=registry.k8s.io/e2e-test-images/busybox:1.36 \
+    -- chroot /host sh -c 'df -h /var/lib/etcd && mount | grep /var/lib/etcd'
+done
+```
+
+Record each member's status (DB size, leader, Raft term) before and after migration as a baseline:
+
+```bash
+kubectl -n kube-system exec "etcd-${NODE:?set NODE to the node name}" -- \
+  etcdctl --endpoints=https://127.0.0.1:2379 \
+  --cert=/etc/kubernetes/pki/etcd/server.crt \
+  --key=/etc/kubernetes/pki/etcd/server.key \
+  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
+  endpoint status -w table
+```
+
+Watch the WAL fsync histogram in Prometheus: the p99 of `etcd_disk_wal_fsync_duration_seconds` should drop substantially once the dedicated disk takes over.
+
+The expected migration window per node is roughly the time needed to copy the etcd data directory (usually seconds to a few minutes for data sets of up to a few GB). If the copy stretches beyond the kubelet's static-pod grace period, temporarily raise the etcd election timeout (`--election-timeout`) on the remaining members or perform the migration during low-traffic hours.
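The p99 comparison described above can be expressed as a PromQL `histogram_quantile` over the histogram's `_bucket` series. The following is a sketch under stated assumptions: the etcd metrics are scraped by a Prometheus instance, `p99_query` is a hypothetical helper name, the 5m rate window is an arbitrary default, and `PROM_URL` stands in for whatever address that Prometheus is reachable at.

```bash
#!/usr/bin/env bash
# Build a PromQL p99 query for an etcd latency histogram.
# p99_query is a hypothetical helper: $1 is the histogram base name,
# $2 the rate window (defaults to 5m, an arbitrary choice).
p99_query() {
  local metric=$1 window=${2:-5m}
  printf 'histogram_quantile(0.99, sum by (instance, le) (rate(%s_bucket[%s])))\n' \
    "$metric" "$window"
}

# Print the queries for WAL fsync and backend commit latency.
p99_query etcd_disk_wal_fsync_duration_seconds
p99_query etcd_disk_backend_commit_duration_seconds

# To evaluate one through the Prometheus HTTP API (PROM_URL is an assumed
# address, not something defined by this document):
# curl -sG "$PROM_URL/api/v1/query" \
#   --data-urlencode "query=$(p99_query etcd_disk_wal_fsync_duration_seconds)"
```

Running the same query before and after each node's migration gives a per-instance p99 that can be compared against the 10 ms budget used in the pre-flight check.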