From 22a9dadde3eebb7f2bae8e951a94cbe360743400 Mon Sep 17 00:00:00 2001
From: Komh <mail@guojing.io>
Date: Wed, 22 Apr 2026 15:08:09 +0800
Subject: [PATCH 1/2] [configure] Backend Performance Requirements for etcd

---
 ...ckend_Performance_Requirements_for_etcd.md | 87 +++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 docs/en/solutions/Backend_Performance_Requirements_for_etcd.md
diff --git a/docs/en/solutions/Backend_Performance_Requirements_for_etcd.md b/docs/en/solutions/Backend_Performance_Requirements_for_etcd.md
new file mode 100644
index 00000000..3d1209a3
--- /dev/null
+++ b/docs/en/solutions/Backend_Performance_Requirements_for_etcd.md
@@ -0,0 +1,87 @@
+---
+kind:
+   - Troubleshooting
+products:
+   - Alauda Container Platform
+ProductsVersion:
+   - 4.1.0,4.2.x
+---
+## Issue
+
+etcd performance degrades due to insufficient storage or network backend capabilities, producing log messages similar to the following:
+
+```
+etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms)
+etcdserver: server is likely overloaded
+etcdserver: read-only range request "key:\"xxxx\"" count_only:true with result "xxxx" took too long (xxx s) to execute
+wal: sync duration of xxxx s, expected less than 1s
+```
+
+These warnings indicate the storage subsystem or network cannot keep up with etcd's latency requirements.
+
+## Root Cause
+
+etcd is highly sensitive to storage and network performance. Any bottleneck in the backend infrastructure — slow disk I/O, high network latency, packet drops, or CPU saturation — directly impacts the ability of the etcd cluster to process writes and maintain leader-heartbeat deadlines. A request should normally complete in under 50 ms; durations exceeding 200 ms trigger warnings in the logs.
+
+## Resolution
+
+### Identify the Bottleneck
+
+Three common causes of etcd slowness:
+
+1. **Slow storage** — Disk I/O latency exceeds acceptable thresholds
+2. **CPU overload** — Control-plane nodes are overcommitted
+3. **Database size growth** — The etcd data file has grown beyond optimal size
+
+### Check Storage Performance with fio
+
+Run an I/O benchmark on each control-plane node to validate disk performance:
+
+```bash
+fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
+    --rw=write --iodepth=1 --fsync=1 --runtime=30 --time_based
+```
+
+The 99th percentile fdatasync latency must be under **10 ms**.
+
+### Monitor Key etcd Metrics
+
+Use Prometheus to track the following metrics:
+
+| Metric | Threshold | Meaning |
+|---|---|---|
+| `etcd_disk_wal_fsync_duration_seconds_bucket` (p99) | < 10 ms | WAL write latency |
+| `etcd_disk_backend_commit_duration_seconds_bucket` (p99) | < 25 ms | Backend commit latency |
+| `etcd_network_peer_round_trip_time_seconds_bucket` (p99) | < 50 ms | Peer-to-peer network RTT |
+| `etcd_mvcc_db_total_size_in_bytes` | < 2 GB (default quota) | Database size |
+
+### Network Health
+
+High network latency or packet drops between etcd members destabilize the cluster. Monitor network RTT and investigate any persistent packet loss on the control-plane network interface.
+
+### Database Defragmentation
+
+If the database size approaches the quota, perform manual defragmentation:
+
+```bash
+kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \
+  --endpoints=https://127.0.0.1:2379 \
+  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
+  --cert=/etc/kubernetes/pki/etcd/server.crt \
+  --key=/etc/kubernetes/pki/etcd/server.key
+```
+
+## Diagnostic Steps
+
+Check etcd logs for latency warnings:
+
+```bash
+kubectl logs -n kube-system etcd-<node-name> --tail=100 | grep -E "took too long|heartbeat|overloaded"
+```
+
+Query etcd metrics directly via the Prometheus endpoint:
+
+```bash
+kubectl exec -n kube-system etcd-<node-name> -- wget -qO- http://127.0.0.1:2381/metrics 2>/dev/null \
+  | grep -E "etcd_disk_wal_fsync|etcd_disk_backend_commit|etcd_mvcc_db_total_size"
+```

From e0e0467d0d1eb6b20dce7ce1cf1ffeae3bd702bf Mon Sep 17 00:00:00 2001
From: Komh <mail@guojing.io>
Date: Wed, 22 Apr 2026 17:00:42 +0800
Subject: [PATCH 2/2] [configure] Backend Performance Requirements for etcd

---
 .../Backend_Performance_Requirements_for_etcd.md     | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/docs/en/solutions/Backend_Performance_Requirements_for_etcd.md b/docs/en/solutions/Backend_Performance_Requirements_for_etcd.md
index 3d1209a3..96d18cae 100644
--- a/docs/en/solutions/Backend_Performance_Requirements_for_etcd.md
+++ b/docs/en/solutions/Backend_Performance_Requirements_for_etcd.md
@@ -79,9 +79,15 @@ Check etcd logs for latency warnings:
 kubectl logs -n kube-system etcd-<node-name> --tail=100 | grep -E "took too long|heartbeat|overloaded"
 ```
 
-Query etcd metrics directly via the Prometheus endpoint:
+Query etcd metrics directly via the Prometheus endpoint. The etcd container image ships without an HTTP client on most distributions, so exec'ing `wget`/`curl` inside it is not reliable. Use `kubectl port-forward` against the pod and query from the workstation:
 
 ```bash
-kubectl exec -n kube-system etcd-<node-name> -- wget -qO- http://127.0.0.1:2381/metrics 2>/dev/null \
-  | grep -E "etcd_disk_wal_fsync|etcd_disk_backend_commit|etcd_mvcc_db_total_size"
+# Terminal 1: forward the metrics port to a local port.
+kubectl port-forward -n kube-system pod/etcd-<node-name> 12381:2381
+
+# Terminal 2: query and filter the metrics of interest.
+curl -s http://127.0.0.1:12381/metrics \
+  | grep -E "^(etcd_disk_wal_fsync|etcd_disk_backend_commit|etcd_mvcc_db_total_size)"
 ```
+
+If the cluster has Prometheus scraping etcd, the same metrics are also available via PromQL — typically the cleanest path in a production environment.