---
kind:
- Troubleshooting
products:
- Alauda Container Platform
ProductsVersion:
- 4.1.0,4.2.x
---
# Backend Performance Requirements for etcd

## Issue

etcd performance degrades due to insufficient storage or network backend capabilities, producing log messages similar to the following:

```text
etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms)
etcdserver: server is likely overloaded
etcdserver: read-only range request "key:\"xxxx\"" count_only:true with result "xxxx" took too long (xxx s) to execute
wal: sync duration of xxxx s, expected less than 1s
```
These warnings indicate the storage subsystem or network cannot keep up with etcd's latency requirements.

## Root Cause

etcd is highly sensitive to storage and network performance. Any bottleneck in the backend infrastructure — slow disk I/O, high network latency, packet drops, or CPU saturation — directly impacts the ability of the etcd cluster to process writes and maintain leader-heartbeat deadlines. A request should normally complete in under 50 ms; durations exceeding 200 ms trigger warnings in the logs.

## Resolution

### Identify the Bottleneck

Three common causes of etcd slowness:

1. **Slow storage** — Disk I/O latency exceeds acceptable thresholds
2. **CPU overload** — Control-plane nodes are overcommitted
3. **Database size growth** — The etcd data file has grown beyond optimal size
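The first two causes can be spot-checked from the command line. A minimal sketch, assuming metrics-server is installed and control-plane nodes carry the standard `node-role.kubernetes.io/control-plane` label:

```shell
# Show CPU/memory pressure on control-plane nodes (requires metrics-server).
kubectl top nodes -l node-role.kubernetes.io/control-plane

# On the node itself, a load average consistently above the core count
# suggests CPU or disk contention.
uptime
```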

### Check Storage Performance with fio

Run an I/O benchmark on each control-plane node to validate disk performance:

```bash
fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
--rw=write --iodepth=1 --fdatasync=1 --runtime=30 --time_based
```

The 99th percentile fdatasync latency must be under **10 ms**.
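To read that percentile directly, the same benchmark can be run in JSON mode and filtered with `jq`. This is a sketch assuming fio 3.x (whose JSON output exposes sync-latency percentiles under `jobs[].sync.lat_ns.percentile`) and `jq` are installed; the value is reported in nanoseconds, so anything below 10,000,000 meets the 10 ms threshold.

```shell
# Extract the 99th-percentile fdatasync latency (in nanoseconds) from fio's JSON output.
fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
    --rw=write --iodepth=1 --fdatasync=1 --runtime=30 --time_based \
    --output-format=json \
  | jq '.jobs[0].sync.lat_ns.percentile."99.000000"'
```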
### Monitor Key etcd Metrics

Use Prometheus to track the following metrics:

| Metric | Threshold | Meaning |
|---|---|---|
| `etcd_disk_wal_fsync_duration_seconds_bucket` (p99) | < 10 ms | WAL write latency |
| `etcd_disk_backend_commit_duration_seconds_bucket` (p99) | < 25 ms | Backend commit latency |
| `etcd_network_peer_round_trip_time_seconds_bucket` (p99) | < 50 ms | Peer-to-peer network RTT |
| `etcd_mvcc_db_total_size_in_bytes` | < 2 GB (default quota) | Database size |
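When Prometheus is scraping etcd, the p99 values in the table can be computed from the histogram buckets with `histogram_quantile`. A minimal sketch against Prometheus's standard HTTP query API; the Prometheus address below is a placeholder for your environment:

```shell
# Hypothetical Prometheus address; adjust to your environment.
PROM=http://prometheus.example.com:9090

# p99 WAL fsync latency over the last 5 minutes (in seconds; threshold 0.010).
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))'

# p99 backend commit latency (threshold 0.025).
curl -sG "$PROM/api/v1/query" --data-urlencode \
  'query=histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m]))'
```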

### Network Health

High network latency or packet drops between etcd members destabilize the cluster. Monitor network RTT and investigate any persistent packet loss on the control-plane network interface.
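A quick way to spot-check RTT and loss between members is a short burst of pings from one control-plane node to a peer; the peer address below is a placeholder:

```shell
# 20 probes at 200 ms intervals; the summary reports packet loss and min/avg/max RTT.
ping -c 20 -i 0.2 <peer-node-ip> | tail -2
```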

### Database Defragmentation

If the database size approaches the quota, perform manual defragmentation. Defragment **one etcd member at a time** and wait for that member to report healthy before moving on to the next; defragmenting all members concurrently risks disrupting the control plane.

```bash
kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
```
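Before and after defragmenting a member, it can help to check the database size and confirm member health with `etcdctl` (same certificate paths as above; `<node-name>` is a placeholder):

```shell
# Report DB size, leader, and raft status for this member.
kubectl exec -n kube-system etcd-<node-name> -- etcdctl endpoint status \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
--write-out=table

# Confirm the member is healthy before defragmenting the next one.
kubectl exec -n kube-system etcd-<node-name> -- etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
```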
## Diagnostic Steps

Check etcd logs for latency warnings:

```bash
kubectl logs -n kube-system etcd-<node-name> --tail=100 | grep -E "took too long|heartbeat|overloaded"
```

Query etcd metrics directly via the Prometheus endpoint. The etcd container image ships without an HTTP client on most distributions, so exec'ing `wget`/`curl` inside it is not reliable. Use `kubectl port-forward` against the pod and query from the workstation:

```bash
# Terminal 1: forward the metrics port to a local port.
kubectl port-forward -n kube-system pod/etcd-<node-name> 12381:2381

# Terminal 2: query and filter the metrics of interest.
curl -s http://127.0.0.1:12381/metrics \
| grep -E "^(etcd_disk_wal_fsync|etcd_disk_backend_commit|etcd_mvcc_db_total_size)"
```

If the cluster has Prometheus scraping etcd, the same metrics are also available via PromQL — typically the cleanest path in a production environment.