---
kind:
- How To
products:
- Alauda Container Platform
ProductsVersion:
- 4.1.0,4.2.x
---
## Overview

etcd is the source of truth for every Kubernetes object. Losing it (through disk corruption, simultaneous node failure, or accidental deletion) without a recent backup means a full cluster rebuild. A regular, automated on-host snapshot is one of the cheapest and most effective disaster-recovery primitives the platform can maintain.

The preferred mechanism on ACP is the platform's own backup surface under `configure/backup`, which orchestrates snapshots, applies retention, and stores them off-cluster. When platform-managed backup is not available (early bring-up, air-gapped labs, or when an operator wants an additional local copy), a least-privilege CronJob that runs `etcdctl snapshot save` on each control-plane node is a reasonable fallback.

## Resolution

### Preferred: Platform-Managed Backup

Use ACP's `configure/backup` page to enable control-plane backups for the cluster. Choose a schedule, a retention window, and a target storage location (an S3-compatible object store is a common choice). The platform handles:

- consistent invocation on **every** control-plane node, not just the first one a script happens to pick,
- credential management for the target store,
- retention / garbage collection,
- integration with restore tooling (which is the half of DR that people often forget to validate).

A platform-managed backup removes the need for privileged pods in user namespaces; prefer it whenever it is available.

### Fallback: Scheduled Snapshot Job

If the platform surface is not yet enabled, run a CronJob that calls `etcdctl snapshot save` on each control-plane node. Keep permissions tight: the Job needs to read etcd TLS material and write to a well-known directory on each control-plane node, and nothing else.

1. **Create a dedicated namespace and ServiceAccount.**

```bash
kubectl create namespace etcd-backup
kubectl -n etcd-backup create serviceaccount etcd-backup
```

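2. **Grant only the permissions the Job needs.** `get`/`list` on nodes lets the Job enumerate control-plane nodes, `get`/`list` on pods lets it find the etcd pods, and `create` on `pods/exec` lets it run `etcdctl snapshot save` inside them. Because the manifest below uses a ClusterRoleBinding, these rules apply in every namespace, not only `kube-system`; `pods/exec` in particular is far more than a read, so treat the ServiceAccount token as sensitive. Avoid any privilege that is not listed.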

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: etcd-backup
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: etcd-backup
subjects:
  - kind: ServiceAccount
    name: etcd-backup
    namespace: etcd-backup
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: etcd-backup
```
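Once the manifests are applied, the grants can be checked with `kubectl auth can-i` while impersonating the ServiceAccount. The sketch below is an illustration rather than part of the procedure itself: the `check_etcd_backup_rbac` function name is hypothetical, and it assumes the caller is allowed to impersonate ServiceAccounts.

```bash
# Hypothetical helper: confirm the etcd-backup ServiceAccount holds the
# permissions the CronJob relies on. Requires impersonation rights.
check_etcd_backup_rbac() {
  local sa="system:serviceaccount:etcd-backup:etcd-backup"
  local rc=0
  local check
  # Each entry is "<verb> <resource>"; pods/exec is a subresource of pods.
  for check in "list nodes" "list pods" "create pods/exec"; do
    # Word splitting of $check into verb and resource is intentional.
    if kubectl auth can-i $check --as="$sa" -n kube-system >/dev/null 2>&1; then
      echo "ok:      $check"
    else
      echo "MISSING: $check"
      rc=1
    fi
  done
  return "$rc"
}
```

Calling `check_etcd_backup_rbac` prints one line per permission and returns non-zero if any of them is missing, which makes it usable as a pre-flight check before the first scheduled run.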

3. **Schedule the snapshot.** The CronJob below runs once a day, iterates over every etcd pod in `kube-system`, takes a snapshot inside the etcd container, and deletes snapshots older than 7 days. Adjust the schedule and retention to your RPO target.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot
  namespace: etcd-backup
spec:
  # 03:07 daily, evaluated in the kube-controller-manager time zone (usually UTC)
  schedule: "7 3 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      backoffLimit: 0
      # Finished Jobs, and their logs, are removed one hour after completion
      ttlSecondsAfterFinished: 3600
      template:
        spec:
          serviceAccountName: etcd-backup
          restartPolicy: Never
          containers:
            - name: snapshot
              image: bitnami/kubectl:1.33
              command:
                - /bin/bash
                - -ec
                - |
                  set -o pipefail
                  for pod in $(kubectl -n kube-system get pod \
                      -l component=etcd \
                      -o jsonpath='{.items[*].metadata.name}'); do
                    dest="/var/lib/etcd/backup/snapshot-$(date -u +%Y%m%dT%H%M%SZ).db"
                    echo "===== $pod -> $dest"
                    # $dest expands here; \$(...) is evaluated inside the etcd container
                    kubectl -n kube-system exec "$pod" -c etcd -- sh -c "
                      mkdir -p \$(dirname $dest) &&
                      ETCDCTL_API=3 etcdctl \
                        --endpoints=https://127.0.0.1:2379 \
                        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
                        --cert=/etc/kubernetes/pki/etcd/server.crt \
                        --key=/etc/kubernetes/pki/etcd/server.key \
                        snapshot save $dest &&
                      find \$(dirname $dest) -name 'snapshot-*.db' -mtime +7 -delete
                    "
                  done
```
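The retention step relies on `find -mtime +7`, which matches files whose age in whole days is greater than 7, so a snapshot is removed once it is at least 8 days old. A minimal sketch for trying the same expression against scratch files before trusting it on a control-plane node; the `prune_snapshots` name and its arguments are illustrative, and the commented usage assumes GNU `touch -d`.

```bash
# Hypothetical helper mirroring the CronJob's retention step.
# Usage: prune_snapshots <dir> <days>
prune_snapshots() {
  local dir="$1" days="$2"
  # -mtime +N matches files whose age in whole days is strictly greater than N.
  # -print lists each file as it is deleted, which is handy in Job logs.
  find "$dir" -maxdepth 1 -type f -name 'snapshot-*.db' -mtime +"$days" -print -delete
}

# Dry run against scratch files (GNU touch assumed):
#   tmp=$(mktemp -d)
#   touch -d '10 days ago' "$tmp/snapshot-old.db"
#   touch "$tmp/snapshot-new.db"
#   prune_snapshots "$tmp" 7    # removes only snapshot-old.db
```

Restricting the match to `snapshot-*.db` at depth 1 keeps the cleanup from touching anything else that happens to live next to the backups.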

4. **Ship snapshots off-cluster.** A snapshot that only lives on the control-plane node does not survive the failure mode it was meant to cover. Pair the CronJob with a sidecar or separate Job that uploads new `snapshot-*.db` files to an object store your restore tooling can reach — `rclone`, `aws s3 cp`, or an init-container that mounts a node-local path and streams to a bucket.
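One possible shape for the upload step, sketched with `aws s3 sync`. Everything specific here is an assumption: the `upload_snapshots` function name, the `s3://example-bucket/etcd` target, and the premise that the AWS CLI is installed, has credentials, and runs somewhere the backup directory is readable (for example a pod that mounts the node path).

```bash
# Hypothetical upload step; not part of the CronJob above.
# Usage: upload_snapshots <backup-dir> <s3-uri-prefix>
upload_snapshots() {
  local dir="$1" target="${2%/}"   # strip a trailing slash from the prefix
  # sync copies only files that are new or changed at the destination.
  # Filters are applied in order: exclude everything, then re-include snapshots.
  # The hostname segment keeps snapshots from different nodes apart.
  aws s3 sync "$dir" "$target/$(hostname)/" --exclude '*' --include 'snapshot-*.db'
}

# Example (placeholder bucket):
#   upload_snapshots /var/lib/etcd/backup s3://example-bucket/etcd
```

Because `sync` is used without `--delete`, the 7-day local pruning does not remove objects from the bucket; remote retention would need its own policy, such as a bucket lifecycle rule.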

5. **Restore drills.** A backup whose restore has never been exercised is guesswork. Restore into a disposable test cluster quarterly and document the runbook. The exact restore procedure is platform-specific (it rebuilds the etcd static pod from the snapshot); reach for the platform's DR documentation rather than improvising under pressure.

## Diagnostic Steps

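Confirm the CronJob ran and left artefacts on the expected nodes. Because the Job template sets `ttlSecondsAfterFinished: 3600`, finished Jobs and their logs disappear an hour after completion, so check soon after the scheduled run: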

```bash
kubectl -n etcd-backup get jobs --sort-by=.status.startTime | tail -n 10
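# Sort explicitly so that items[-1] is the most recently created Job
kubectl -n etcd-backup logs job/$(kubectl -n etcd-backup get job --sort-by=.metadata.creationTimestamp -o jsonpath='{.items[-1].metadata.name}')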
```
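Rather than waiting for the next scheduled run, a one-off Job can be created from the CronJob's template with `kubectl create job --from`. A minimal sketch; the `run_snapshot_now` function name and the `etcd-snapshot-manual-` prefix are illustrative, and the 10-minute wait is an arbitrary choice.

```bash
# Hypothetical helper: run the CronJob's job template once, wait, show logs.
run_snapshot_now() {
  local name
  name="etcd-snapshot-manual-$(date -u +%s)"   # unique name per invocation
  kubectl -n etcd-backup create job "$name" --from=cronjob/etcd-snapshot &&
    kubectl -n etcd-backup wait --for=condition=complete --timeout=10m "job/$name" &&
    kubectl -n etcd-backup logs "job/$name"
}
```

With `backoffLimit: 0`, a failed run never reaches the `complete` condition, so the `wait` step times out instead of returning early; in that case inspect the Job's pod logs directly.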

Inspect a control-plane node for snapshot files:

```bash
NODE=<control-plane-1>
kubectl debug node/$NODE -it \
--image=registry.k8s.io/e2e-test-images/busybox:1.36 \
-- chroot /host ls -lh /var/lib/etcd/backup/ 2>/dev/null
```

Sanity-check the snapshot's integrity before relying on it:

```bash
kubectl -n kube-system exec etcd-<host> -c etcd -- \
sh -c 'ETCDCTL_API=3 etcdctl snapshot status \
/var/lib/etcd/backup/<file.db> -w table'
```

Expected output lists the snapshot hash, revision, total keys, and total size; an empty or truncated snapshot usually fails the status command outright. On etcd 3.5, `etcdctl snapshot status` is marked deprecated in favour of `etcdutl snapshot status`, so a deprecation warning in the output is expected there.

If `snapshot save` returns `context deadline exceeded`, raise the command timeout via `--dial-timeout` and `--command-timeout`; a healthy etcd should complete a snapshot in well under a minute even on clusters with several GB of data.
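
The timeout flags mentioned above are global `etcdctl` flags and go before the subcommand. A sketch of a manual snapshot with both raised; the `snapshot_with_timeouts` function name is hypothetical, the `10s` and `120s` values are illustrative rather than recommended, and the certificate paths are the same ones the CronJob uses.

```bash
# Hypothetical helper: take one snapshot with explicit etcdctl timeouts.
# Usage: snapshot_with_timeouts <etcd-pod-name> <dest-path-inside-container>
snapshot_with_timeouts() {
  local pod="$1" dest="$2"
  # $dest expands locally before the command string is sent to the container.
  kubectl -n kube-system exec "$pod" -c etcd -- sh -c "
    ETCDCTL_API=3 etcdctl \
      --endpoints=https://127.0.0.1:2379 \
      --cacert=/etc/kubernetes/pki/etcd/ca.crt \
      --cert=/etc/kubernetes/pki/etcd/server.crt \
      --key=/etc/kubernetes/pki/etcd/server.key \
      --dial-timeout=10s \
      --command-timeout=120s \
      snapshot save '$dest'
  "
}
```

If the snapshot still times out with generous values, the timeout is more likely a symptom of slow disks or an unhealthy member than something the flags can fix.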